Unrecognized file repository pointer for private dataset in ebrains #58
We tried the same command today and got a code 500 error: |
Hey, thanks for giving it a go! re #58 (comment): it looks as if this was attempted with code prior to 736f542. If that is true, then updating to the most recent dev snapshot should fix this particular issue. Please let me know.
Sadly, this looks like #36; there is no fix that I am aware of other than time. This situation typically lasts for a few days, and then the query endpoint (I assume) comes back to life. I can replicate the behavior you are seeing. The error is happening here:
where As you can see, both queries in the snippet run through However, as this is only happening occasionally, albeit still annoyingly frequent, neither Maybe you could consider bringing this up in https://github.com/HumanBrainProject/fairgraph/issues, or some ebrains support channel? |
Oh, looking at the test runs of #59 from Mar 3, it seems that the outage is already a few days long. That is the longest observed so far. |
Thank you for the pointers, I'll try using the latest version! Just to let you know, we also tried accessing the aforementioned dataset today using |
I've looked into this a bit with Oliver Schmid, the KG product owner. It seems likely that this problem originates because datalad is talking to the pre-production KG server (kg-ppd). This is the default for fairgraph (the motivation being that people should test their scripts against PPD before running against the production server), but this is not well documented, for which I apologise. The fix would be here: https://github.com/datalad/datalad-ebrains/blob/main/datalad_ebrains/fairgraph_query.py#L34
|
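A minimal sketch of the kind of change suggested above, under stated assumptions: that fairgraph's `KGClient` accepts a `host` keyword, and that `core.kg.ebrains.eu` and `core.kg-ppd.ebrains.eu` are the production and pre-production hosts. The helper name `kg_client_kwargs` is hypothetical and not part of datalad-ebrains or fairgraph.

```python
import os

# Assumed hostnames; verify them against the fairgraph documentation.
KG_PRODUCTION_HOST = "core.kg.ebrains.eu"
KG_PREPRODUCTION_HOST = "core.kg-ppd.ebrains.eu"


def kg_client_kwargs(token=None, production=True):
    """Build keyword arguments for a KG client with an explicit host.

    Hypothetical helper: it only assembles arguments, so the choice of
    server is visible instead of relying on a library default.
    """
    token = token or os.environ.get("KG_AUTH_TOKEN")
    if not token:
        raise ValueError("no token given and KG_AUTH_TOKEN is not set")
    host = KG_PRODUCTION_HOST if production else KG_PREPRODUCTION_HOST
    return {"token": token, "host": host}


# Hypothetical usage (requires fairgraph, which is not imported here):
#   from fairgraph import KGClient
#   client = KGClient(**kg_client_kwargs())
```

Passing the host explicitly keeps the pre-production server available for testing (`production=False`) while making production the deliberate default for dataset cloning.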
Oh great, thanks a lot for the investigation! |
Thx @apdavison for determining the cause! @alexisthual If you want to prep a quick PR that would be much appreciated. I have put it on my TODO otherwise. TIA! |
Unfortunately, even when I use the latest commits pushed on I tried different commands:
All 3 commands yield the same error as the one I reported in the first message of this issue:
Moreover, trying to access the link present in the error from my browser yields
which is probably normal since I didn't explicitly provide a token. |
Thanks for looking into it. I had a closer look, and the dataset's files are hosted "behind" the human data gateway. To my knowledge, there is no programmatic way to access such data directly. It involves requesting access by clicking a button on the web UI, receiving an email, and clicking a link in that email. Because of these complications, I had not attempted to check whether programmatic access is possible afterwards (also because the access permissions only last for 24h, so testing such functionality on a CI is not easily possible). I have now requested and received access to this dataset, and will have a look. |
This change is merely adding the ability to recognize and process non-public dataset data-proxy URLs. However, it is not enough to support such datasets, because the underlying `fairgraph` query to get a dataset's file listing returns no results. The query is essentially this

```py
batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)
```

and for the dataset referenced in #58 it returns an empty list with

- a properly authenticated `client`
- `dvr`: `FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...`
- `chunk_size`: 10000
- `cur_index`: 0

With the same requesting account, I can browser-visit https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d and see a file listing.
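For context, a chunked listing like the one described is typically driven by a loop of the following shape. This is only a sketch: `list_page` is a hypothetical callable standing in for the `omcore.File.list(...)` call, and the stopping rule (empty or short batch) is an assumption, not necessarily what datalad-ebrains does.

```python
def iter_files(list_page, chunk_size=10000):
    """Yield items from a paginated listing until it is exhausted.

    `list_page(size=..., from_index=...)` is a hypothetical stand-in for a
    query such as `omcore.File.list(client, file_repository=dvr, ...)`.
    """
    cur_index = 0
    while True:
        batch = list_page(size=chunk_size, from_index=cur_index)
        if not batch:
            # An empty first batch means an empty listing overall,
            # which matches the symptom reported for this dataset.
            break
        yield from batch
        if len(batch) < chunk_size:
            # A short batch is taken to be the last page.
            break
        cur_index += len(batch)
```

With a loop like this, an empty first batch ends the iteration immediately, so a query that returns nothing produces a dataset without files rather than an error.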
I have posted #61 with a code sketch and my findings. After rectifying superficial issues in If you happened to have any insight in this, please let me know. Thx! |
Thanks for looking into this Michael! |
AFAIK |
Hello, I'm not sure how Full code to reproduce the data fetching described above:

    import siibra
    from siibra.retrieval.repositories import EbrainsHdgConnector

    siibra.fetch_ebrains_token()

    dataset_id = "07ab1665-73b0-40c5-800e-557bc319109d"  # The ID is the last part of the url
    conn = EbrainsHdgConnector(dataset_id)
    conn.search_files()

    data_file = "resulting_smooth_maps/sub-01/ses-14/sub-01_ses-14_task-MTTWE_dir-ap_space-MNI152NLin2009cAsym_desc-preproc_ZMap-we_all_event_response.nii.gz"
    img = conn.get(data_file)

|
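Since the dataset ID is described as the last part of the URL, it can be derived rather than copied by hand. A small sketch, using only the standard library; the function name `dataset_id_from_url` is hypothetical.

```python
import uuid
from urllib.parse import urlparse


def dataset_id_from_url(url):
    """Return the trailing path segment of `url`, validated as a UUID."""
    last = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # uuid.UUID raises ValueError for anything that is not a UUID,
    # e.g. a bucket name with a "d-" prefix.
    return str(uuid.UUID(last))
```

For example, `dataset_id_from_url("https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d")` gives the ID used in the snippet above.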
@ymzayek Thanks for the code snippet. That is very helpful. We should be able to reuse the auth-setup in I am not sure whether a non-public data-proxy bucket link always implies the human data gateway, but until we discover counter-evidence, this may be good enough. |
So looking at https://github.com/FZJ-INM1-BDA/siibra-python/blob/908f118f87ec83def2970d9a526f29f49482e2bc/siibra/retrieval/repositories.py#L354-L449 I see that Now I am wondering: We could do the same thing. Moreover, doing it not only for non-public datasets, like the example here, but for any data-proxy accessible dataset may actually solve #52. If that is true, it would boost overall performance by quite a bit! |
@mih, @ymzayek and I are interested to look more into this but it feels a bit hard to dive into this codebase on our own. |
@alexisthual @ymzayek That would be wonderful. We have a regular zoom call for such things on Tuesdays at 8:30 CET. If this would work for you, that would be the easiest, and @dickscheid would also be in that call. |
Nice! 8:30 am might be a bit early (the office is rather far haha) but I think I can try and make it next Tuesday 🙂 |
I think I should be able to make it for next Tuesday as well. |
Awesome! Apologies for the timing. This is pretty much 11am if-there-would-be-nothing-stupid-to-do o'clock. Please shoot me an email at [email protected], and I will send you a zoom link. Thx for your interest! |
#61 has progressed a bit with today's meeting, but is not yet in a functional state. @dickscheid pointed out that the HDG documentation should have all the missing information. It might require a dedicated implementation of a downloader. This should be fairly straightforward with the |
Hi @mih and @dickscheid ! |
Hi! We've (@man-shu @ferponce @bthirion) tried using We did not try to integrate these changes in |
Hi @mih! |
I had the chance to work on this again. #61 refactors the code to allow for interacting with the data proxy API directly. Moreover, it switches access for publicly hosted datasets that are accessible via the DP to use that API too. I could not get the authentication flow for private dataset access via the HDG to work, neither in code nor with https://data-proxy.ebrains.eu/api/docs

I use my EBRAINS token to authenticate. When I POST to

    {
      "status_code": 401,
      "detail": "User not authenticated properly. The token needs to access the 'roles', 'email', 'team' and 'profile' scope."
    }

The corresponding GET request fails (as expected) with

    {
      "status_code": 401,
      "detail": "Access has expired, please request access again",
      "can_request_access": true
    }

This makes me think that either the EBRAINS session token is the wrong credential here, or that my particular account is insufficient, or I am missing a crucial step in the authorization flow. @alexisthual if you can get a file listing of a HDG dataset via https://data-proxy.ebrains.eu/api/docs please let me know how, and I am confident that I can achieve the rest. |
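One way to narrow down the 401 about scopes is to inspect what the token actually carries. The sketch below assumes the EBRAINS token is a JWT whose payload has a space-separated `scope` claim (common for OpenID Connect providers, but not confirmed in this thread). It decodes the payload without verifying the signature, so it is for debugging only; `missing_scopes` is a hypothetical helper.

```python
import base64
import json

# Scopes named in the 401 response quoted above.
REQUIRED_SCOPES = {"roles", "email", "team", "profile"}


def missing_scopes(token, required=REQUIRED_SCOPES):
    """Return the required scopes absent from the token's `scope` claim.

    Assumes a JWT of the form header.payload.signature and does NOT
    verify the signature.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    granted = set(payload.get("scope", "").split())
    return set(required) - granted
```

An empty result would suggest the scopes are not the problem and that the access request step of the HDG flow is what is missing; a non-empty result would point at how the token was obtained.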
Not sure this is helpful but I also tried this. From the browser I can access this private dataset https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d (authorized through login and email link). Then using that same token:

    curl -X 'POST' \
      'https://data-proxy.ebrains.eu/api/v1/datasets/07ab1665-73b0-40c5-800e-557bc319109d' \
      -H 'accept: application/json' \
      -H "Authorization: Bearer $TOKEN" \
      -d ''

I get

    {
      "error": "Error accessing userinfo"
    }

And I get the same response with a GET request. |
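The same request can be prepared from Python, which makes it easy to check the URL and headers before anything is sent. This is a sketch using only the standard library; the endpoint path is taken from the curl command above, and `build_dataset_request` is a hypothetical helper. It also sidesteps shell quoting, since a shell expands `$TOKEN` only inside double quotes, not single quotes.

```python
import urllib.request

DATA_PROXY_API = "https://data-proxy.ebrains.eu/api/v1"


def build_dataset_request(dataset_id, token, method="GET"):
    """Prepare (but do not send) a data-proxy dataset request."""
    return urllib.request.Request(
        f"{DATA_PROXY_API}/datasets/{dataset_id}",
        method=method,
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# To send it: urllib.request.urlopen(build_dataset_request(...)).
# An HTTP error status (such as the 401 above) raises urllib.error.HTTPError.
```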
Hi!
First thank you for the nice extension 😊
We (@bthirion @ferponcem @man-shu @ymzayek) are interested in downloading this dataset from ebrains: https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d
Although we authenticated with

    export KG_AUTH_TOKEN=`datalad ebrains-authenticate`

we still could not get the following command to work:

    datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test

The traceback is the following:
Maybe we're missing something here. Happy to contribute to the docs if someone can help us find a solution to this!
Thanks