Querying FileRepository for files is unreasonably slow #57
I think this time is determined mainly by the KG Query API, not by fairgraph itself, and is probably related to the very large number of Files in the KG (large compared to the number of dataset versions). I'll ask the KG team to look into it.
I did some profiling of the following script, for a repository with 4 files, and it certainly seems that all the time is spent in the query call itself (for reference, the query id used is https://kg.ebrains.eu/api/instances/dcc56635-48a9-4328-b51a-5f2ad1d5af1e).
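For reference, profiling like the above can be reproduced with a small standard-library helper (a sketch: `profile_call` is a hypothetical wrapper; the function passed in would be whatever listing call you want to measure):

```python
import cProfile
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn once under cProfile and print the 10 most expensive calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
    return result
```

Sorting by cumulative time makes it easy to see whether the time is spent inside the HTTP request or in client-side processing.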
I am currently working with https://search.kg.ebrains.eu/instances/900a1c2d-4914-42d5-a316-5472afca0d90 and I am using the following implementation to iterate over the files in a file repository:

```python
def iter_files(self, dvr, chunk_size=100):
    cur_index = 0
    while True:
        batch = omcore.File.list(
            self.client,
            file_repository=dvr,
            size=chunk_size,
            from_index=cur_index)
        for f in batch:
            yield f
        if len(batch) < chunk_size:
            # there is no point in asking for another batch
            return
        cur_index += len(batch)
```

It takes about 8 minutes to get all 1678 file records in this dataset. My choice of `chunk_size` is arbitrary. Are there any empirical data on what is a good trade-off? Thx!
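The pagination logic above can be checked offline against a stub backend (a sketch: `fake_fetch` is a stand-in for the `omcore.File.list` call, and the record count mirrors the dataset discussed in this thread):

```python
def iter_records(fetch_batch, chunk_size=100):
    """Yield records batch by batch, stopping when a short batch
    signals that the server has no more results."""
    cur_index = 0
    while True:
        batch = fetch_batch(size=chunk_size, from_index=cur_index)
        yield from batch
        if len(batch) < chunk_size:
            # a short (or empty) batch means we have reached the end
            return
        cur_index += len(batch)

# Stub standing in for the KG call: 1678 fake records,
# matching the dataset size reported above.
RECORDS = list(range(1678))

def fake_fetch(size, from_index):
    return RECORDS[from_index:from_index + size]
```

With `chunk_size=100` this issues 17 round trips for 1678 records; since each round trip carries a fixed latency cost, that is where the total time adds up.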
The default size (also in fairgraph) is 100 items. However, each batch of 100 file records imposes a ~30 s latency cost. For datasets with thousands of files, this quickly turns into hours. I have asked for feedback on acceptable chunk sizes at HumanBrainProject/fairgraph#57 (comment). In the meantime, let's go with 10k. This reduces the processing time for the Jülich Brain Atlas v3.0 (1678 files) from 8 min to 30 s.
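A quick back-of-the-envelope check of that trade-off, assuming the ~30 s per-batch latency reported above (illustrative numbers, not new measurements):

```python
import math

def est_seconds(n_files, chunk_size, secs_per_batch=30):
    """Rough total latency: one fixed per-batch cost per round trip."""
    return math.ceil(n_files / chunk_size) * secs_per_batch

# 1678 files in chunks of 100  -> 17 batches -> 510 s (~8.5 min)
# 1678 files in one chunk of 10000 -> 1 batch -> 30 s
```

This matches the reported improvement from roughly 8 minutes down to 30 seconds.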
The problem with this is that the generic query produced by fairgraph is not optimal for this case (please note that files are more difficult to handle because there are so many of them). A more efficient way of filtering the files for a specific file repository is to start at the file repository level and filter by instance id: running such a query with the reported instanceId "3a31dd9f-d12b-44e0-90cf-c131e9be580b" returns the 4 files involved in ~50 ms.
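As a rough sketch of calling a stored query directly with such a filter (the endpoint path reuses the query id mentioned earlier in this thread, and the `instanceId` parameter name is an assumption inferred from this comment, not a documented API):

```python
from urllib.parse import urlencode

# Assumed: the stored query referenced earlier in this thread.
QUERY_URL = "https://kg.ebrains.eu/api/instances/dcc56635-48a9-4328-b51a-5f2ad1d5af1e"

def build_query_url(instance_id, size=100):
    """Build a query URL filtered to a single file-repository instance."""
    return f"{QUERY_URL}?{urlencode({'instanceId': instance_id, 'size': size})}"

# The resulting URL would then be fetched with an
# "Authorization: Bearer <token>" header to get the matching files.
```

Starting from the repository instance lets the server restrict the search before touching the very large set of File records.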
I have finally found time to implement queries that cross links in the graph, so you can now query the files of a repository directly. This gives the same 1678 files in a couple of seconds or less in fairgraph 0.11.0.
Here is what I do in a nutshell:

This takes 20+ seconds to get a single file record, reliably (median 25 s). For comparison, querying for all the dataset versions of a particular dataset takes less than a second over the same connection.

Within the ranges I tested, the value of `limit` has no impact on the latency. Even when moving `from_index` to a value larger than the number of files in the repository (to get an empty result list), it takes the same time. Is there a faster way to get a file listing? Maybe some kind of iterator?
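Latency figures like the ones above can be collected with a small timing wrapper (a sketch: the call being timed would be the slow listing query; `timed` and `median_latency` are hypothetical helpers, not part of fairgraph):

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

def median_latency(fn, repeats=5):
    """Median elapsed time over several calls, to smooth out jitter."""
    return statistics.median(timed(fn)[1] for _ in range(repeats))
```

Reporting the median rather than a single run is what makes a claim like "median 25 s" robust against one-off network spikes.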