Querying FileRepository for files is unreasonably slow #57
I think this time is determined mainly by the KG Query API, not by fairgraph itself, and is probably related to the very large number of Files in the KG (large compared to the number of dataset versions). I'll ask the KG team to look into it.
I did some profiling of the following script, for a repository with 4 files, and it certainly seems that all the time is spent in the query call itself (for reference, the query id used is https://kg.ebrains.eu/api/instances/dcc56635-48a9-4328-b51a-5f2ad1d5af1e).
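For reference, profiling like the above can be reproduced with a small standard-library helper (a sketch: `profile_call` is a hypothetical wrapper; the function passed in would be whatever listing call you want to measure):

```python
import cProfile
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn once under cProfile and print the 10 most expensive calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
    return result
```

Sorting by cumulative time makes it easy to see whether the time is spent inside the HTTP request or in client-side processing.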
I am currently working with https://search.kg.ebrains.eu/instances/900a1c2d-4914-42d5-a316-5472afca0d90 and I am using the following implementation to iterate over the files in a file repository:

```python
def iter_files(self, dvr, chunk_size=100):
    cur_index = 0
    while True:
        batch = omcore.File.list(
            self.client,
            file_repository=dvr,
            size=chunk_size,
            from_index=cur_index)
        for f in batch:
            yield f
        if len(batch) < chunk_size:
            # there is no point in asking for another batch
            return
        cur_index += len(batch)
```

It takes about 8 minutes to get all 1678 file records in this dataset. My choice of `chunk_size` is arbitrary. Are there any empirical data on what is a good trade-off? Thx!
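The pagination logic above can be checked offline against a stub backend (a sketch: `fake_fetch` is a stand-in for the `omcore.File.list` call, and the record count mirrors the dataset discussed in this thread):

```python
def iter_records(fetch_batch, chunk_size=100):
    """Yield records batch by batch, stopping when a short batch
    signals that the server has no more results."""
    cur_index = 0
    while True:
        batch = fetch_batch(size=chunk_size, from_index=cur_index)
        yield from batch
        if len(batch) < chunk_size:
            # a short (or empty) batch means we have reached the end
            return
        cur_index += len(batch)

# Stub standing in for the KG call: 1678 fake records,
# matching the dataset size reported above.
RECORDS = list(range(1678))

def fake_fetch(size, from_index):
    return RECORDS[from_index:from_index + size]
```

With `chunk_size=100` this issues 17 round trips for 1678 records; since each round trip carries a fixed latency cost, that is where the total time adds up.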
The default size (also in fairgraph) is 100 items. However, each batch of 100 file records imposes a ~30 s latency cost. For datasets with thousands of files, this quickly turns into hours. I have asked for feedback on acceptable chunk sizes at HumanBrainProject/fairgraph#57 (comment). In the meantime, let's go with 10k. This reduces the processing time for the Jülich Brain Atlas v3.0 (1678 files) from 8 min to 30 s.
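A quick back-of-the-envelope check of that trade-off, assuming the ~30 s per-batch latency reported above (illustrative numbers, not new measurements):

```python
import math

def est_seconds(n_files, chunk_size, secs_per_batch=30):
    """Rough total latency: one fixed per-batch cost per round trip."""
    return math.ceil(n_files / chunk_size) * secs_per_batch

# 1678 files in chunks of 100  -> 17 batches -> 510 s (~8.5 min)
# 1678 files in one chunk of 10000 -> 1 batch -> 30 s
```

This matches the reported improvement from roughly 8 minutes down to 30 seconds.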
The problem with this is that the generic query produced by fairgraph is not optimal for this case (please note that files are more difficult to handle because there are so many of them). A more efficient way of filtering the files for a specific file repository is to start at the file repository level and filter by instance id: running such a query with the reported instanceId "3a31dd9f-d12b-44e0-90cf-c131e9be580b" returns the 4 files involved in ~50 ms.
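As a rough sketch of calling a stored query directly with such a filter (the endpoint path reuses the query id mentioned earlier in this thread, and the `instanceId` parameter name is an assumption inferred from this comment, not a documented API):

```python
from urllib.parse import urlencode

# Assumed: the stored query referenced earlier in this thread.
QUERY_URL = "https://kg.ebrains.eu/api/instances/dcc56635-48a9-4328-b51a-5f2ad1d5af1e"

def build_query_url(instance_id, size=100):
    """Build a query URL filtered to a single file-repository instance."""
    return f"{QUERY_URL}?{urlencode({'instanceId': instance_id, 'size': size})}"

# The resulting URL would then be fetched with an
# "Authorization: Bearer <token>" header to get the matching files.
```

Starting from the repository instance lets the server restrict the search before touching the very large set of File records.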
I have finally found time to implement queries that cross links in the graph, so you can now query the files of a repository directly. This gives the same 1678 files in a couple of seconds or less in fairgraph 0.11.0.
Here is what I do in a nutshell:

This takes 20+ seconds to get a single file record, reliably (median 25 s). For comparison, querying for all the dataset versions of a particular dataset takes less than a second over the same connection.

Within the ranges I tested, the value of `limit` has no impact on the latency. Even when moving `from_index` to a value larger than the number of files in the repository (to get an empty result list), it takes the same time. Is there a faster way to get a file listing? Maybe some kind of iterator?
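Latency figures like the ones above can be collected with a small timing wrapper (a sketch: the call being timed would be the slow listing query; `timed` and `median_latency` are hypothetical helpers, not part of fairgraph):

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

def median_latency(fn, repeats=5):
    """Median elapsed time over several calls, to smooth out jitter."""
    return statistics.median(timed(fn)[1] for _ in range(repeats))
```

Reporting the median rather than a single run is what makes a claim like "median 25 s" robust against one-off network spikes.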