Querying FileRepository for files is unreasonably slow #57

Open
mih opened this issue Dec 2, 2022 · 5 comments

@mih

mih commented Dec 2, 2022

Here is what I do, in a nutshell:

omcore.File.list(fq.client, file_repository=dvr, limit=1, from_index=0)

This reliably takes 20+ seconds to fetch a single file record (median 25 s). For comparison, querying for all dataset versions of a particular dataset takes less than a second over the same connection.

Within the ranges I tested, the value of limit has no impact on the latency. Even when from_index is set to a value larger than the number of files in the repository (so that the result list is empty), the call takes the same time.

Is there a faster way to get a file listing? Maybe some kind of iterator?
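
For anyone who wants to reproduce the measurement, here is a minimal timing sketch. `time_call` is a hypothetical helper, not part of fairgraph; it accepts any zero-argument callable, so the `File.list` call quoted above could be wrapped in a `lambda`. The stand-in workload below only sleeps, so the numbers it prints say nothing about the KG itself.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Return the median wall-clock duration (in seconds) of calling func()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in for the real request. To time the actual call, pass e.g.
# lambda: omcore.File.list(fq.client, file_repository=dvr, limit=1, from_index=0)
median_s = time_call(lambda: time.sleep(0.01), repeats=3)
print(f"median latency: {median_s:.3f} s")
```

Repeating the call and reporting the median, rather than a single run, keeps one unusually slow or fast request from dominating the result.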

@apdavison
Member

I think this time is determined mainly by the KG Query API, not by fairgraph itself, and is probably related to the very large number of Files in the KG (large compared to the number of dataset versions). I'll ask the KG team to look into it.

@apdavison
Member

I did some profiling of the following script, for a repository with 4 files:

from fairgraph import KGClient
import fairgraph.openminds.core as omcore


client = KGClient(host="core.kg.ebrains.eu")

dvr_id = "https://kg.ebrains.eu/api/instances/3a31dd9f-d12b-44e0-90cf-c131e9be580b"

files = omcore.File.list(client, file_repository=dvr_id, limit=1, from_index=0)

print(len(files))

and it certainly seems all the time is spent in kg_core/kg.py:429(execute_query_by_id)

(for reference, the query id used is https://kg.ebrains.eu/api/instances/dcc56635-48a9-4328-b51a-5f2ad1d5af1e)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    488/1    0.000    0.000   23.646   23.646 {built-in method builtins.exec}
        1    0.000    0.000   23.645   23.645 query_files.py:1(<module>)
        3    0.000    0.000   23.385    7.795 /Users/adavison/dev/data/env/lib/python3.9/site-packages/requests/api.py:14(request)
        3    0.000    0.000   23.384    7.795 /Users/adavison/dev/data/env/lib/python3.9/site-packages/requests/sessions.py:500(request)
        3    0.000    0.000   23.374    7.791 /Users/adavison/dev/data/env/lib/python3.9/site-packages/requests/sessions.py:671(send)
        3    0.000    0.000   23.374    7.791 /Users/adavison/dev/data/env/lib/python3.9/site-packages/requests/adapters.py:436(send)
        3    0.000    0.000   23.372    7.791 /Users/adavison/dev/data/env/lib/python3.9/site-packages/urllib3/connectionpool.py:522(urlopen)
        3    0.000    0.000   23.371    7.790 /Users/adavison/dev/data/env/lib/python3.9/site-packages/urllib3/connectionpool.py:361(_make_request)
        1    0.000    0.000   23.321   23.321 /Users/adavison/dev/data/fairgraph/fairgraph/base_v3.py:506(list)
        2    0.000    0.000   23.315   11.658 /Users/adavison/dev/data/env/lib/python3.9/site-packages/kg_core/__communication.py:170(_get)
        2    0.000    0.000   23.315   11.658 /Users/adavison/dev/data/env/lib/python3.9/site-packages/kg_core/__communication.py:130(_request)
        2    0.000    0.000   23.315   11.658 /Users/adavison/dev/data/env/lib/python3.9/site-packages/kg_core/__communication.py:139(_do_request)
        3    0.000    0.000   23.212    7.737 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py:1305(getresponse)
        3    0.000    0.000   23.212    7.737 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py:309(begin)
       77    0.000    0.000   23.210    0.301 {method 'readline' of '_io.BufferedReader' objects}
        6    0.000    0.000   23.210    3.868 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py:690(readinto)
        3    0.000    0.000   23.210    7.737 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py:276(_read_status)
        6    0.000    0.000   23.210    3.868 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/ssl.py:1230(recv_into)
        6    0.000    0.000   23.210    3.868 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/ssl.py:1090(read)
        6   23.210    3.868   23.210    3.868 {method 'read' of '_ssl._SSLSocket' objects}
        1    0.000    0.000   23.175   23.175 /Users/adavison/dev/data/fairgraph/fairgraph/client_v3.py:116(query)
        1    0.000    0.000   23.175   23.175 /Users/adavison/dev/data/fairgraph/fairgraph/client_v3.py:118(_query)
        1    0.000    0.000   23.175   23.175 /Users/adavison/dev/data/env/lib/python3.9/site-packages/kg_core/kg.py:429(execute_query_by_id)
    587/2    0.001    0.000    0.249    0.125 <frozen importlib._bootstrap>:1002(_find_and_load)
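
A table like the one above can be produced with Python's built-in `cProfile` and `pstats` modules. The thread does not say exactly how the profile was collected, so the following is only a sketch of one way to do it; `slow_request` is a hypothetical stand-in for the network-bound `File.list` call.

```python
import cProfile
import io
import pstats
import time


def slow_request():
    """Stand-in for the network-bound File.list() call."""
    time.sleep(0.05)


profiler = cProfile.Profile()
profiler.enable()
slow_request()
profiler.disable()

# Sort by cumulative time, as in the table above, and keep the top 10 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time surfaces the call chain that leads to the blocking operation; in the table above, that chain ends in the SSL socket read, i.e. waiting for the server's response.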

@mih
Author

mih commented Feb 6, 2023

I am currently working with https://search.kg.ebrains.eu/instances/900a1c2d-4914-42d5-a316-5472afca0d90

I am using the following implementation to iterate over files in a file repository:

    def iter_files(self, dvr, chunk_size=100):
        cur_index = 0
        while True:
            batch = omcore.File.list(
                self.client,
                file_repository=dvr,
                size=chunk_size,
                from_index=cur_index)
            for f in batch:
                yield f
            if len(batch) < chunk_size:
                # there is no point in asking for another batch
                return
            cur_index += len(batch)

It takes about 8 minutes to get all 1678 file records in this dataset. My choice of chunk_size is arbitrary. Are there any "good" or acceptable maximum values for it? If I set it to, e.g., 10k, the query finishes within the standard 30 s latency, even though more information has to be retrieved at once.

Are there any empirical data on what makes a good trade-off?
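
As a rough back-of-the-envelope model of this trade-off, assume every request costs a fixed latency of about 30 s regardless of how many records it returns (the figure reported in this thread; transfer time per record is ignored). `estimated_total_seconds` is a hypothetical helper written for this illustration.

```python
def estimated_total_seconds(n_files, chunk_size, latency_per_request=30.0):
    """Estimate listing time when every request costs a fixed latency."""
    # iter_files stops at the first batch shorter than chunk_size, so it
    # issues n_files // chunk_size full batches plus one final request
    # (which is empty when n_files is an exact multiple of chunk_size).
    n_requests = n_files // chunk_size + 1
    return n_requests * latency_per_request


for chunk_size in (100, 1000, 10_000):
    seconds = estimated_total_seconds(1678, chunk_size)
    print(f"chunk_size={chunk_size:>6}: ~{seconds / 60:.1f} min")
```

For 1678 files this gives 17 requests (about 8.5 minutes) at a chunk size of 100 and a single request (about 30 s) at 10k, in line with the timings reported above. Under this model, any chunk size at least as large as the repository is optimal; the model does not capture server-side limits or response-size costs.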

Thx!

mih added a commit to datalad/datalad-ebrains that referenced this issue Feb 6, 2023
The default size (also in fairgraph) is 100 items. However, each batch
of 100 file records imposes a 30s latency cost. For datasets with
thousands of files, this quickly turns into hours.

I have asked for feedback on acceptable chunk sizes here:
HumanBrainProject/fairgraph#57 (comment)

In the meantime, let's go with 10k. This reduces the processing time of
the Jülich Brain Atlas v3.0 from 8min to 30s (1678 files).
@olinux

olinux commented Jun 6, 2023

The problem is that the generic query produced by fairgraph is not optimal for this case (note that files are harder to handle because there are so many of them in the KG). A more efficient way to filter the files of a specific file repository is to start at the file repository level and filter by instance id.

Running the following query with the reported instanceId "3a31dd9f-d12b-44e0-90cf-c131e9be580b" returns the 4 files involved in ~50 ms:

{
  "@context": {
    "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
    "query": "https://schema.hbp.eu/myQuery/",
    "propertyName": {
      "@id": "propertyName",
      "@type": "@id"
    },
    "path": {
      "@id": "path",
      "@type": "@id"
    }
  },
  "meta": {
    "type": "https://openminds.ebrains.eu/core/FileRepository",
    "responseVocab": "https://schema.hbp.eu/myQuery/"
  },
  "structure": {
    "propertyName": "query:files",
    "path": {
      "@id": "https://openminds.ebrains.eu/vocab/fileRepository",
      "reverse": true
    },
    "structure": [
      {
        "propertyName": "query:name",
        "path": "https://openminds.ebrains.eu/vocab/name"
      },
      {
        "propertyName": "query:id",
        "path": "@id"
      }
    ]
  }
}
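
To make the shape of this query easier to reuse, here is a sketch that assembles the same document as a Python dict. `build_reverse_link_query` and its parameters are hypothetical names introduced for illustration; they are not part of fairgraph or the KG SDK. How the query is submitted, and how the instanceId filter is passed, is not shown here.

```python
import json

OPENMINDS_VOCAB = "https://openminds.ebrains.eu/vocab/"
RESPONSE_VOCAB = "https://schema.hbp.eu/myQuery/"


def build_reverse_link_query(root_type, reverse_property, result_name, fields):
    """Build a query that starts at root_type and follows reverse_property
    backwards to the instances that point at it (hypothetical helper)."""
    leaf_fields = [
        {
            "propertyName": f"query:{name}",
            # "@id" is a JSON-LD keyword rather than an openMINDS property.
            "path": "@id" if name == "id" else OPENMINDS_VOCAB + name,
        }
        for name in fields
    ]
    return {
        "@context": {
            "@vocab": "https://core.kg.ebrains.eu/vocab/query/",
            "query": RESPONSE_VOCAB,
            "propertyName": {"@id": "propertyName", "@type": "@id"},
            "path": {"@id": "path", "@type": "@id"},
        },
        "meta": {"type": root_type, "responseVocab": RESPONSE_VOCAB},
        "structure": {
            "propertyName": f"query:{result_name}",
            # "reverse": True walks the link from its target back to its source.
            "path": {"@id": OPENMINDS_VOCAB + reverse_property, "reverse": True},
            "structure": leaf_fields,
        },
    }


query = build_reverse_link_query(
    root_type="https://openminds.ebrains.eu/core/FileRepository",
    reverse_property="fileRepository",
    result_name="files",
    fields=["name", "id"],
)
print(json.dumps(query, indent=2))
```

The key point is the `"reverse": true` path: the query is rooted at the single FileRepository instance and walks the `fileRepository` link backwards, instead of scanning all File instances and filtering them.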

@apdavison
Member

I have finally found time to implement queries that cross links in the graph, so you can now do this:

from fairgraph import KGClient
import fairgraph.openminds.core as omcore

omcore.set_error_handling(None)
client = KGClient(host="core.kg.ebrains.eu")

dv_id = "900a1c2d-4914-42d5-a316-5472afca0d90"
dv = omcore.DatasetVersion.from_id(dv_id, client, follow_links={"repository": {"files": {}}})
len(dv.repository.files)

which gives 1678 files.

This takes a couple of seconds or less in fairgraph 0.11.0.
