eLabFTW integration via Galaxy file source #19319

Open
wants to merge 11 commits into
base: dev

Conversation

kysrpex
Contributor

@kysrpex kysrpex commented Dec 12, 2024

Closes #18665, requires #19154 and #19256 (and also includes them, sorry; to review have a look at c00d250). Implements a file source to integrate Galaxy with eLabFTW.

eLabFTW revolves around the concepts of experiment and resource. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo at demo.elabftw.net. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance by entering its URL and an API key.

File sources reference files via a URI, while eLabFTW identifies them with auto-incrementing positive integers (see #18665 for details). A mapping between those identifiers and Galaxy URIs is therefore needed.

Those take the form elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id, where:

  • entity_type is either 'experiments' or 'resources'
  • entity_id is the id (an integer in string form) of an experiment or resource
  • attachment_id is the id (an integer in string form) of an attachment
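
A minimal sketch of how such a URI could be split into its parts, assuming the layout described above. The names `ElabftwLocation` and `parse_elabftw_uri` are hypothetical and not taken from this PR; the actual implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

ENTITY_TYPES = ("experiments", "resources")


@dataclass(frozen=True)
class ElabftwLocation:
    """Hypothetical container for the parts of an elabftw:// URI."""

    host: str
    entity_type: Optional[str] = None
    entity_id: Optional[int] = None
    attachment_id: Optional[int] = None


def parse_elabftw_uri(uri: str) -> ElabftwLocation:
    """Split elabftw://host/entity_type/entity_id/attachment_id into its parts.

    Trailing components may be missing (e.g. when pointing at an entity type).
    """
    parsed = urlparse(uri)
    if parsed.scheme != "elabftw":
        raise ValueError(f"not an elabftw URI: {uri!r}")
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) > 3:
        raise ValueError(f"too many path components: {uri!r}")
    entity_type = parts[0] if parts else None
    if entity_type is not None and entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # int() raises ValueError for ids that are not integers in string form
    entity_id = int(parts[1]) if len(parts) > 1 else None
    attachment_id = int(parts[2]) if len(parts) > 2 else None
    return ElabftwLocation(parsed.netloc, entity_type, entity_id, attachment_id)
```

For example, `parse_elabftw_uri("elabftw://demo.elabftw.net/experiments/12/34")` yields the host, the entity type `experiments`, entity id 12 and attachment id 34.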

This implementation uses both the aiohttp and requests libraries to communicate with eLabFTW via its REST API. A significant limitation is that the API has no endpoint that lists the attachments of several experiments and/or resources in a single request. Listing the root directory or an entity type recursively therefore requires fetching the list of entities first, then sending a separate request for each entity to retrieve its attachments. Sending these requests concurrently with aiohttp makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments, but actually solving the problem would require changes to the API on the eLabFTW side.
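
The one-request-per-entity pattern can be sketched with plain `asyncio`, assuming a coroutine that returns the attachments of a single entity. This is only an illustration: `list_attachments_concurrently`, `fake_fetch` and the `max_concurrency` limit are hypothetical, and `fake_fetch` stands in for what would be an aiohttp request in the PR.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def list_attachments_concurrently(
    entity_ids: List[int],
    fetch_attachments: Callable[[int], Awaitable[List[str]]],
    max_concurrency: int = 20,
) -> Dict[int, List[str]]:
    """Send one request per entity, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(entity_id: int) -> List[str]:
        async with semaphore:
            return await fetch_attachments(entity_id)

    # gather() returns results in the same order as the awaitables passed in
    results = await asyncio.gather(*(fetch_one(eid) for eid in entity_ids))
    return dict(zip(entity_ids, results))


async def fake_fetch(entity_id: int) -> List[str]:
    """Stand-in for a network request (assumption: real code would call the REST API)."""
    await asyncio.sleep(0.01)
    return [f"attachment-{entity_id}-1"]


if __name__ == "__main__":
    print(asyncio.run(list_attachments_concurrently([1, 2, 3], fake_fetch)))
```

The total time is then dominated by the slowest batch of requests rather than by the sum of all of them, but the number of requests still grows linearly with the number of entities.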

This is the third and last PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. Run eLabFTW, for example using Docker Compose.
    2. Copy the configuration samples from file_sources_conf.yml.sample and user_preferences_extra_conf.yml.sample to your own configuration files.
    3. Create an account and generate an API Key on eLabFTW.
    4. Configure the endpoint and API Key on the user preferences page.
    5. Create an experiment or resource in eLabFTW and attach a file.
    6. Import the file to your history from the eLabFTW file source.
    7. Export your history to the eLabFTW file source.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

The method `write_from()` of `SingleFileSource` and `BaseFilesSource` reads a local file from `native_path` and saves it to `target_path` on a file source.

This commit allows the service backing the file source to choose which will be the path of the saved file, meaning that `target_path` and the actual path where the file can be recovered later do not have to match. The latter is the return value of `write_from()`.
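
A toy illustration of that contract, assuming a backing service that assigns its own integer identifiers. `IdAssigningFileSource` is hypothetical and its `write_from()` signature is simplified compared to the one in Galaxy.

```python
import shutil
from pathlib import Path


class IdAssigningFileSource:
    """Toy stand-in for a service that picks the identifier of each saved file."""

    def __init__(self, storage_dir: str) -> None:
        self._storage_dir = Path(storage_dir)
        self._next_id = 1

    def write_from(self, target_path: str, native_path: str) -> str:
        """Save the local file at native_path; return the path chosen by the service.

        The file name requested in target_path is ignored here: the service
        replaces it with an auto-incrementing id, so callers must use the
        return value to find the file again.
        """
        parent = target_path.rsplit("/", 1)[0]
        assigned_id = self._next_id
        self._next_id += 1
        shutil.copyfile(native_path, self._storage_dir / str(assigned_id))
        return f"{parent}/{assigned_id}"
```

A caller would keep the returned path, for example `actual_path = source.write_from("/experiments/12/report.txt", local_file)`, instead of assuming the file ends up at `target_path`.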

Therefore, all usages of `write_from()` have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object.

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.

Define a new class `FileSourceModelExportStore` that abstracts the common details of `BcoModelExportStore`, `ROCrateArchiveModelExportStore`, `TarModelExportStore` and `BagArchiveModelExportStore`.

This new class manages exports to file sources, from where data can be retrieved later on via a URI. It takes the responsibility of creating a temporary directory to set up the file to export and uploading it to the file source.
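
The responsibility described above could look roughly like the following sketch. It is not the `FileSourceModelExportStore` code: the function name and both callables are hypothetical, and the only assumption carried over from the text is that the upload step returns the path chosen by the file source.

```python
import tempfile
from pathlib import Path
from typing import Callable


def export_via_temporary_directory(
    build_export: Callable[[Path], Path],
    write_from: Callable[[str, str], str],
    target_path: str,
) -> str:
    """Build the export file in a temporary directory, then upload it.

    build_export receives the temporary directory and returns the file it
    created there; write_from uploads that file and returns the path that
    the file source actually assigned to it.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        export_file = build_export(Path(tmp_dir))
        # Upload before leaving the block: the directory is deleted on exit
        actual_path = write_from(target_path, str(export_file))
    return actual_path
```

Keeping the temporary directory handling in one place means each export format only has to describe how to build its file.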

The method `list()` from `galaxy.files.sources.BaseFilesSource` lists the directories and files within a file source. An optional keyword argument `recursive` (`False` by default) lets it recursively retrieve directories and files within a specific directory.

This operation is very cheap in terms of CPU and expensive in IO terms, be it network or filesystem IO. Depending on how the underlying system is built, it may support retrieving directories and files recursively or not. If it does not, then every time a directory is listed, it is necessary to make another request to list each subdirectory. This may end up involving hundreds of requests. Done sequentially, this can be extremely slow, especially if each one involves network access.

This commit makes the `list()` method asynchronous, which enables Galaxy to wait for the underlying system to complete the requests concurrently, resulting in a massive speedup. The price to pay is the extra complexity of using the async primitives.

Since this change implies that all functions in the chain up to the API endpoints and the test functions must also be made asynchronous, this commit also takes care of it.
@kysrpex
Contributor Author

kysrpex commented Dec 12, 2024

I can add automated tests eventually, but I would rather ask for your feedback first.

eLabFTW [1] revolves around the concepts of experiment [2] and resource [3]. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo [4]. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance by entering its URL and an API key.

File sources reference files via a URI, while eLabFTW identifies them with auto-incrementing positive integers (see galaxyproject#18665 [5] for details). A mapping between those identifiers and Galaxy URIs is therefore needed.

Those take the form `elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id`, where:
- `entity_type` is either 'experiments' or 'resources'
- `entity_id` is the id (an integer in string form) of an experiment or resource
- `attachment_id` is the id (an integer in string form) of an attachment

This implementation uses both the `aiohttp` and `requests` libraries to communicate with eLabFTW via its REST API [6]. A significant limitation is that the API has no endpoint that lists the attachments of several experiments and/or resources in a single request. Listing the root directory or an entity type _recursively_ therefore requires fetching the list of entities first, then sending a separate request _for each one_ of them to retrieve its attachments. Sending these requests concurrently with `aiohttp` makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments, but actually solving the problem would require changes to the API on the eLabFTW side.

References:
- [1] https://www.elabftw.net/
- [2] https://doc.elabftw.net/user-guide.html#experiments
- [3] https://doc.elabftw.net/user-guide.html#resources
- [4] https://demo.elabftw.net
- [5] galaxyproject#18665
- [6] https://doc.elabftw.net/api/v2
Contributor

@davelopez davelopez left a comment

As I mentioned in person, the implementation looks amazing, especially because of the care and attention to detail. Thank you very much!

I also think that the recursive option that has motivated many of these changes (and increased complexity) is a use case that we should probably not support. Being able to recursively list or select a potentially huge number of experiments without strict pagination will likely be unusable or cause more trouble than it is worth... 😓

@@ -296,7 +299,7 @@ def get_uri_root(self) -> str:
"""Return a prefix for the root (e.g. gxfiles://prefix/)."""

@abc.abstractmethod
- def list(
+ async def list(
Member

Please don't change this for existing interfaces. You can add a <operation>_async method where needed. We can't do blocking IO in async calls, but every plugin apart from yours will make blocking calls.
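
One way to read this suggestion, as a sketch only: the blocking `list()` stays untouched and an optional `list_async()` is added next to it. The class names below are hypothetical, and the default implementation that runs the blocking call in an executor is an assumption, not something proposed in this thread.

```python
import abc
import asyncio
from typing import List


class FilesSourceSketch(abc.ABC):
    """Hypothetical, heavily simplified stand-in for a file source interface."""

    @abc.abstractmethod
    def list(self, path: str = "/", recursive: bool = False) -> List[str]:
        """Blocking listing; existing plugins keep implementing this unchanged."""

    async def list_async(self, path: str = "/", recursive: bool = False) -> List[str]:
        """Assumed default: run the blocking call in a thread pool so that
        the event loop is never blocked by plugins that do blocking IO."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.list, path, recursive)


class BlockingSource(FilesSourceSketch):
    """A plugin that only knows how to do blocking IO."""

    def list(self, path: str = "/", recursive: bool = False) -> List[str]:
        return ["/a", "/b"]


class NativeAsyncSource(FilesSourceSketch):
    """A plugin that overrides list_async() with a truly asynchronous version."""

    async def list_async(self, path: str = "/", recursive: bool = False) -> List[str]:
        await asyncio.sleep(0)  # placeholder for concurrent network requests
        return ["/x"]

    def list(self, path: str = "/", recursive: bool = False) -> List[str]:
        # Only valid when no event loop is already running in this thread
        return asyncio.run(self.list_async(path, recursive))
```

With this shape, callers that are already asynchronous can await `list_async()` on any plugin, while synchronous callers and existing plugins are unaffected.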

Development

Successfully merging this pull request may close these issues.

eLabFTW file source for Galaxy
3 participants