Support reading remote zarrs via authenticated HTTP calls #9

Open
tcompa opened this issue Nov 4, 2024 · 5 comments

tcompa commented Nov 4, 2024

The main goal of this exploratory issue is to read remote zarrs over HTTP when the HTTP calls require some authentication/authorization. I would postpone thinking about write operations, especially because I cannot say whether that is a relevant use case (would someone really operate over HTTP, apart from reading existing datasets?).


The simplest example I can come up with is inspired by, e.g., zarr-developers/zarr-python#1568, zarr-developers/zarr-python#993, and pangeo-forge/pangeo-forge-recipes#222 (the pangeo one offers an interesting view into more integrated use cases of Globus).

Starting from fsspec.implementations.http.HTTPFileSystem, we can include a client_kwargs argument, which is then passed to the underlying aiohttp.ClientSession. An example from the fsspec docs is:

client_kwargs = {'auth': aiohttp.BasicAuth('user', 'pass')}
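
Other authentication schemes can go through the same argument. The following is a minimal sketch, not part of the fsspec docs: the helper names `basic_auth_header`, `bearer_client_kwargs` and `open_remote_store` are made up for illustration, and the token is a placeholder. It relies on `aiohttp.ClientSession` accepting a `headers` argument, and shows the standard HTTP Basic header that a user/password pair turns into.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the standard HTTP Basic `Authorization` header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def bearer_client_kwargs(token: str) -> dict:
    """Build client_kwargs that send a bearer token with every request."""
    return {"headers": {"Authorization": f"Bearer {token}"}}


def open_remote_store(url: str, client_kwargs: dict):
    """Hypothetical helper: map a remote zarr URL to a store-like mapping."""
    # Imported lazily so the two helpers above work without fsspec installed.
    import fsspec.implementations.http

    fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs=client_kwargs)
    return fs.get_mapper(url)


print(basic_auth_header("user", "pass"))
print(bearer_client_kwargs("my-secret-token"))  # placeholder token
```

Whether a given server expects Basic credentials, a bearer token, or something else depends entirely on that server, so this only illustrates where the parameters would go.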

To use HTTPFileSystem for a zarr array, either via zarr-python or dask.array, we can proceed as follows:

import fsspec.implementations.http
import zarr
import dask.array as da


fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/0"
)
store = fs.get_mapper(url)

array_zarr = zarr.open_array(store)
print(f"{array_zarr=}")

array_dask = da.from_zarr(store)
print(f"{array_dask=}")

with output

array_zarr=<zarr.core.Array (3, 1, 2160, 5120) uint16>

/somewhere/venv/lib/python3.10/site-packages/zarr/creation.py:614: UserWarning: ignoring keyword argument 'read_only'
  compressor, fill_value = _kwargs_compat(compressor, fill_value, kwargs)

array_dask=dask.array<from-zarr, shape=(3, 1, 2160, 5120), dtype=uint16, chunksize=(1, 1, 2160, 2560), chunktype=numpy.ndarray>

Given such a minimal example, the question is whether this could fit anywhere in ngio. To phrase it differently: is it worthwhile for ngio to integrate fsspec? I do not know ngio well enough to answer.


Next steps, in my understanding:

  1. Understand how much ngio is (or can be) integrated with fsspec, and how costly it would be to have an abstraction that propagates user-provided kwargs to the fsspec store.
  2. If we want to proceed, set up a simple test server that serves zarrs over HTTP with some authentication required (see upcoming issue).
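
For the second step, a test server could be as small as the sketch below. It uses only the Python standard library, serves a local directory, and rejects requests that lack the expected HTTP Basic header. All names (`BasicAuthHandler`, `start_server`, `status_for`, `run_demo`) and the `user`/`pass` credentials are placeholders chosen for this illustration; it is meant for local testing only, not as a hardened server.

```python
import base64
import functools
import http.server
import tempfile
import threading
import urllib.error
import urllib.request
from pathlib import Path

# Placeholder credentials, for local testing only.
EXPECTED_AUTH = "Basic " + base64.b64encode(b"user:pass").decode("ascii")

# Opener that ignores any proxy settings, so localhost requests stay local.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class BasicAuthHandler(http.server.SimpleHTTPRequestHandler):
    """Serve files from a directory, but only to authenticated requests."""

    def _authorized(self) -> bool:
        return self.headers.get("Authorization") == EXPECTED_AUTH

    def _reject(self) -> None:
        self.send_response(401)
        self.send_header("WWW-Authenticate", 'Basic realm="zarr-test"')
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        if self._authorized():
            super().do_GET()
        else:
            self._reject()

    def do_HEAD(self):
        if self._authorized():
            super().do_HEAD()
        else:
            self._reject()

    def log_message(self, format, *args):
        pass  # keep the output quiet


def start_server(directory: str) -> http.server.ThreadingHTTPServer:
    """Start the server on a free local port, in a background thread."""
    handler = functools.partial(BasicAuthHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def status_for(url: str, headers=None) -> int:
    """Return the HTTP status code of a GET request to url."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with _OPENER.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_demo() -> tuple[int, int]:
    """Return (status without auth, status with auth) for a served file."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, ".zgroup").write_text('{"zarr_format": 2}')
        server = start_server(tmp)
        try:
            url = f"http://127.0.0.1:{server.server_address[1]}/.zgroup"
            return status_for(url), status_for(url, {"Authorization": EXPECTED_AUTH})
        finally:
            server.shutdown()
            server.server_close()


print(run_demo())
```

Pointing `start_server` at a directory that contains a zarr hierarchy would give an HTTPFileSystem-based client something to authenticate against.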

lorenzocerrone commented Nov 5, 2024

Hi @tcompa,

Thanks for the example. I have played a bit with it, and it seems that fully supporting fsspec stores in ngio is not going to be too hard:

It took only a few minor changes #10

from ngio import NgffImage
import matplotlib.pyplot as plt

import fsspec.implementations.http

fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)


ngff_image = NgffImage(store)
print(f"list of images: {ngff_image.levels_paths}")
image = ngff_image.get_image(path="2")

print(f'list labels: {ngff_image.label.list()}')
nuclei = ngff_image.label.get_label("nuclei")

print(f"nuclei: {nuclei.get_array(mode='dask').shape}")
print(f"image: {image.get_array(mode='dask').shape}")

Should produce:

list of images: ['0', '1', '2', '3', '4']
list labels: ['nuclei']
nuclei: (1, 540, 1280)
image: (3, 1, 540, 1280)

The only part where support would require a bit more work is the tables.
The reason is that I rely on the zarr.Group.groups method to validate the coherency between metadata and disk.
This does not work on remote storage, so I must change my approach.

Concretely, if I try:

import fsspec.implementations.http
import zarr
import dask.array as da

fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)

group_zarr = zarr.open_group(store)
print("list of subgroups:", list(group_zarr.groups()))
print("list of arrays:", list(group_zarr.arrays()))

I won't find the subgroups and sub-arrays.

list of subgroups: []
list of arrays: []

I can see two strategies:

  1. rely on metadata only (simple)
  2. rely on metadata only for remote stores, and keep validating coherency for on-disk stores
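
A minimal sketch of what "rely on metadata only" could look like, assuming a zarr v2 layout and the OME-NGFF `multiscales` and `labels` attributes. This is not ngio's actual code: the function names are hypothetical, and the store is any mapping from keys to JSON bytes (a plain dict here, standing in for an fsspec mapper).

```python
import json
from collections.abc import Mapping


def read_attrs(store: Mapping, group_path: str = "") -> dict:
    """Load a zarr v2 group's `.zattrs` from any mapping-like store."""
    key = f"{group_path.strip('/')}/.zattrs".lstrip("/")
    return json.loads(store[key])


def list_levels(store: Mapping) -> list[str]:
    """Resolution-level paths declared in the first `multiscales` entry."""
    multiscales = read_attrs(store)["multiscales"]
    return [dataset["path"] for dataset in multiscales[0]["datasets"]]


def list_labels(store: Mapping) -> list[str]:
    """Label names declared in `labels/.zattrs`, or [] if there are none."""
    try:
        return list(read_attrs(store, "labels")["labels"])
    except KeyError:
        return []


# Toy in-memory store; only the keys used above are filled in.
store = {
    ".zattrs": json.dumps(
        {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}
    ).encode(),
    "labels/.zattrs": json.dumps({"labels": ["nuclei"]}).encode(),
}

print(list_levels(store))  # ['0', '1']
print(list_labels(store))  # ['nuclei']
```

Each lookup is a single key read, so no directory listing is needed; the trade-off is that nothing checks whether the declared arrays actually exist in the store.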

tcompa commented Nov 5, 2024

> It took only a few minor changes #10

This is very encouraging!

Without any deep knowledge of ngio from my side, it seems nice to be able to "inject" some fsspec native object without ngio knowing too much about fsspec itself.

tcompa commented Nov 5, 2024

> The only part where support would require a bit more work is the tables.
> The reason is that I rely on the zarr.Group.groups method to validate the coherency between metadata and disk.
> This does not work on remote storage, so I must change my approach.

If you look at https://zarr.readthedocs.io/en/stable/_modules/zarr/hierarchy.html#Group.groups, you'll see that there is a different behavior for zarr v2 and v3.


Just to make sure: is this another instance of zarr-developers/zarr-python#1568?

In that case, there is no obvious way out - see these quotes from that issue:

> Since HTTP can't really do file listing (except a few special cases derived from FTP), it can only be used with datasets that have consolidated metadata. Without [consolidated] metadata, zarr needs listing to know what arrays are contained in a group.

> does this store have a consolidated metadata object (.zmetadata) at its root? Without it, it won't be able to list members.


To rephrase it as a more concrete comment:

  • Understanding this groups issue probably requires understanding what listdir(store, path) does - at least for a couple of stores (the str one and the http one).
  • There may exist a third strategy that is "only support this feature for zarr v3 groups with consolidated metadata" (fully TBD)
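
To illustrate why consolidated metadata removes the need for listing, here is a small sketch that derives the direct children of a group from the content of a `.zmetadata` document alone. It assumes the zarr v2 consolidated layout (a top-level `metadata` dict keyed by paths ending in `.zgroup`, `.zarray` or `.zattrs`); the function name `list_children` is hypothetical and the array metadata is trimmed down for brevity.

```python
import json


def list_children(consolidated: dict, group_path: str = "") -> tuple[list[str], list[str]]:
    """Return (subgroups, arrays) found directly under group_path."""
    prefix = group_path.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    groups, arrays = set(), set()
    for key in consolidated["metadata"]:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        # A direct child looks like "<name>/.zgroup" or "<name>/.zarray".
        if len(parts) == 2 and parts[1] == ".zgroup":
            groups.add(parts[0])
        elif len(parts) == 2 and parts[1] == ".zarray":
            arrays.add(parts[0])
    return sorted(groups), sorted(arrays)


# Simplified stand-in for a real .zmetadata file (real .zarray entries
# also carry shape, chunks, dtype, and so on).
zmetadata = json.loads("""
{
  "zarr_consolidated_format": 1,
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    "0/.zarray": {"zarr_format": 2},
    "1/.zarray": {"zarr_format": 2},
    "labels/.zgroup": {"zarr_format": 2},
    "labels/nuclei/.zgroup": {"zarr_format": 2},
    "labels/nuclei/0/.zarray": {"zarr_format": 2}
  }
}
""")

print(list_children(zmetadata))            # (['labels'], ['0', '1'])
print(list_children(zmetadata, "labels"))  # (['nuclei'], [])
```

A single GET of `.zmetadata` is enough to answer these questions, which is what makes it workable over plain HTTP.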

lorenzocerrone commented

Thanks for the resources!

I did not know about consolidated metadata, but it is a great way to group all metadata in a single place. We should have ngio consolidate the metadata every time we create a new element in the Zarr hierarchy. This would make parsing the metadata of large plates much more efficient.

I think, for now, it's ok just to avoid relying on Zarr internals to discover groups and arrays. This logic will be heavily refactored when we switch to v3 anyway.

I have only a small additional question: should ngio be agnostic to auth?
I can foresee two cases:

  • ngio is instantiated with an authenticated store (like in the example) and knows nothing about auth.
  • ngio deals with the authentication internally

tcompa commented Nov 6, 2024

> I have only a small additional question: should ngio be agnostic to auth? I can foresee two cases:
>
> * ngio is instantiated with an authenticated store (like in the example) and knows nothing about auth.
> * ngio deals with the authentication internally

In my opinion, at first I would stick with option 1 (ngio knows nothing about authentication, but it can use an arbitrary fsspec store).

The complex part of option 2, in my view, would be the following:
In order to set the authentication parameters from within ngio, you'd need to implement logic to decide which fsspec model to use (HTTPFileSystem? other?). If I remember correctly, this is also done in other libraries (zarr and/or dask), meaning there would be a lot of room for either redundant or conflicting logic. It's much easier if ngio can be agnostic and just send the "store" (either a simple path/URL or a full-fledged fsspec object) to the loader.
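
As a rough sketch of the agnostic approach (option 1), a reader can accept either a local path or any mapping-like store and never look at how that store was configured. The names `StoreLike`, `read_json` and `demo` are hypothetical and not ngio's API; a plain dict stands in for a pre-authenticated fsspec mapper, which is also a mapping.

```python
import json
import os
import tempfile
from collections.abc import Mapping
from pathlib import Path
from typing import Union

StoreLike = Union[str, os.PathLike, Mapping]


def read_json(store: StoreLike, key: str) -> dict:
    """Read a JSON document from a local directory or a mapping-like store."""
    if isinstance(store, Mapping):
        # Covers plain dicts and, e.g., the object returned by fs.get_mapper(url).
        # Any auth settings live inside the store; this function never sees them.
        raw = store[key]
    else:
        raw = (Path(store) / key).read_bytes()
    return json.loads(raw)


def demo() -> tuple[dict, dict]:
    """Read the same key from an in-memory store and from a temp directory."""
    in_memory = {".zgroup": b'{"zarr_format": 2}'}
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / ".zgroup").write_text('{"zarr_format": 2}')
        return read_json(in_memory, ".zgroup"), read_json(tmp, ".zgroup")


print(demo())
```

With this shape, the choice of filesystem class and its credentials stays on the caller's side, which is the point made above about avoiding redundant or conflicting logic.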


To put the question in a broader context: where will it be relevant for ngio to use specific fsspec objects (e.g. the HTTPFileSystem one)? This question is independent of the specific case of auth-related additional parameters, as there could be other configuration parameters as well.

Relevant use cases:

  1. Loading data over HTTP, for a viewer plugin.
  2. Writing data over HTTP? I don't think this will ever be relevant.
  3. Accessing data over s3:
  4. Any other?

Understanding these use cases better would help you decide whether it's relevant for ngio to integrate the creation of fsspec objects.
