
File downloads #176

Open
maximlt opened this issue Nov 20, 2024 · 5 comments

@maximlt

maximlt commented Nov 20, 2024

anaconda-project allows specifying a list of files to download in its configuration file, with additional features like optional unzip, optional hash verification, optional file renaming, etc. See https://anaconda-project.readthedocs.io/en/latest/user-guide/reference.html#file-downloads for more info.
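For reference, a `downloads` entry in `anaconda-project.yml` looks roughly like this (the URL, filename, and hash below are hypothetical placeholders, not from a real project):

```yaml
# Sketch of an anaconda-project downloads section.
downloads:
  EARTHQUAKES:                        # exposed to commands as an env var
    url: https://example.com/earthquakes.csv.zip
    sha256: 0123abcd...               # optional integrity check
    filename: data/earthquakes.csv    # optional rename/relocation
    unzip: true                       # optional unzip after download
```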

We use this feature a fair bit on the HoloViz Examples Gallery (50% of the projects) https://github.com/holoviz-topics/examples/.

Is this feature in the scope of conda-project?

@jbednar

jbednar commented Nov 20, 2024

Please see the discussion at holoviz-topics/examples#444 for more context. I personally think it's important to support datasets that aren't practical to include directly in a project archive. That said, as someone who primarily maintains publicly accessible projects, my perspective may put me outside the mainstream of enterprise users, who work primarily with proprietary data and will encounter thorny issues with authentication and access control that scientific, academic, and library-author users like me rarely face.

So I'd say the questions for conda-project are:

  1. Is it fair to say that conda-project is primarily focused on "data science" projects or "data" projects or "science" projects (as opposed to projects like hatch, uv, etc. that seem more about encapsulating code rather than code + data)? (My vote: yes)
  2. If so, is it fair to assume that any data needed will be small enough that it's practical to include it directly in a project archive? (My vote: no)
  3. If so, is it fair to assume that only public datasets (or private datasets that can be accessed without authentication, e.g. for users behind a firewall) will be supported directly? (My vote: Maybe assume public data only at first, and postpone anything to do with auth?)

@AlbertDeFusco
Contributor

I will start with @jbednar's comment:

  1. Yes, absolutely. I have been focused on non-packaging use cases, i.e. Data Science workflows, but have not yet expressed workflows beyond environments and commands.
  2. Not really: a project owner/developer is free to integrate with any remote sources, but today it is their responsibility. Caching large datasets (and keeping them out of the repo or archive) can be handled bespoke in the project or with a 3rd-party tool.
     1. For example, if the owner's chosen data-access tool requires the consumer to have an API key or other credentials, the requirement can be expressed as an env var; if that env var has no default value, the consumer must set it before the command can run (I think this is not yet carefully documented):

        ```yaml
        commands:
          my_cmd:
            cmd: ...
            variables:
              API_KEY:
        ```

        ```
        ❯ conda project run my_cmd
        CondaProjectError: The following variables do not have a default value and values
        were not provided in the .env file or set on the command line when executing 'conda project run':
        API_KEY
        ```

  3. I believe there are some ways that we can get both and provide a way forward.

@maximlt, the issue I want to work through is: whose responsibility is it? The project owner/developer, the project consumer, or the project framework? One interpretation is that if your project requires access to large remote datasets, why not be entirely responsible for that in your code? Have you considered or found a 3rd-party data tool that can fill your needs?

A few 3rd party tools that I'm aware of in this space are:

Is it that there are no good 3rd party tools that fit your needs? What would be the best-case user experience for a project owner/dev and project consumer, not necessarily how anaconda-project implemented it? Are you in need of an external data storage/retrieval system that is accessible across multiple projects?

My initial reaction is to

  • find a 3rd party tool to integrate into conda-project
  • develop a pluggable system in conda-project where command dependencies can be expressed
    • going as far as having plugin packages that provide 3rd party integrations

@jbednar

jbednar commented Nov 20, 2024

Also consider https://www.fatiando.org/pooch

@maximlt
Author

maximlt commented Nov 21, 2024

We are a bit of a special case, as we have a lot of anaconda-project examples that rely on its downloads feature, so it's tempting for us to just ask conda-project, its successor, to re-implement the feature :) Our use of Intake on examples.holoviz.org is also, I think, a bit of a stretch, as I'd imagine its place lies more in a larger org where catalogs are curated and shared across collaborators, referencing many data sources. In our case, we have multiple tiny catalogs without a large variety of data source types. So Intake doesn't abstract much code for us, and that code needs to be maintained and documented in Intake itself or its plugins, which we've seen is a bit challenging.

> Yes, absolutely. I have been focused on non-packaging usecases, i.e. Data Science workflows. But, as yet have not expressed workflows beyond environments and commands

Thanks for making this clear! I'd note that there's a vast array of applications (non-libraries) like CLI, GUI, and web apps that don't fall under "Data Science workflows". At the moment, I'd choose pixi or uv if I started a project of that kind, as they're both quickly gaining adoption and contributors. conda-project focusing on "Data Science workflows" is good news (it wasn't clear to me until very recently): it means it can concentrate on covering this use case while the other tools stay generic, and it can be adapted to its target audience (usually non-developers, who e.g. need more user-friendly docs). Given this scope, adding an equivalent to the downloads feature seems appropriate, as it is a very handy way to prepare and share a project. For example, the other day I had to work offline on a train; having prepared two projects beforehand with anaconda-project, I knew that would be enough to get me going.

> Also consider https://www.fatiando.org/pooch

Just to add to the list, I randomly found this project today https://github.com/dlt-hub/dlt

@jbednar

jbednar commented Nov 21, 2024

> One interpretation is that if your project requires access to large remote datasets why not be entirely responsible for that in your code?

That's a good question. It's for didactic reasons -- I want to avoid sidetracking a reader of my code with details about how to obtain the data, and especially about how to cache a large dataset to make it practical to work with. Data access is a complicated topic with details that differ for every person and every dataset and every organization. Both in my former career as a scientist and in my current career as a maintainer of HoloViz tools and docs, the story I'm telling with a given notebook or project is rarely a story about data access; it's a story about how to use our viz and analysis tools, how we came up with our scientific results, the cool things you can show once you do have the data, etc.

If a data-access abstraction tool like Intake had become successful as the way to access data that everyone buys into, then I'd be happy to put a simple intake call at the start of each notebook in examples.holoviz.org, and leave the data-access details for an Intake configuration file for someone to inspect if they want. But right now, if each notebook starts with a call to Intake, users won't recognize what they themselves should do with their own data besides "learn how to use Intake", which isn't necessarily advice I'd actually give them. Instead I greatly prefer having the notebook show something familiar like pd.read_csv that will translate directly to their own work without forcing them to buy into Intake. But I don't want the notebook to show details about where the data comes from like fetching from S3, caching it, etc. -- all those things detract from the actual story in the notebook, and users will have their own way to fetch their own data anyway.

Our setup with anaconda-project right now isn't perfect, at least partly because no one has anaconda-project installed already and most people haven't heard of it. But it does have the advantage that once the project has been fetched and prepared, the data is there, without any code in the notebook itself showing the details of how it was fetched. It may not be defensible to continue doing things this way with conda-project, and if so, we can deal with it. But I'd guess that academic users, library example authors, and tutorial authors would all be glad to hide data-access details in any material that's not explicitly about data access, so I don't think my situation is an edge case.
