Replace anaconda-project downloads #444

Open
maximlt opened this issue Nov 12, 2024 · 11 comments

@maximlt
Contributor

maximlt commented Nov 12, 2024

anaconda-project has a handy feature that lets you declare a series of files to download (and optionally unzip) when preparing a project (see https://anaconda-project.readthedocs.io/en/latest/user-guide/reference.html#file-downloads). Some day we will need to replace anaconda-project with another tool (e.g. conda-project, pixi), none of which provide this feature at the moment. To prepare for this transition, we'll need to find an alternative way to download data.

Features we use:

  • Unzip
  • Declare the filename to save as, or the folder (filename: data) into which an archive is unzipped
  • No re-download when the file is found in its target location. We rely on this feature for testing purposes: we move test data to the right place before preparing the project
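
For illustration, the three features above could be covered by a small standard-library helper. This is only a sketch under assumptions: the function name `fetch` and the temporary `.part` suffix are hypothetical, and it does not try to reproduce anaconda-project's exact unzip behavior.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch(url, target, unzip=False):
    """Download `url` to `target`, or extract it into `target` when unzip=True.

    Nothing is downloaded if `target` already exists, so test data moved
    into place beforehand is left untouched.
    """
    target = Path(target)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name so an interrupted download is not
    # mistaken for a complete file on the next run.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    if unzip:
        with zipfile.ZipFile(tmp) as zf:
            zf.extractall(target)  # `target` becomes the data folder
        tmp.unlink()
    else:
        tmp.replace(target)
    return target
```

A call such as `fetch(url, "data", unzip=True)` would then play the role of a download entry with `filename: data`.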

Potential alternatives:

@maximlt
Contributor Author

maximlt commented Nov 12, 2024

Noting that the ability to download data while preparing the project/environment is useful for instance when an example is deployed. Otherwise, if an example requires a large dataset to be downloaded, the first user is going to have to wait a little too long :) No big deal but not great.

@maximlt
Contributor Author

maximlt commented Nov 15, 2024

cc @jbednar as I am aware this is a topic you're thinking about these days.

@maximlt
Contributor Author

maximlt commented Nov 15, 2024

A somewhat outdated paper (5 years old) by the authors of pooch states that the only alternatives they knew of were fsspec and intake: https://github.com/fatiando/pooch/blob/main/paper/paper.md.

@maximlt
Contributor Author

maximlt commented Nov 15, 2024

If we don't want to commit to Intake anymore and no replacement tool meets our needs, then I can imagine standardizing on a data.py file in each project that needs external data. It would have a command line interface (argparse) to download (and unarchive) the data with the tool we choose (e.g. pooch), and also to set up the test data. We'd run this file with the right command line arguments before testing/building the project. Then, to read the data, we'd either do it directly in the notebook when it's simple (e.g. df = pd.read_csv('data/dataset.csv')) and/or have a utility function in data.py that hides/abstracts this for the more complex cases, as is done with Intake (e.g. from data import get_complex_data; ds = get_complex_data()).
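
A rough sketch of what such a data.py could look like, to make the idea concrete. Everything here is an assumption for illustration: the `download` and `test-data` subcommands, the `test_data/` folder, and the example.com URL are placeholders. It uses `pooch.retrieve`, which needs pooch installed and does not re-download a file that is already present; handling archives would additionally need a processor such as `pooch.Unzip`.

```python
"""data.py: hypothetical per-project data helper (names and URL are placeholders)."""
import argparse
import csv
import shutil
from pathlib import Path

DATA_DIR = Path("data")
TEST_DATA_DIR = Path("test_data")  # assumed folder of small files kept in the repo
# filename -> URL; the URL is a placeholder, not a real dataset.
DATASETS = {"dataset.csv": "https://example.com/dataset.csv"}


def download(data_dir=DATA_DIR):
    """Fetch every dataset into data_dir; files already there are not re-downloaded."""
    import pooch  # third-party; imported lazily so `test-data` works without it

    for fname, url in DATASETS.items():
        # known_hash=None skips checksum verification (acceptable for a sketch).
        pooch.retrieve(url, known_hash=None, fname=fname, path=str(data_dir))


def setup_test_data(data_dir=DATA_DIR, test_dir=TEST_DATA_DIR):
    """Copy the small test files to where the real data would be downloaded."""
    data_dir.mkdir(parents=True, exist_ok=True)
    for src in test_dir.iterdir():
        if src.is_file():
            shutil.copy(src, data_dir / src.name)


def get_complex_data(data_dir=DATA_DIR):
    """Hide the loading details from the notebook (trivial here: rows as dicts)."""
    with open(data_dir / "dataset.csv", newline="") as f:
        return list(csv.DictReader(f))


def main(argv=None):
    parser = argparse.ArgumentParser(description="Prepare the project's data.")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("download", help="download the full datasets")
    sub.add_parser("test-data", help="copy the small test datasets into place")
    args = parser.parse_args(argv)
    if args.command == "download":
        download()
    elif args.command == "test-data":
        setup_test_data()
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
```

With this layout, CI would run `python data.py test-data` before testing and `python data.py download` before building, and a notebook would only need `from data import get_complex_data`.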

@jbednar
Contributor

jbednar commented Nov 15, 2024

Are we not going to be able to have this capability in conda-project? Would be good to discuss that with the conda-project developers and see what the right approach could be. Projects do generally need to have data or they won't be useful...

@maximlt
Contributor Author

maximlt commented Nov 15, 2024

> Are we not going to be able to have this capability in conda-project? Would be good to discuss that with the conda-project developers and see what the right approach could be.

I don't know; other tools like uv, poetry, and pixi don't have that built in. I'm not sure I want to push this feature request, but feel free to do so! What I'm also uncomfortable with is the feeling that we're re-creating anaconda-project, and the idea of being locked into a tool (which so far has no users) because of a unique feature.

> Projects do generally need to have data or they won't be useful...

Data projects, yes, but not all application projects need data (e.g. a simple GUI), and library projects usually don't.

@jbednar
Contributor

jbednar commented Nov 15, 2024

Sure, but uv, poetry, and pixi aren't specifically made for data projects like those in this repo, and conda-project is, so here I'm talking about data projects. Plus the number of data projects greatly outweighs the number of application projects. E.g. there are currently 10 million Jupyter Notebooks on Github, versus maybe some hundreds of thousands of libraries that get packaged up. So I'm concerned about having a good solution for data projects, whether that solution is in conda-project or via some other tool.

@maximlt
Contributor Author

maximlt commented Nov 16, 2024

> Sure, but uv, poetry, and pixi aren't specifically made for data projects like those in this repo, and conda-project is, so here I'm talking about data projects.

Really? There's not a single mention of the word "data" in conda-project's README (https://github.com/conda-incubator/conda-project).

> Plus the number of data projects greatly outweighs the number of application projects. E.g. there are currently 10 million Jupyter Notebooks on Github, versus maybe some hundreds of thousands of libraries that get packaged up.

In my mind, application projects don't include libraries but things like APIs, CLIs, GUIs, scripts, etc. I can't tell if there are more of them than data application projects, but yes, for sure there are many data projects out there.

> So I'm concerned about having a good solution for data projects, whether that solution is in conda-project or via some other tool.

I'd also love to have a good solution for data projects. But as someone who got to maintain Examples for a little while, I wouldn't commit to a tool that makes it more difficult to maintain Examples (not well maintained, low adoption, etc.). In which case, I'd rather rely on something custom that can easily be migrated if need be.

@jbednar
Contributor

jbednar commented Nov 20, 2024

Yes, really. :-) The conda-project README says:

> Sharing your work is more than sharing your code in a script file or notebook. To make your work properly reproducible, it is necessary to include the list of required third-party dependencies, specifications for how to run your code, and any other files that it may need.

The "other files" includes data; what else would that be? Then it links to my "8 Levels of Reproducibility", which was written about data projects, or at least notebooks or dashboards rather than libraries or APIs or CLIs. Then it says:

> This package is intended as a successor to Anaconda Project.

Which in turn says:

> Tool for encapsulating, running, and reproducing data science projects.
>
> Take any directory full of stuff that you're working on; web apps, scripts, Jupyter notebooks, data files, whatever it may be.
> By adding an anaconda-project.yml to this project directory, a single anaconda-project run command will be able to set up all dependencies and then launch the project.

So sure, conda-project needs some better, clearer docs, but I consider it to be coming very clearly from a perspective of "package up some code with all the stuff needed to reproduce a result" rather than something like "I have written a library I want to share with other people who will then import it" or "I have written an end-user application that I want to publish on an app store".

> But as someone who got to maintain Examples for a little while, I wouldn't commit to a tool that makes it more difficult to maintain Examples (not well maintained, low adoption, etc.). In which case, I'd rather rely on something custom that can easily be migrated if need be.

Well, conda-project isn't something that came from heaven; it was written by some co-workers of yours, and so I think you can either (1) contribute to making it be something that meets your needs, (2) write something completely custom, or (3) find something that already meets your needs. I haven't seen (3) show up in this thread or elsewhere, and between 1 and 2 I'd vote for 1, since collaborating on a shared tool that we together make into something valuable seems much better than us developing some custom solution just for our narrow use case, which would mean something with even lower adoption and even worse maintenance.

@maximlt
Contributor Author

maximlt commented Nov 20, 2024

I've opened an issue on conda-project to ask about that feature: conda-incubator/conda-project#176

> Well, conda-project isn't something that came from heaven; it was written by some co-workers of yours, and so I think you can either (1) contribute to making it be something that meets your needs, (2) write something completely custom, or (3) find something that already meets your needs. I haven't seen (3) show up in this thread or elsewhere, and between 1 and 2 I'd vote for 1, since collaborating on a shared tool that we together make into something valuable seems much better than us developing some custom solution just for our narrow use case, which would mean something with even lower adoption and even worse maintenance.

What I want more than anything else is that, when we decide to migrate away from anaconda-project (or are forced when it starts to break, e.g. with a new Python version), we pick a tool that is already widely used.

@jbednar
Contributor

jbednar commented Nov 20, 2024

To be clear, "data project" does not necessarily imply that there is the ability to fetch data; it just means that we are expecting that most projects will somehow work with data. Fetching data is only crucial when datasets are much larger than the rest of the project such that it makes sense to treat them differently. So while I strongly consider conda-project to be about data projects primarily, whether it should have functionality about fetching data is a separate question best discussed at that issue.


2 participants