File downloads #176
Comments
Please see the discussion at holoviz-topics/examples#444 for more context. I personally think it's important to support datasets that aren't practical to include directly in a project archive. That said, I primarily maintain publicly accessible projects, which might put me outside the mainstream of enterprise users who work primarily with proprietary data. Those users will encounter thorny issues with authentication and access control for data that scientific, academic, and library-author users like me mostly avoid. So I'd say the questions for conda-project are:
I will start with @jbednar's comment
@maximlt, the issue I want to work through is "whose responsibility is it? Project owner/developer, project consumer, project framework?". One interpretation: if your project requires access to large remote datasets, why not be entirely responsible for that in your code? Have you considered or found a 3rd party data tool that can fill your needs? A few 3rd party tools that I'm aware of in this space are: Is it that there are no good 3rd party tools that fit your needs? What would be the best-case user experience for a project owner/dev and project consumer, not necessarily how anaconda-project implemented it? Are you in need of an external data storage/retrieval system that is accessible across multiple projects? My initial reaction is to
Also consider https://www.fatiando.org/pooch
We are a bit of a special case as we have a lot of
Thanks for making this clear! I'd note that there's a vast array of applications (non-libraries) like CLI, GUI, Web apps, etc. that don't fall into the "Data Science workflows". At the moment, I'd choose
Just to add to the list, I randomly found this project today: https://github.com/dlt-hub/dlt
That's a good question. It's for didactic reasons -- I want to avoid sidetracking a reader of my code with details about how to obtain the data, and especially about how to cache a large dataset to make it practical to work with. Data access is a complicated topic with details that differ for every person, every dataset, and every organization. Both in my former career as a scientist and in my current career as a maintainer of HoloViz tools and docs, the story I'm telling with a given notebook or project is rarely a story about data access; it's a story about how to use our viz and analysis tools, how we came up with our scientific results, the cool things you can show once you do have the data, etc.
If a data-access abstraction tool like Intake had become successful as the way to access data that everyone buys into, then I'd be happy to put a simple intake call at the start of each notebook in examples.holoviz.org, and leave the data-access details in an Intake configuration file for someone to inspect if they want. But right now, if each notebook starts with a call to Intake, users won't recognize what they themselves should do with their own data besides "learn how to use Intake", which isn't necessarily advice I'd actually give them. Instead I greatly prefer having the notebook show something familiar like
Our setup with anaconda-project right now isn't perfect, at least partly because no one already has anaconda-project or has heard of it. But it does have the advantage that once the project has been fetched and prepared, the data is there, without any code in the notebook itself showing the details of how to fetch it. It may not be defensible to continue doing things this way with conda-project, and if so, we can deal with it.
But I'd guess that academic users, library example authors, and tutorial authors would all be glad to obscure the data access details for any material that's not explicitly about data access, so I don't think my situation is an edge case.
anaconda-project allows specifying a list of files to download in its configuration file, with additional features like optional unzip, optional hash verification, optional file renaming, etc. See https://anaconda-project.readthedocs.io/en/latest/user-guide/reference.html#file-downloads for more info.
We use this feature a fair bit on the HoloViz Examples Gallery (50% of the projects): https://github.com/holoviz-topics/examples/.
Is this feature in the scope of conda-project?
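For readers unfamiliar with the feature, here is a rough sketch of the kind of behavior being discussed: fetch a file once, optionally verify its hash, optionally rename it, and optionally unzip it. This is not how anaconda-project or conda-project implements downloads. The function name `fetch`, its parameters, and the `_extracted` directory naming are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch(url, dest, sha256=None, unzip=False):
    """Hypothetical helper: download `url` to `dest` once, then reuse it.

    `dest` doubles as the "rename" option, since the local file name
    does not have to match the remote one.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name so a failed or corrupt download
        # never looks like a valid cached file.
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        if sha256 is not None:
            digest = hashlib.sha256(tmp.read_bytes()).hexdigest()
            if digest != sha256:
                tmp.unlink()
                raise ValueError(f"hash mismatch for {url}: got {digest}")
        tmp.replace(dest)
    if unzip:
        # Extract next to the archive, in a directory named after it.
        target = dest.parent / (dest.stem + "_extracted")
        if not target.exists():
            with zipfile.ZipFile(dest) as archive:
                archive.extractall(target)
        return target
    return dest
```

A project could call something like `fetch("https://example.com/data.zip", "data/data.zip", unzip=True)` (placeholder URL) in a prepare step, so that the notebook itself only reads from the local path and shows no data-access code.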