Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rethinkdb required for seasonal flu builds to work #3

Closed
huddlej opened this issue Oct 25, 2022 · 10 comments
Closed

rethinkdb required for seasonal flu builds to work #3

huddlej opened this issue Oct 25, 2022 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@huddlej
Copy link
Contributor

huddlej commented Oct 25, 2022

Current Behavior

Using the Nextstrain CLI with the managed conda environment, I tried to download data from fauna (the first step of the seasonal flu builds) like so:

nextstrain build --conda . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv

This command failed with a module not found error because fauna needs the rethinkdb module and it is not installed in the managed conda environment. In contrast, the following command works as expected:

nextstrain build --docker . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv

Expected behavior

Managed conda environment should behave like Docker environment.

How to reproduce

Steps to reproduce the current behavior:

  1. Install the CLI from the trs/conda/nextstrain-base-package branch of the GitHub repo
  2. Setup the managed conda environment (NEXTSTRAIN_CONDA_CHANNEL=nextstrain/label/pull-1 nextstrain setup conda, NEXTSTRAIN_CONDA_CHANNEL=nextstrain/label/branch-initial nextstrain update conda, and nextstrain setup --set-default conda)
  3. Clone the seasonal flu repo and git checkout refactor-workflow
  4. From the seasonal flu repo, run nextstrain build --conda . --forceall --configfile profiles/nextflu-private.yaml -p data/h1n1pdm/who_cell_hi_titers.tsv
  5. See error.

Possible solution

Since I have fauna installed in the parent directory of my workflow (where the workflow expects to find it), installing rethinkdb should be enough to fix this issue. We should not need to install fauna.

@huddlej
Copy link
Contributor Author

huddlej commented Oct 26, 2022

I looked into what we'd actually need to include rethinkdb in this environment. Some important factors are:

To include the specific rethinkdb version we need for fauna in this base environment, it looks like we need to create our own conda package for this version. We could host it in the Nextstrain channel. Bioconda is not an appropriate place for it and I don't think we want to support a conda-forge package for rethinkdb (or give the impression we are responsible for rethinkdb).

@tsibley Does this summary make sense? I'd love to learn how to use our Nextstrain channel through Anaconda, so I could try setting up the rethinkdb package.

@tsibley
Copy link
Member

tsibley commented Oct 26, 2022

Thanks for the great description here, @huddlej! I was wondering how fast

# XXX TODO: Include fauna?

was going to come back to haunt me. Pretty fast it turns out! 🙃

Agreed that the thing to do here given the constraints you outlined is to produce our own Conda package for the Python RethinkDB bindings and host it in our channel. I think this should be relatively straightforward (if involving some minor tedium), and I'd be happy to help guide you through it.

Some things to consider:

  • We might name this something other than just rethinkdb, e.g. nextstrain-rethinkdb or fauna-rethinkdb or nextstrain-fauna-rethinkdb, in order to ensure it's only used for the singular purpose of satisfying Fauna.
  • If the Python package can be noarch, then we don't necessarily even need CI setup for it since we expect to build a single version of the package and then not do it again (or do it very rarely). Local, manual rebuilds + uploads would be fine in that case.
  • We should consider tracking the recipe source in this repo (in some subdir) instead of in another repo to further indicate it's really only for this conda-base package.
  • Would it be worth also packaging up Fauna? Or not at all because you need an uninstalled copy of the source available to run anything anyway?

@huddlej
Copy link
Contributor Author

huddlej commented Oct 26, 2022

Cool, let's do it. Maybe end of next week? This feels like a nice Friday afternoon kind of task...

I like nextstrain-fauna-rethinkdb for its specificity. I don't think we need to package fauna; like you said, most workflows that still use it expect to find fauna as a sibling directory of the workflow directory.

@tsibley
Copy link
Member

tsibley commented Oct 27, 2022

It's a plan!

…most workflows that still use it expect to find fauna as a sibling directory of the workflow directory.

Yeah, the Docker runtime intentionally locates Fauna's source at /nextstrain/fauna so its a sibling of /nextstrain/build (the working dir into which the host dir is mounted). I wonder if the Conda runtime could help arrange a similar sibling dir somehow too during nextstrain build. Will idly ponder it.

@tsibley tsibley assigned tsibley and huddlej and unassigned tsibley Oct 27, 2022
@huddlej
Copy link
Contributor Author

huddlej commented Oct 27, 2022

@tsibley Based on our other in person discussion yesterday, I wonder if we should focus instead on migrating remaining fauna-based workflows to our S3-hosted data approach. This is always the issue of deciding how long to keep supporting a legacy system that everyone relies on, but if I had to choose between a) running the seasonal flu workflow as it is with managed conda environment and b) running the seasonal flu workflow with S3-hosted data, I would pick the latter.

@tsibley
Copy link
Member

tsibley commented Oct 27, 2022

@huddlej Ah, indeed, my preference would be to advocate for (b) too, but I guess I don't see this as having to be either-or. I don't think it'll take very long to make the nextstrain-fauna-rethinkdb package. I'd be happy to Just Do It (and stop if it turns out not to be easy), but also wanted to support your interest in learning about using our Conda channel. :-)

@huddlej
Copy link
Contributor Author

huddlej commented Nov 1, 2022

I'd rather push for the S3-hosted data, instead, unless anyone else on the team has a strong desire to run fauna-based builds with the managed Conda environment right now. This might just be @joverlee521 and @j23414 right now?

@joverlee521
Copy link
Contributor

I only run the fauna-based builds with Docker so no desire to include fauna/rethinkdb here. I would be happy to push for (b).

@huddlej huddlej moved this from New to Backlog in Nextstrain planning (archived) Nov 2, 2022
@tsibley
Copy link
Member

tsibley commented Dec 8, 2022

Closing this as won't fix. We can re-open if (b) doesn't come to pass in a reasonable time and we decide to just make nextstrain-fauna-rethinkdb.

@tsibley tsibley closed this as not planned Won't fix, can't repro, duplicate, stale Dec 8, 2022
Repository owner moved this from Backlog to Done in Nextstrain planning (archived) Dec 8, 2022
@corneliusroemer
Copy link
Member

Came here from nextstrain/docker-base#222

A few thoughts:

rethinkdb does not have an official conda package for the Python bindings. There is an unofficial package for version 2.0.0. Even if fauna works with this older version of rethinkdb (I haven't checked), I don't think we should rely on channels from individual users.
To include the specific rethinkdb version we need for fauna in this base environment, it looks like we need to create our own conda package for this version. We could host it in the Nextstrain channel. Bioconda is not an appropriate place for it and I don't think we want to support a conda-forge package for rethinkdb (or give the impression we are responsible for rethinkdb).

I don't see why one couldn't create a conda-forge package and include this particular version, as well as more recent versions.

Creating a conda-forge package in no way implies responsibility for underlying software.

With rethinkdb only releasing new versions very sporadically, it wouldn't be hard to keep it up to date - there's also no obligation to do so.

So in principle I don't see why one couldn't add fauna to bioconda and the dependencies to conda-forge/bioconda as required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

No branches or pull requests

4 participants