Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove pathogen-specific tools from base runtimes #7

Open
victorlin opened this issue Nov 22, 2024 · 9 comments
Open

Remove pathogen-specific tools from base runtimes #7

victorlin opened this issue Nov 22, 2024 · 9 comments
Labels
proposal Proposals that warrant further discussion

Comments

@victorlin
Copy link
Member

victorlin commented Nov 22, 2024

This applies to docker-base and conda-base.

Context

Our base image has accumulated various pathogen-specific tools over time, some of which signficantly contribute to build time and image size. By removing these pathogen-specific tools, we can ensure the base image/environment reflects a continually updated version of Nextstrain tools and their dependencies. Using fauna as an example, more detailed reasoning is in nextstrain/fauna#170.

Candidates

  • fauna, only used by avian-flu and seasonal-flu
  • evofr, only used by forecasts-ncov
  • pango_aliasor, only used by forecasts-ncov
  • epiweeks, only used by ncov
  • pathogen-embed, only used by seasonal-flu
  • Remove pangoLEARN docker-base#157
  • ?

Note

This seems like the right move for Fauna, but I'm not sure how far we want to take it. As we expand the number of core pathogens that rely on runtimes, the common base will only get smaller.

Workarounds

The ad-hoc approach of defining/creating custom runtimes by extending a base (examples: 1, 2, 3) has been used internally to some extent. The process is quite involved and not ideal for users at large.

@victorlin victorlin added the proposal Proposals that warrant further discussion label Nov 22, 2024
@huddlej
Copy link

huddlej commented Nov 22, 2024

Generally, I would love to have a way to use pathogen-specific Docker images in our workflows! That's been my dream since we added pango-learn to the base image back in the early pandemic.

For the specific candidates you mentioned for removal, I can make some specific notes:

  • fauna: Most of our flu workflows source data from S3 now, with a separate action that transfers data from fauna to S3. As long as that separate transfer action has fauna in its runtime, we don't strictly need fauna for the main workflows.

  • evofr and pathogen-embed: Although these tools are only publicly used by forecasts-ncov right now, we do plan to use them for other projects (e.g., adding t-SNE embeddings in avian flu workflow). Providing these general tools to the public through our runtimes also encourages their use outside of our group. Maybe we could spend a little time trying to optimizing the installation time/size of these tools in our runtimes? There could be some larger dependencies of these tools that we could replace with slimmer alternatives.

  • pango_aliasor: I can see why you'd want to keep this ncov-specific tool out of the base image, but it's also a great example of a tiny package that is helpful to any ncov analyses outside of forecasts-ncov that it doesn't seem to hurt to keep it.

  • epiweeks: this package is also used by seasonal flu could plausibly used more elsewhere (e.g., measles). It's also a pretty tiny package.

I would also recommend removing would be the pango-learn packages and its binary dependencies of gofasta and minimap2, since all of our pango annotations come from Nextclade now.

@jameshadfield
Copy link
Member

For each pathogen/project that relies on tools that may be removed, create and use a custom runtime that installs the tools. Right now the process may be more involved than it should be, and we should provide a good path for extending the base runtimes (examples: docker, conda).

This is the other half of workflows as programs, namely the "the artifacts/bundling (keyword: buildpacks) side of things", no?

(And yes, we should totally do this if it's at all feasible -- there are a number of times I haven't done something because I know it's going to be such a hassle / burden to make the needed dependency available to our runtimes.)

@tsibley
Copy link
Member

tsibley commented Nov 22, 2024

This is the other half of workflows as programs, namely the "the artifacts/bundling (keyword: buildpacks) side of things", no?

Yes, precisely. The whole idea there is that instead of having runtimes and pathogens separately, we have pathogens that are (or contain) their runtimes. We want to avoid having N pathogens and N×M pathogen-runtimes and making the user match them.

The implementation examples Victor gave (and things like ncov-ingest's image) are coming at this from what I'd call a more ad-hoc approach, and I do not think we should go down that path as a way to get to custom runtimes per pathogen. That way lies ecosystem fragmentation and incurs significant usability costs (to both users and developers, us and others).

There's lots of considerations of this work. For example, our runtimes are not small when installed on disk. We're going to want to be able to share a concrete, installed base across pathogens (not just a conceptual base).

We'll also want to consider the cost vs. benefits of moving something out of the base runtimes; it will have non-trivial overhead (both conceptual and actual) and we should only do it when it's worth it. I'm not convinced many candidates given above meet that threshold? What concretely are we gaining with the removal of each?

@tsibley
Copy link
Member

tsibley commented Nov 22, 2024

@jameshadfield

there are a number of times I haven't done something because I know it's going to be such a hassle / burden to make the needed dependency available to our runtimes

Do you have examples? They would be very helpful to guide both eventual work on this topic but also suggest pain points we might be able to alleviate now with the current base runtimes.

@jameshadfield
Copy link
Member

Do you have examples?

The one I was reminded of with this week's avian-flu work is nextstrain/avian-flu#80. There have been a bunch of others along the lines of "can't use this pip dependency, not in our runtimes" but I managed to find an alternate solution so it wasn't a dealbreaker.

@victorlin
Copy link
Member Author

Thanks for the discussions! It's clear that there is work to be done on the dependencies/runtimes front. I don't think there is an urgent need to remove anything now, but it's nice to have discussions about this alongside the ongoing workflows as programs work. @tsibley has a good point with

The implementation examples Victor gave (and things like ncov-ingest's image) are coming at this from what I'd call a more ad-hoc approach, and I do not think we should go down that path as a way to get to custom runtimes per pathogen. That way lies ecosystem fragmentation and incurs significant usability costs (to both users and developers, us and others).

I've updated the issue description to be more open-ended, noting the ad-hoc approach as the only known workaround in the absence of a better solution.

@huddlej
Copy link

huddlej commented Dec 5, 2024

I'm not convinced many candidates given above meet that threshold? What concretely are we gaining with the removal of each?

I'm 99% sure that no one uses pangoLEARN anymore, especially not the deprecated versions we've pinned. Removing it would allow us to no longer wonder whether we need it anymore (and regain some valuable parking lots ;)). I just asked the original users of these tools on Slack whether they still use these tools.

@huddlej
Copy link

huddlej commented Dec 5, 2024

Dan Lu confirmed on Slack that they're no longer using pangoLEARN. I also realized that I made an issue for this last year. As @joverlee521 pointed out on that issue, we'll need to remove the pangoLEARN bits from ncov, too. We/I can continue further discussion of this on that issue.

@trvrb
Copy link
Member

trvrb commented Dec 5, 2024

I can picture a bunch of gotchas here if suddenly I have to remember to use different specific non-default Docker images when I run avian-flu, ncov, forecasts-ncov, all of which I'm doing quite frequently.

I agree with Tom here

Yes, precisely. The whole idea there is that instead of having runtimes and pathogens separately, we have pathogens that are (or contain) their runtimes. We want to avoid having N pathogens and N×M pathogen-runtimes and making the user match them.

My strong sense is that the gain in reduced Docker image size in stripping these out would not be worth the developer hassle on maintaining multiple images and then the user hassle of knowing which image to use for which workflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Proposals that warrant further discussion
Projects
None yet
Development

No branches or pull requests

5 participants