-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add B.1 build, downsample B.1 in non-B.1 builds, rename to mpox #171
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The AUGUR_RECURSION_LIMIT is set to 10,000 by default since Augur 22.0.0 and the workflow's minimum required Augur version is now 22.2.0¹ ¹ https://github.com/nextstrain/monkeypox/blob/5216a13c407d0e842b40951907c49de6fc6b967a/Snakefile#L5
corneliusroemer
force-pushed
the
mpxv-named-builds
branch
from
August 10, 2023 18:21
4410522
to
50f5b47
Compare
Thanks, Cornelius. This looks very good to me! |
rneher
approved these changes
Aug 11, 2023
victorlin
requested changes
Aug 11, 2023
Snakefile: remove AUGUR_RECURSION_LIMIT
Previously, either config_hmpxv1.yaml was used as the default as-is or all required configuration were to be specified via Snakemake CLI options. This resulted in much duplication of configuration across different usages, and the default was never actually used in practice except for in CI out of convenience of not specifying --configfile. Switch to a two-tiered configuration setup with (1) a base "default" tier of common values that can be applied to all workflow usages and (2) a second tier of usage-specific config values which can also partially override Since some required config entries (e.g. build_name) do not share common values across the existing configs, this results in a workflow that is only usable when additional configuration is defined. This will be addressed by subsequent commits.
Copy values from the hmpxv1 configuration since that was the previous default, but change the names to be boilerplate defaults. Add required entries so that: 1. The default config file serves as a reference for all required keys. 2. The workflow can be run by CI without specifying an additional --configfile.
Original reasoning by @tsibley in nextstrain/cli@fab709a: Running on push _and_ PRs is often redundant. For PRs, we really care about the putative merge of the PR branch, and that's what "on: pull_request" tests. We typically do not need push-level CI results for PRs. On the other hand, CI results for every push to master are nice to have both as a safety backstop and for the linear chain of CI history it produces (e.g. to debug the impact of external changes on our CI). The primary downside I see is that you can no longer push without opening a PR just to see what CI says, but I think that's an acceptable tradeoff, especially now that draft PRs are a thing. To mitigate this downside, "on: workflow_dispatch" allows CI to be manually dispatched for a specific branch/tag/commit if you _really_ don't want to open even a draft PR. Trimming unnecessary CI jobs reduces the time to completion for CI runs (good for the dev loop) and reduces organization-level job queuing, which can negatively impact the workflows of other repos. Co-authored-by: Thomas Sibley <[email protected]>
subrepo: subdir: "ingest/vendored" merged: "c97df23" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "c97df23" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
`fetch-from-ncbi-virus` from the subrepo is a copy of `fetch-from-genbank` with some extra options. Remove the copy in this repo and update references. The supporting scripts `csv-to-ndjson` and `genbank-url` can also be removed.
Replace our custom fetch scripts that uses the NCBI Virus API with the NCBI Datasets CLI commands. NCBI datasets downloads a virus dataset ZIP file that includes a metadata JSON Lines file and a sequences FASTA file. To maintain a record of the single NDJSON file on S3, extract the sequences FASTA file and format the metadata into a TSV file that are parsed into a single NDJSON file using `augur curate passthru`. The metadata TSV is created using the NCBI `dataformat` command so that we do not have to parse the nested JSON lines files ourselves and header fields are renamed to match the previous fields we used for NCBI Virus. The NDJSON file created here no longer includes equivalent fields for "title" or "publication".
ingest: Switch to NCBI Datasets CLI to fetch data
Bumping to 68GiB to keep at c5.9xlarge instance since there's some required "headspace" for Batch¹ ¹ https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1674586033950349?thread_ts=1674254476.788549&cid=C01LCTT7JNN
This should have been done as a part of #179, but I totally missed that we have this section in the build's description.
description: Update "Underlying data" section to NCBI Datasets
Add a config variable `auspice_prefix` that's prepended to auspice.json And allow setting of that variable in github action submission
New names no longer include the deprecated `monkeypox` name New mpxv and mpxv-IIb builds downsample B.1* sequences so that they don't overwhelm the tree Add an mpxv-IIb-B.1 build that contains only B.1* sequences
corneliusroemer
force-pushed
the
mpxv-named-builds
branch
from
September 20, 2023 16:39
8426e3d
to
ea5e21b
Compare
1 task
1 task
The "commits" and "files changed" tabs are contain unrelated commits/diffs, I think due to the way the rebase was applied. This should be more reflective of actual changes from this PR: 942c1d0...c735b6a |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New build names/URLs no longer include the deprecated
monkeypox
nameNew paths are as follows (this can be discussed):
monkeypox/mpxv
->mpox/mpxv
monkeypox/hmpxv1
->mpox/mpxv-IIb
(use officialclade IIb
in name, instead of unofficialhmpxv1
)Add new build:
mpox/mpxv-IIb-B.1
which is purely B.1 and sub-lineagesChange subsampling of the two non-B.1 builds to include only relatively few B.1s so as to allow examination of non-B.1 diversity. Those who want to study B.1 should use that dedicated build instead.
Sampling for the two non-B.1 builds is so that (almost) all available good quality non-B.1 sequences are included.
The B.1 build includes almost all B.1 sequences.
This is how the IIb build looks like for example:
Testing
Follow up