Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add B.1 build, downsample B.1 in non-B.1 builds, rename to mpox #171

Merged
merged 25 commits into from
Sep 20, 2023

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Aug 10, 2023

New build names/URLs no longer include the deprecated monkeypox name

New paths are as follows (this can be discussed):

  • monkeypox/mpxv -> mpox/mpxv
  • monkeypox/hmpxv1 -> mpox/mpxv-IIb (use official clade IIb in name, instead of unofficial hmpxv1)

Add new build:

  • mpox/mpxv-IIb-B.1 which is purely B.1 and sub-lineages

Change subsampling of the two non-B.1 builds to include only relatively few B.1s so as to allow examination of non-B.1 diversity. Those who want to study B.1 should use that dedicated build instead.

Sampling for the two non-B.1 builds is so that (almost) all available good quality non-B.1 sequences are included.

The B.1 build includes almost all B.1 sequences.

This is how the IIb build looks like for example:
image

Testing

Follow up

  • Change nextstrain.org manifest to point to new build urls (done)

The AUGUR_RECURSION_LIMIT is set to 10,000 by default since Augur 22.0.0
and the workflow's minimum required Augur version is now 22.2.0¹

¹ https://github.com/nextstrain/monkeypox/blob/5216a13c407d0e842b40951907c49de6fc6b967a/Snakefile#L5
@corneliusroemer corneliusroemer changed the base branch from master to trial-builds August 10, 2023 18:21
@corneliusroemer corneliusroemer marked this pull request as ready for review August 11, 2023 13:22
@rneher
Copy link
Member

rneher commented Aug 11, 2023

Thanks, Cornelius. This looks very good to me!

joverlee521 and others added 20 commits August 11, 2023 10:17
Previously, either config_hmpxv1.yaml was used as the default as-is or
all required configuration were to be specified via Snakemake CLI
options. This resulted in much duplication of configuration across
different usages, and the default was never actually used in practice
except for in CI out of convenience of not specifying --configfile.

Switch to a two-tiered configuration setup with (1) a base "default"
tier of common values that can be applied to all workflow usages and (2)
a second tier of usage-specific config values which can also partially override

Since some required config entries (e.g. build_name) do not share common
values across the existing configs, this results in a workflow that is
only usable when additional configuration is defined. This will be
addressed by subsequent commits.
Copy values from the hmpxv1 configuration since that was the previous
default, but change the names to be boilerplate defaults.

Add required entries so that:

1. The default config file serves as a reference for all required keys.
2. The workflow can be run by CI without specifying an additional
   --configfile.
Original reasoning by @tsibley in nextstrain/cli@fab709a:

Running on push _and_ PRs is often redundant.  For PRs, we really care
about the putative merge of the PR branch, and that's what "on:
pull_request" tests.  We typically do not need push-level CI results for
PRs.  On the other hand, CI results for every push to master are nice to
have both as a safety backstop and for the linear chain of CI history it
produces (e.g. to debug the impact of external changes on our CI).

The primary downside I see is that you can no longer push without
opening a PR just to see what CI says, but I think that's an acceptable
tradeoff, especially now that draft PRs are a thing.  To mitigate this
downside, "on: workflow_dispatch" allows CI to be manually dispatched
for a specific branch/tag/commit if you _really_ don't want to open even
a draft PR.

Trimming unnecessary CI jobs reduces the time to completion for CI runs
(good for the dev loop) and reduces organization-level job queuing,
which can negatively impact the workflows of other repos.

Co-authored-by: Thomas Sibley <[email protected]>
subrepo:
  subdir:   "ingest/vendored"
  merged:   "c97df23"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "c97df23"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
`fetch-from-ncbi-virus` from the subrepo is a copy of
`fetch-from-genbank` with some extra options. Remove the copy in this
repo and update references.

The supporting scripts `csv-to-ndjson` and `genbank-url` can also be
removed.
Replace our custom fetch scripts that uses the NCBI Virus API with the
NCBI Datasets CLI commands.

NCBI datasets downloads a virus dataset ZIP file that includes a
metadata JSON Lines file and a sequences FASTA file. To maintain a record
of the single NDJSON file on S3, extract the sequences FASTA file and
format the metadata into a TSV file that are parsed into a single NDJSON
file using `augur curate passthru`. The metadata TSV is created using
the NCBI `dataformat` command so that we do not have to parse the nested
JSON lines files ourselves and header fields are renamed to match the
previous fields we used for NCBI Virus.

The NDJSON file created here no longer includes equivalent fields
for "title" or "publication".
ingest: Switch to NCBI Datasets CLI to fetch data
Bumping to 68GiB to keep at c5.9xlarge instance since there's some
required "headspace" for Batch¹

¹ https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1674586033950349?thread_ts=1674254476.788549&cid=C01LCTT7JNN
This should have been done as a part of #179,
but I totally missed that we have this section in the build's
description.
description: Update "Underlying data" section to NCBI Datasets
Add a config variable `auspice_prefix` that's prepended to auspice.json
And allow setting of that variable in github action submission
New names no longer include the deprecated `monkeypox` name

New mpxv and mpxv-IIb builds downsample B.1* sequences so that they don't overwhelm the tree

Add an mpxv-IIb-B.1 build that contains only B.1* sequences
@corneliusroemer corneliusroemer merged commit f2bed24 into trial-builds Sep 20, 2023
4 checks passed
@corneliusroemer corneliusroemer deleted the mpxv-named-builds branch September 20, 2023 18:33
@victorlin
Copy link
Member

The "commits" and "files changed" tabs are contain unrelated commits/diffs, I think due to the way the rebase was applied. This should be more reflective of actual changes from this PR: 942c1d0...c735b6a

@victorlin victorlin mentioned this pull request Sep 27, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants