-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate to Nextclade v3 #281
Conversation
Since the v3 Nextclade datasets for MPXV and hMPXV both use NC_063383.1 as the reference, we can just run Nextclade once using the MPXV dataset.
Move the downloaded Nextclade dataset to the `data/` directory so that it's automatically ignored by git. Move Nextclade outputs to `results` as they are part of the final outputs of the ingest workflow.
Modified from <nextstrain/measles@faebd64>. I switched the order of `augur curate rename` and `tsv-select` because the Nextclade data includes fields that we don't use that hit the csv field size limit. This commit also includes automated reformatting by `snakefmt`.
Update the `files_to_upload` to match the output files from Nextclade v3.
Remove reference to nextalign and update available files to match the output files from Nextclade v3.
98b856c
to
6b5153c
Compare
Trial run successfully completed and uploaded results to There are many changes across the metadata.tsv compared to the production metadata.tsv because of changes in the Nextclade columns. A majority of these changes come from the I took a quick look at clade changes:
Detailed clade change counts
These changes seem reasonable to me, but would like a quick review from @corneliusroemer |
Adding log/benchmark to all rules in nextclade.smk according to the Nextstrain Snakemake style guide <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
Chatter on Slack regarding workflow runtimes made me realize that since we only run a single Nextclade job now, we no longer need to put a hard limit on the threads. <https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1729629899871479?thread_ts=1729629155.721859&cid=C01LCTT7JNN>
@@ -30,7 +30,7 @@ rule run_nextclade: | |||
# The lambda is used to deactivate automatic wildcard expansion. | |||
# https://github.com/snakemake/snakemake/blob/384d0066c512b0429719085f2cf886fdb97fd80a/snakemake/rules.py#L997-L1000 | |||
translations=lambda w: "results/translations/{cds}.fasta", | |||
threads: 4 | |||
threads: workflow.cores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shaved off 8 mins for the Nextclade job, but only 3 mins off the total workflow.
Using all of the cores prevents Snakemake from scheduling other concurrent jobs such as upload_to_s3
jobs. The upload jobs can be slow as well, so we could adjust the threads to something like workflow.cores * 0.75
.
It's only a couple minute difference so I'm not going to worry about it for now.
Description of proposed changes
Migrates the ingest workflow to Nextclade v3, which simplifies the workflow since we no longer need to use two Nextclade datasets. Includes additional clean ups to use
augur curate rename
andaugur merge
, see commits for details.Related issue(s)
Resolves #280
Checklist