Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connect ingest and phylogenetic workflows to follow pathogen-repo-guide #19

Merged
merged 6 commits into from
Mar 16, 2024

Conversation

kimandrews
Copy link
Contributor

Description of proposed changes

Connect ingest and phylogenetic workflows, following the pathogen-repo-guide:

  • Upload ingest output to S3
  • Download ingest output from S3 to phylogenetic directory
  • Use accession column as ID column for phylogenetic analysis (addresses this github issue: Use accession number in phylogenetic workflow #11)
  • Add color scheme matching the new region name format

This PR focuses on making changes that are necessary to move forward with creating a locus-specific workflow in a later PR.

Related issue(s)

#11

Checklist

  • Checks pass

@kimandrews kimandrews requested a review from a team March 11, 2024 18:28
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kimandrews!

I skimmed through the code changes and they look alright, but didn't review in depth. Thanks for uploading the new metadata/sequences to S3.

I ran through the phylogenetic workflow and it works as expected. The steps to upload the dataset to s3://nextstrain-data (i.e. to make it live) aren't yet here, but with the new ingest & phylogenetics being part of the master branch I think the new dataset should be made live. You could do this manually now and add the code in a later PR, but we should try to keep the pipeline in-sync with the live dataset.

Auspice flags the following error, which is not part of your PR but should be fixed at some point:

[Genome annotation] V has length 899 which is not a multiple of 3

An aside I noticed here: We should allow augur tree and augur align to run with more than one thread. You can see an example of how to connect snakemake and augur's --nthreads argument here.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here look good to me, I've only left a couple of non-blocking comments.

Feel free to merge and get other fixes in via separate PRs.

phylogenetic/defaults/config.yaml Outdated Show resolved Hide resolved
phylogenetic/example_data/metadata.tsv Outdated Show resolved Hide resolved
The same samples are used for example data as were previously used,
but now they have been pulled from the ingest output.
Ambiguous dates were manually fixed in the example data for samples
JN635406.1, JN635408.1, and EF565859.1, based on dates found here:
https://github.com/nextstrain/fauna/blob/dd20a1fad51433e0bdc206f4300635c0f93f8599/source-data/measles_date_fix.tsv
@kimandrews kimandrews force-pushed the use-new-ingest-output branch from 5390f10 to 9e09647 Compare March 14, 2024 23:34
@kimandrews kimandrews merged commit 0855c99 into main Mar 16, 2024
32 checks passed
@kimandrews kimandrews deleted the use-new-ingest-output branch March 16, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants