Fine tuning the Nextclade all dataset #58
6913add to c029f1d
Because I'm seeing similar issues with the genotype-level datasets (e.g. denv4), I'm going to either expand the scope of this PR to also fix DENV1-4 or split those fixes out into separate PRs. I'll start by adding outgroups/reconstructed ancestral sequences for each genotype-level dataset, perhaps as suggested by: nextstrain/nextclade_data#203 (comment)
Manually fix serotype annotations for MZ284953, MZ285732, MZ285058, MW332572 that were flagged during nextclade (all) serotype testing. #58 (comment) Add comments to the annotations.tsv file to explain the changes.
Incorporated the edits and summarized them in nextstrain/nextclade_data#203 (comment)
Manually fix serotype annotations for genbank samples that were flagged during nextclade (all) serotype testing. #58 (comment) Add comments to the annotations.tsv file to explain the changes.
This grabs the root sequence from https://nextstrain.org/dengue/all/genome and swaps it in for the DENV4 reference used for the all serotypes tree.
The alignment of the new root to the original reference contained no gaps, so swapping in the new root sequence ID should be sufficient. The GFF is going to be used in augur translate to annotate the tree, so it needs to match the root sequence.
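As a rough sanity check for this kind of swap, the sketch below compares the sequence lengths of two single-record FASTA files. It is only an illustration: the helper names `fasta_length` and `same_length` and the file names in the usage note are hypothetical, and equal length is a necessary condition for a gap-free alignment, not proof of one.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helpers): compare sequence lengths of two FASTA files
# so that a GFF written against one can plausibly be reused with the other.

# Print the number of sequence characters in a FASTA file, skipping header lines.
fasta_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Succeed when both files contain the same number of sequence characters.
same_length() {
  [ "$(fasta_length "$1")" -eq "$(fasta_length "$2")" ]
}
```

For example, `same_length reference.fasta root_all.fasta && echo "lengths match"` (both file names are placeholders). A real check of coordinate compatibility would still need the pairwise alignment described above.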
Implemented a combination of the following commits:
* Use wildcard for optional flag: https://github.com/nextstrain/mpox/pull/254/files#r1616298870
* Use refine flags from mpox: https://github.com/nextstrain/mpox/blob/7f5adc3cab0c3034e455882581cc081f26f5ebbe/nextclade/Snakefile#L338-L342
8909cea to 1e6b119
The genotype-level datasets require further improvement to meet the desired standards. However, the serotype-level dataset is functioning as expected. To lock in the progress made on the serotype-level dataset, I propose merging these changes and moving the work on the genotype-level datasets to a new PR. This keeps completed work separate from work that still needs refinement.
Missed updating the pathogen.json during an earlier commit: 61c6c65
```
cat phylogenetic/auspice/dengue_denv1_genome.json \
  | tr '{' '\n' \
  | tr ',' '\n' \
  | grep -A1 root_sequence \
  | sed 's/ "root_sequence":/>root_sequence_denv1/g' \
  | sed 's/"nuc": "//g' \
  | sed 's/"//g' \
  > root_denv1.fasta
```
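Because the text-based extraction above depends on how the JSON happens to be formatted, it may be worth checking the resulting FASTA before using it. The sketch below is one possible check, not part of the original commit; the function name `check_root_fasta` is hypothetical, and the allowed alphabet (IUPAC nucleotide codes plus `-`) is an assumption about what a reconstructed root should contain.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): verify that a file looks like a single-record
# nucleotide FASTA with no leftover JSON punctuation or stray whitespace.
check_root_fasta() {
  local fasta="$1"
  # Exactly one header line starting with ">".
  [ "$(grep -c '^>' "$fasta")" -eq 1 ] || return 1
  # At least one non-header line containing sequence letters.
  grep -v '^>' "$fasta" | grep -q '[A-Za-z]' || return 1
  # No non-header line may contain a character outside the assumed alphabet.
  ! grep -v '^>' "$fasta" | grep -q '[^ACGTUNRYSWKMBDHVacgtunryswkmbdhv-]'
}
```

Running `check_root_fasta root_denv1.fasta || echo "unexpected content"` would flag leftover quotes, leading spaces, or a missing header.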
6b79e5f to a42bd91
* Use reconstructed roots for serotype-level and genotype-level datasets
* Update the all dataset with root and gap penalty
* Update the dengue/all dataset README.md file
a42bd91 to 212f411
Description of proposed changes
When testing the `dengue/all` (serotype-level) dataset for accuracy, multiple people noticed a trend of false-positive DENV4 classifications. This PR mostly fixes that.

The `dengue/all` dataset was improved by: … `all` tree.

This fix was inspired by multiple sources of feedback and by the mpox codebase.
Related issue(s)
Checklist