Fine tuning the Nextclade all dataset #58

j23414 · 2024-05-30T22:19:39Z

Description of proposed changes

When testing the dengue/all (serotype-level) dataset for accuracy, several people noticed a pattern of false-positive DENV4 classifications. This PR mostly fixes that.

The dengue/all dataset was improved by:

* Adding "--penalty-gap-*" alignment parameters to favor contiguous alignments over gappy ones
* Adding a reconstructed root to the all tree
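
One rough way to check whether the gap penalties have the intended effect is to count gap runs in the aligned output before and after the change. The sketch below is an illustration only and is not part of this PR: `count_gap_runs` is a hypothetical helper, and it assumes an aligned FASTA file in which gaps are written as `-`.

```shell
# Hypothetical helper (not from this PR): summarize gaps per aligned sequence.
# Prints: sequence ID, number of gap runs, total gap characters (tab-separated).
count_gap_runs() {
  awk '
    function report() {
      if (id != "") {
        runs = gsub(/-+/, "&", seq)      # each maximal run of "-" counts once
        tmp = seq
        total = gsub(/-/, "", tmp)       # every single gap character
        printf "%s\t%d\t%d\n", id, runs, total
      }
    }
    /^>/ { report(); id = substr($1, 2); seq = ""; next }
    { seq = seq $0 }                     # join wrapped sequence lines
    END { report() }
  ' "$1"
}

# Example call (the file name is an assumption):
#   count_gap_runs aligned.fasta
```

Many short gap runs in one sequence suggest a gappy alignment; a contiguous alignment should show few runs.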

This fix was inspired by multiple sources of feedback and by the mpox codebase.

Related issue(s)

Add workflow for producing the Nextclade dengue dataset #21

Checklist

Checks pass
The serotype assignment should mostly match the NCBI GenBank annotations
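
To put a number on "mostly match", the assigned serotypes can be compared against the annotated ones. This is a minimal sketch under stated assumptions, not the evaluation used in this PR: `serotype_concordance` is a hypothetical helper, and both inputs are assumed to be header-less, two-column TSV files (accession, serotype) prepared beforehand.

```shell
# Hypothetical helper (not from this PR): compare annotated vs. assigned serotypes.
# $1 = TSV of accession<TAB>annotated serotype (assumed layout, no header)
# $2 = TSV of accession<TAB>assigned serotype  (assumed layout, no header)
serotype_concordance() {
  awk -F'\t' '
    BEGIN { OFS = "\t" }
    NR == FNR { annotated[$1] = $2; next }   # first file: load annotations
    ($1 in annotated) {                      # second file: shared accessions only
      total++
      if (annotated[$1] == $2) match_count++
      else print "mismatch", $1, annotated[$1], $2
    }
    END { printf "matched %d of %d\n", match_count, total }
  ' "$1" "$2"
}

# Example call (file names are assumptions):
#   serotype_concordance genbank_serotypes.tsv nextclade_serotypes.tsv
```

Each mismatch line lists the accession, the annotated serotype, and the assigned serotype, which makes it easier to spot records whose annotation may need a manual fix.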

j23414 · 2024-06-03T20:38:56Z

Because I'm seeing similar issues with the genotype-level datasets (e.g. denv4), I'm going to either expand the scope of this PR to also fix DENV1-4 or split those out into separate PRs.

I'll start by adding outgroups/reconstructed ancestral sequences for each genotype-level dataset, perhaps as suggested by: nextstrain/nextclade_data#203 (comment)

Manually fix serotype annotations for MZ284953, MZ285732, MZ285058, and MW332572, which were flagged during nextclade (all) serotype testing: #58 (comment)

Add comments to the annotations.tsv file to explain the changes.

j23414 · 2024-06-04T17:20:37Z

Incorporated the edits and summarized them in nextstrain/nextclade_data#203 (comment).
This is ready for review and dataset evaluation.

Manually fix serotype annotations for GenBank samples that were flagged during nextclade (all) serotype testing: #58 (comment)

Add comments to the annotations.tsv file to explain the changes.

This grabs root sequence from https://nextstrain.org/dengue/all/genome and swaps it in for the DENV4 reference used for the all serotypes tree.

Alignment of new root to original reference contained no gaps, so just swapping in the new root sequence ID should be sufficient. The GFF is going to be used in augur translate to annotate the tree, so it needs to match the root sequence.
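
Because the coordinates are unchanged, the GFF update amounts to renaming the sequence ID. A minimal sketch of that rename follows; it is not the actual commit (which may have been edited by hand), `swap_gff_seqid` is a hypothetical helper, and the IDs in the example call are placeholders.

```shell
# Hypothetical helper (not from this PR): rename the sequence ID in a GFF3 file.
# $1 = input GFF, $2 = old sequence ID, $3 = new sequence ID
swap_gff_seqid() {
  awk -v old="$2" -v new="$3" '
    BEGIN { FS = OFS = "\t" }
    /^##sequence-region/ {
      # This directive is space-separated: ##sequence-region <seqid> <start> <end>
      n = split($0, f, " ")
      if (f[2] == old) {
        f[2] = new
        line = f[1]
        for (i = 2; i <= n; i++) line = line " " f[i]
        print line
        next
      }
    }
    /^#/ { print; next }          # leave other comments and directives untouched
    $1 == old { $1 = new }        # column 1 of a feature line is the sequence ID
    { print }
  ' "$1"
}

# Example call (IDs and file names are placeholders):
#   swap_gff_seqid genome_annotation.gff3 OLD_REF NEW_ROOT > updated.gff3
```

This only holds while the new root and the old reference share coordinates; if the alignment had contained gaps, the feature start and end positions would also need adjusting.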

Implemented a combination of the following commits:

* Use wildcard for optional flag: https://github.com/nextstrain/mpox/pull/254/files#r1616298870
* Use refine flags from mpox: https://github.com/nextstrain/mpox/blob/7f5adc3cab0c3034e455882581cc081f26f5ebbe/nextclade/Snakefile#L338-L342

j23414 · 2024-06-05T17:49:47Z

The genotype-level datasets require further improvement to meet the desired standards. However, the serotype-level dataset is functioning as expected.

To solidify the progress made with the serotype-level dataset, I propose merging these changes now and shifting the focus to improving the genotype-level datasets in a new PR.

This approach helps me avoid mixing completed tasks with those that still need refinement.

Missed updating the pathogen.json during earlier commit: 61c6c65

```
cat phylogenetic/auspice/dengue_denv1_genome.json \
  | tr '{' '\n' \
  | tr ',' '\n' \
  | grep -A1 root_sequence \
  | sed 's/ "root_sequence":/>root_sequence_denv1/g' \
  | sed 's/"nuc": "//g' \
  | sed 's/"//g' \
  > root_denv1.fasta
```
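
The same text-based extraction could be wrapped in a function and repeated for each serotype. This is a sketch, not the commands actually run: `extract_root` is a hypothetical helper, the denv2 to denv4 paths are extrapolated from the denv1 path above, and the approach assumes a single-line Auspice JSON in which "nuc" is the first key inside "root_sequence".

```shell
# Hypothetical helper (not from this PR): pull the root nucleotide sequence
# out of an Auspice JSON as FASTA, using the same text-splitting idea as above.
# $1 = Auspice JSON path, $2 = label used in the FASTA header
extract_root() {
  tr '{,' '\n\n' < "$1" \
    | grep -A1 '"root_sequence"' \
    | sed -e "s/^.*\"root_sequence\":.*/>root_sequence_$2/" \
          -e 's/^ *"nuc": *"//' \
          -e 's/".*$//'
}

# Paths for denv2-denv4 are assumed to follow the denv1 naming pattern.
for serotype in denv1 denv2 denv3 denv4; do
  json="phylogenetic/auspice/dengue_${serotype}_genome.json"
  [ -f "$json" ] || continue     # skip datasets that have not been built
  extract_root "$json" "$serotype" > "root_${serotype}.fasta"
done
```

Splitting JSON with text tools is fragile: it depends on the key order and whitespace of the exported file, so a proper JSON parser would be the safer choice if the file layout changes.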

* Use reconstructed roots for serotype-level and genotype-level datasets
* Update the all dataset with root and gap penalty
* Update the dengue/all dataset README.md file

j23414 mentioned this pull request May 30, 2024

Add workflow for producing the Nextclade dengue dataset #25

Merged

2 tasks

j23414 force-pushed the nextclade-all-dataset-tuning branch 4 times, most recently from 6913add to c029f1d Compare June 3, 2024 16:53

This was referenced Jun 4, 2024

Hotfix: Infer ancestral root in the phylogenetic workflow #61

Merged

Manually fix serotype mis-annotations #62

Merged

j23414 marked this pull request as ready for review June 4, 2024 17:21

trvrb and others added 4 commits June 4, 2024 15:04

Swap nextclade dengue/all reference to root sequence

82007bc


Match gff to root sequence

9a3f203


Use reference and gff for the all dataset

458137c

j23414 force-pushed the nextclade-all-dataset-tuning branch from 8909cea to 1e6b119 Compare June 4, 2024 22:05

j23414 added 7 commits June 5, 2024 11:40

Penalize gaps for the all dataset

108b8ad

Fill in serotype-level README

6ab9a35

fixup: add example sequences

4722cbd


Swap nextclade/denv* references to root sequence

486d128


Add QC for frameshift and stop codon

43eefe2

Penalize gaps for denvX dataset

98fed9e

Use the inferred root as the reference

4994e89

j23414 force-pushed the nextclade-all-dataset-tuning branch from 6b79e5f to a42bd91 Compare June 5, 2024 18:41

fixup: update dataset to incorporate fixes

212f411


j23414 force-pushed the nextclade-all-dataset-tuning branch from a42bd91 to 212f411 Compare June 5, 2024 18:43

j23414 requested a review from a team June 5, 2024 22:06

j23414 merged commit 9bd013f into main Jun 5, 2024
32 checks passed

j23414 deleted the nextclade-all-dataset-tuning branch June 5, 2024 22:46

j23414 mentioned this pull request Jun 18, 2024

Setting an outgroup for the "Dengue virus DENVx genotypes" dataset #67

Merged

2 tasks
