Fine tuning the Nextclade all dataset #58

Merged: 12 commits, Jun 5, 2024

@j23414 j23414 commented May 30, 2024

Description of proposed changes

While testing the dengue/all (serotype-level) dataset for accuracy, several people noticed a pattern of false-positive DENV4 classifications. This PR fixes most of them.

Screenshot 2024-05-31 at 10 36 55 AM

The dengue/all dataset was improved by:

  1. Adding some "--penalty-gap-*" settings to favor contiguous alignments over gappy ones
  2. Adding a reconstructed root to the all tree

Screenshot 2024-05-31 at 10 42 01 AM

This fix was inspired by feedback from multiple sources and by the mpox codebase.
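As a rough illustration of the gap-penalty item above, the sketch below assembles a `nextclade run` command with explicit gap penalties and prints it instead of running it. The `--penalty-gap-open` and `--penalty-gap-extend` flags are Nextclade CLI options; the numeric values, the dataset directory, and the file names are placeholders chosen for illustration, not the settings used in this PR.

```shell
#!/bin/sh
# Sketch only: values and paths below are hypothetical placeholders,
# not the settings chosen in this PR.
GAP_OPEN=20                    # hypothetical gap-open penalty
GAP_EXTEND=1                   # hypothetical gap-extend penalty
DATASET_DIR="dataset/all"      # hypothetical dataset directory
SEQUENCES="sequences.fasta"    # hypothetical input file

# A higher gap-open penalty makes the aligner prefer fewer, longer gaps,
# which favors one contiguous alignment over many short gapped segments.
NEXTCLADE_CMD="nextclade run \
 --input-dataset $DATASET_DIR \
 --penalty-gap-open $GAP_OPEN \
 --penalty-gap-extend $GAP_EXTEND \
 --output-tsv nextclade.tsv \
 $SEQUENCES"

# Print the command for review rather than executing it.
echo "$NEXTCLADE_CMD"
```

Printing the command first makes it easy to compare a few penalty settings side by side before committing one to the dataset configuration.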

Related issue(s)

Checklist

  • Checks pass
  • The serotype assignment should mostly match the NCBI GenBank annotations
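
One way to check the second item is to tabulate where the Nextclade assignment disagrees with the GenBank annotation. The sketch below uses a toy table with made-up accessions and an assumed three-column layout (accession, GenBank serotype, Nextclade serotype); the real testing output may be organized differently.

```shell
#!/bin/sh
WORKDIR=$(mktemp -d)
TABLE="$WORKDIR/toy_serotypes.tsv"

# Toy input: hypothetical accessions and a hypothetical column layout.
printf 'accession\tgenbank_serotype\tnextclade_serotype\n' >  "$TABLE"
printf 'TOY0001\tDENV1\tDENV1\n'                           >> "$TABLE"
printf 'TOY0002\tDENV2\tDENV4\n'                           >> "$TABLE"
printf 'TOY0003\tDENV4\tDENV4\n'                           >> "$TABLE"

# Count data rows (skipping the header) and rows where the two calls differ.
TOTAL=$(awk 'NR > 1' "$TABLE" | wc -l | tr -d ' ')
MISMATCHES=$(awk -F '\t' 'NR > 1 && $2 != $3 { n++ } END { print n + 0 }' "$TABLE")

echo "$MISMATCHES of $TOTAL assignments disagree with the GenBank annotation"

# List the disagreeing records so they can be reviewed by hand.
awk -F '\t' 'NR > 1 && $2 != $3 { print $1 ": GenBank=" $2 ", Nextclade=" $3 }' "$TABLE"

rm -rf "$WORKDIR"
```

A disagreement does not say which side is wrong: it can point to a Nextclade misclassification or to a GenBank annotation that needs a manual fix.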

@j23414 j23414 force-pushed the nextclade-all-dataset-tuning branch 4 times, most recently from 6913add to c029f1d Compare June 3, 2024 16:53

j23414 commented Jun 3, 2024

Because the genotype-level datasets (e.g. denv4) show similar issues, I'm going to either expand the scope of this PR to also fix DENV1-4 or split those fixes out into separate PRs.

Screenshot 2024-06-03 at 1 33 41 PM

I'll start by adding outgroups/reconstructed ancestral sequences for each genotype-level dataset, perhaps as suggested by: nextstrain/nextclade_data#203 (comment)

j23414 added a commit that referenced this pull request Jun 4, 2024
Manually fix serotype annotations for MZ284953, MZ285732, MZ285058, MW332572
that were flagged during nextclade (all) serotype testing.

#58 (comment)

Add comments to the annotations.tsv file to explain the changes.

j23414 commented Jun 4, 2024

Incorporated the edits and summarized them in nextstrain/nextclade_data#203 (comment)
This is ready for review and dataset evaluation.

@j23414 j23414 marked this pull request as ready for review June 4, 2024 17:21
j23414 added a commit that referenced this pull request Jun 4, 2024
Manually fix serotype annotations for genbank samples that were flagged
during nextclade (all) serotype testing.

#58 (comment)

Add comments to the annotations.tsv file to explain the changes.
trvrb and others added 4 commits June 4, 2024 15:04
This grabs root sequence from https://nextstrain.org/dengue/all/genome and swaps it in for the DENV4 reference used for the all serotypes tree.
Alignment of new root to original reference contained no gaps, so just swapping
in the new root sequence ID should be sufficient. The GFF is going to be used in
augur translate to annotate the tree, so it needs to match the root sequence.
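
The requirement that the GFF match the root sequence can be checked mechanically. The sketch below compares the ID in a FASTA header with the seqid column of a GFF3 file, using toy files and a hypothetical sequence ID (`toy_root`); the file names are placeholders rather than paths from this repository.

```shell
#!/bin/sh
WORKDIR=$(mktemp -d)

# Toy files with a hypothetical sequence ID and a single toy gene.
printf '>toy_root\nATGAACAACCAA\n' > "$WORKDIR/reference.fasta"
printf '##gff-version 3\ntoy_root\ttoy\tgene\t1\t12\t.\t+\t.\tgene=toyGene\n' \
  > "$WORKDIR/genome_annotation.gff3"

# ID of the first FASTA record: text after '>' up to the first whitespace.
FASTA_ID=$(awk '/^>/ { sub(/^>/, "", $1); print $1; exit }' "$WORKDIR/reference.fasta")

# Distinct seqid values (column 1) of the GFF feature lines, skipping comments.
GFF_IDS=$(awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' "$WORKDIR/genome_annotation.gff3" | sort -u)

if [ "$GFF_IDS" = "$FASTA_ID" ]; then
  echo "OK: GFF seqid matches the root sequence ID ($FASTA_ID)"
else
  echo "Mismatch: FASTA ID is '$FASTA_ID' but the GFF uses: $GFF_IDS" >&2
fi

rm -rf "$WORKDIR"
```

This only checks the identifier. It does not confirm that the feature coordinates still fit the sequence, which here rests on the gap-free alignment described in the commit message.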
@j23414 j23414 force-pushed the nextclade-all-dataset-tuning branch from 8909cea to 1e6b119 Compare June 4, 2024 22:05

j23414 commented Jun 5, 2024

The genotype-level datasets require further improvement to meet the desired standards. However, the serotype-level dataset is functioning as expected.

To lock in the progress made on the serotype-level dataset, I propose merging these changes now and moving the work on the genotype-level datasets to a new PR.

This approach helps me avoid mixing completed tasks with those that still need refinement.

j23414 added 7 commits June 5, 2024 11:40
Missed updating the pathogen.json during earlier commit:

61c6c65
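```
# Extract the reconstructed root sequence from the Auspice JSON as FASTA.
# This works on raw text, so it assumes single-line (minified) JSON in which
# "nuc" directly follows "root_sequence" once the text is split on '{' and ','.
tr '{' '\n' < phylogenetic/auspice/dengue_denv1_genome.json \
  | tr ',' '\n' \
  | grep -A1 root_sequence \
  | sed 's/ "root_sequence":/>root_sequence_denv1/g' \
  | sed 's/"nuc": "//g' \
  | sed 's/"//g' \
  > root_denv1.fasta
```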
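
Because the pipeline above edits raw text rather than parsing the JSON, a quick sanity check of the resulting FASTA can catch a malformed extraction. The sketch below runs on a toy stand-in for `root_denv1.fasta` with a made-up short sequence. It also assumes that only A, C, G, T, N, and gap characters are expected, which may be too strict if the reconstructed root carries other ambiguity codes.

```shell
#!/bin/sh
WORKDIR=$(mktemp -d)
FASTA="$WORKDIR/root_toy.fasta"

# Toy stand-in for root_denv1.fasta (hypothetical 10-base sequence).
printf '>root_sequence_denv1\nACGTNNACGT\n' > "$FASTA"

# Exactly one header line is expected.
N_HEADERS=$(grep -c '^>' "$FASTA")

# Count sequence lines containing anything other than A, C, G, T, N or '-'.
# grep -c exits non-zero when the count is 0, hence the '|| true'.
N_BAD=$(grep -v '^>' "$FASTA" | grep -c '[^ACGTNacgtn-]' || true)

# Total sequence length, ignoring line breaks.
SEQ_LEN=$(grep -v '^>' "$FASTA" | tr -d '\n' | wc -c | tr -d ' ')

echo "headers=$N_HEADERS unexpected_lines=$N_BAD length=$SEQ_LEN"

rm -rf "$WORKDIR"
```

Leftover JSON punctuation such as a stray `}` would show up as a non-zero `unexpected_lines` count, and an empty extraction as `length=0`.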
@j23414 j23414 force-pushed the nextclade-all-dataset-tuning branch from 6b79e5f to a42bd91 Compare June 5, 2024 18:41
* Use reconstructed roots for serotype-level and genotype-level datasets
* Update the all dataset with root and gap penalty
* Update the dengue/all dataset README.md file
@j23414 j23414 force-pushed the nextclade-all-dataset-tuning branch from a42bd91 to 212f411 Compare June 5, 2024 18:43
@j23414 j23414 requested a review from a team June 5, 2024 22:06
@j23414 j23414 merged commit 9bd013f into main Jun 5, 2024
32 checks passed
@j23414 j23414 deleted the nextclade-all-dataset-tuning branch June 5, 2024 22:46