Add yellow fever virus dataset #220

Merged
merged 12 commits into from
Oct 18, 2024
3 changes: 2 additions & 1 deletion data/nextstrain/collection.json
@@ -55,6 +55,7 @@
"nextstrain/flu/h3n2/pb2",
"nextstrain/measles",
"nextstrain/measles/N450/WHO-2012",
-"nextstrain/dengue/all"
+"nextstrain/dengue/all",
+"nextstrain/yellow-fever/prM-E"
]
}
3 changes: 3 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/CHANGELOG.md
@@ -0,0 +1,3 @@
## Unreleased

Initial release of yellow fever virus dataset.
60 changes: 60 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/README.md
@@ -0,0 +1,60 @@
# Yellow fever virus dataset

| Key | Value |
| ----------------- | -----------------------------------------------------------------|
| name | Yellow fever virus (YFV) prM-E region |
| authors | [Nextstrain](https://nextstrain.org) |
| reference | AY640589.1 |
| workflow | <https://github.com/nextstrain/yellow-fever/tree/main/nextclade> |
| path | `nextstrain/yellow-fever/prM-E` |

## Scope of this dataset

This dataset assigns clades to yellow fever virus samples based on
strain and genotype information from [Mutebi et al.][] (J Virol. 2001
Aug;75(15):6999-7008) and [Bryant et al.][] (PLoS Pathog. 2007 May 18;3(5):e75).

These two papers together define 7 distinct yellow fever virus
genotypes based on a 670-nucleotide region of the yellow fever virus
genome (bases 641-1310), called the prM-E region. This region
comprises the 3' end of the pre-membrane protein (prM) gene, the
entire membrane protein (M) gene, and the 5' end of the envelope
protein (E) gene.

The clades we annotate (Clade I-VII) correspond roughly to the
following genotypes, as described in the two papers cited above:

| Clade | Genotype |
|-----------|---------------------|
| Clade I | Angola |
| Clade II | East Africa |
| Clade III | East/Central Africa |
| Clade IV | West Africa I |
| Clade V | West Africa II |
| Clade VI | South America I |
| Clade VII | South America II |

(N.b., the reference sequence used in this dataset is actually 672 nt
long, covering bases 641-1312 of the genome reference. The 2 extra
bases make the reference a complete open reading frame.)
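
The arithmetic behind this note can be checked with a short sketch. The helper name `region_length` is hypothetical and not part of the dataset; the coordinates are the 1-based, inclusive ranges quoted above.

```python
# Illustrative helper (not part of the dataset):
# length of a 1-based, inclusive coordinate range.
def region_length(start: int, end: int) -> int:
    return end - start + 1


published = region_length(641, 1310)  # region as defined in the two papers
reference = region_length(641, 1312)  # region used for this dataset's reference

# A length that is a multiple of 3 can be read as whole codons.
print(published, published % 3)  # 670 nt, remainder 1
print(reference, reference % 3)  # 672 nt, remainder 0 (224 codons)
```

The check covers length only; it does not verify that the frame is free of stop codons.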

This dataset can be used to assign genotypes to any sequence that
includes at least 500 bp of the prM-E region, including whole genome
sequences. Sequence data beyond the prM-E region will be reported as an
insertion in the Nextclade output.
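
As a rough pre-check before running Nextclade, input sequences can be screened against the 500 bp minimum. The sketch below is an assumption-laden illustration: `parse_fasta` and `split_by_length` are hypothetical helpers, and they look at total sequence length only, not at how much of the prM-E region a sequence actually covers (that is determined by Nextclade's alignment).

```python
from typing import Dict, Iterable, List, Tuple

MIN_LENGTH = 500  # mirrors the 500 bp minimum stated above


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_by_length(
    records: Dict[str, str], min_length: int = MIN_LENGTH
) -> Tuple[List[str], List[str]]:
    """Return (names meeting the minimum length, names below it)."""
    long_enough = [n for n, s in records.items() if len(s) >= min_length]
    too_short = [n for n, s in records.items() if len(s) < min_length]
    return long_enough, too_short


# Made-up example input: one 600 nt and one 400 nt sequence.
example = [">sample_a", "ACGT" * 150, ">sample_b", "ACGT" * 100]
keep, skip = split_by_length(parse_fasta(example))
print(keep, skip)
```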

## Features

This dataset supports:

- Assignment of genotypes
- Phylogenetic placement
- Sequence quality control (QC)

## What are Nextclade datasets

Read more about Nextclade datasets in the Nextclade documentation:
<https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html>

[Mutebi et al.]: https://pubmed.ncbi.nlm.nih.gov/11435580/
[Bryant et al.]: https://pubmed.ncbi.nlm.nih.gov/17511518/
5 changes: 5 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/genome_annotation.gff3
@@ -0,0 +1,5 @@
##sequence-region prM-E 1 672
NC_002031.1 feature source 1 672 . + . gene=nuc
NC_002031.1 feature gene 1 333 . + . gene_name=prM
NC_002031.1 feature gene 109 333 . + . gene_name=M
NC_002031.1 feature gene 334 672 . + . gene_name=E
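
The feature coordinates above are local to the 672 nt reference. A minimal sketch of how they map back to genome positions, assuming local base 1 corresponds to genome base 641 as the README states; the names `REGION_START`, `to_genome`, and `describe` are hypothetical.

```python
REGION_START = 641  # genome position of local base 1 (README: bases 641-1312)

# (gene name, local start, local end), copied from genome_annotation.gff3
FEATURES = [("prM", 1, 333), ("M", 109, 333), ("E", 334, 672)]


def to_genome(local_pos: int, region_start: int = REGION_START) -> int:
    """Convert a 1-based local position to a 1-based genome position."""
    return local_pos + region_start - 1


def describe(features):
    """Return (name, genome start, genome end, length, length divisible by 3)."""
    rows = []
    for name, start, end in features:
        length = end - start + 1
        rows.append(
            (name, to_genome(start), to_genome(end), length, length % 3 == 0)
        )
    return rows


for row in describe(FEATURES):
    print(row)
```

Under that assumption, every annotated feature spans a whole number of codons, and the last feature ends at genome base 1312, the end of the reference region.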
52 changes: 52 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/pathogen.json
@@ -0,0 +1,52 @@
{
"files": {
"reference": "reference.fasta",
"pathogenJson": "pathogen.json",
"genomeAnnotation": "genome_annotation.gff3",
"treeJson": "tree.json",
"examples": "sequences.fasta",
"readme": "README.md",
"changelog": "CHANGELOG.md"
},
"attributes": {
"name": "Yellow fever virus (YFV) prM-E region",
"reference name": "Asibi",
"reference accession": "AY640589.1"
},
"schemaVersion": "3.0.0",
"alignmentParams": {
"minSeedCover": 0.01,
"minLength": 500
},
"qc": {
"missingData": {
"enabled": true,
"missingDataThreshold": 20,
"scoreBias": 4
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 4
},
"frameShifts": {
"enabled": true
},
"stopCodons": {
"enabled": true
},
"privateMutations": {
"enabled": true,
"cutoff": 8,
"typical": 2,
"weightLabeledSubstitutions": 1,
"weightReversionSubstitutions": 1,
"weightUnlabeledSubstitutions": 1
},
"snpClusters": {
"enabled": true,
"clusterCutOff": 3,
"scoreWeight": 50,
"windowSize": 50
}
}
}
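
To get a feel for what the `missingData` and `privateMutations` thresholds above imply, here is a rough sketch. It assumes a linear scoring form (zero at the bias or typical value, 100 once the threshold or cutoff is exceeded by that margin), which should be checked against the Nextclade QC documentation; the function names are hypothetical and are not Nextclade's API.

```python
# Values copied from the "qc" section of pathogen.json above.
QC = {
    "missingData": {"missingDataThreshold": 20, "scoreBias": 4},
    "privateMutations": {"cutoff": 8, "typical": 2},
}


def missing_data_score(n_missing: int, cfg=QC["missingData"]) -> float:
    # ASSUMPTION: linear score, 0 at scoreBias,
    # 100 at scoreBias + missingDataThreshold.
    raw = (n_missing - cfg["scoreBias"]) * 100.0 / cfg["missingDataThreshold"]
    return max(0.0, raw)


def private_mutations_score(weighted_total: float, cfg=QC["privateMutations"]) -> float:
    # ASSUMPTION: linear score, 0 at "typical", 100 at typical + cutoff.
    return max(0.0, weighted_total - cfg["typical"]) * 100.0 / cfg["cutoff"]


for n in (4, 14, 24):
    print("missing sites:", n, "score:", missing_data_score(n))
for m in (2, 6, 10):
    print("private mutations:", m, "score:", private_mutations_score(m))
```

Read this way, a sequence would reach a score of 100 at 24 missing sites or at a weighted total of 10 private mutations, which is strict but plausible for a 672 nt reference.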
13 changes: 13 additions & 0 deletions data/nextstrain/yellow-fever/prM-E/reference.fasta
@@ -0,0 +1,13 @@
> prM-E region (genome 641-1312, 672 nt)
CCAAGAGAGGAGCCAGATGACATTGATTGCTGGTGCTATGGGGTGGAAAACGTTAGAGTC
GCATATGGTAAGTGTGACTCAGCAGGCAGGTCTAGGAGGTCAAGAAGGGCCATTGACTTG
CCTACGCATGAAAACCATGGTTTGAAGACCCGGCAAGAAAAATGGATGACTGGAAGAATG
GGTGAAAGGCAACTCCAAAAGATTGAGAGATGGCTCGTGAGGAACCCCTTTTTTGCAGTG
ACAGCTCTGACCATTGCCTACCTTGTGGGAAGCAACATGACGCAACGAGTCGTGATTGCC
CTACTGGTCTTGGCTGTTGGTCCGGCCTACTCAGCTCACTGCATTGGAATTACTGACAGG
GATTTCATTGAGGGGGTGCATGGAGGAACTTGGGTTTCAGCTACCCTGGAGCAAGACAAG
TGTGTCACTGTTATGGCCCCTGACAAGCCTTCATTGGACATCTCACTAGAGACAGTAGCC
ATTGATGGACCTGCTGAGGCGAGGAAAGTGTGTTACAATGCAGTTCTCACTCATGTGAAG
ATTAATGACAAGTGCCCCAGCACTGGAGAGGCCCACCTAGCTGAAGAGAACGAAGGGGAC
AATGCGTGCAAGCGCACTTATTCTGATAGAGGCTGGGGCAATGGCTGTGGCCTATTTGGG
AAAGGGAGCATT