Add nextclade workflow [#2]
genehack committed Aug 5, 2024
1 parent ff682b1 commit 165eac6
Showing 20 changed files with 6,569 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -3,6 +3,13 @@ ingest/benchmarks
ingest/data
ingest/logs
ingest/results
nextclade/auspice
nextclade/benchmarks
nextclade/data
nextclade/dataset
nextclade/logs
nextclade/results
nextclade/test_output
phylogenetic/auspice
phylogenetic/benchmarks
phylogenetic/logs
9 changes: 9 additions & 0 deletions .markdownlintrc
@@ -0,0 +1,9 @@
{
  // long lines are okay
  "MD013": {
    "line_length": 100,
    "tables": false
  },
  // don't require top-level heading on L1
  "MD041": false
}
29 changes: 29 additions & 0 deletions nextclade/README.md
@@ -0,0 +1,29 @@
# Yellow Fever Virus Nextclade Dataset Tree

This workflow creates a phylogenetic tree that can be used as part of
a Nextclade dataset to assign genotypes to yellow fever virus samples
based on [Mutebi et al.][] (J Virol. 2001 Aug;75(15):6999-7008) and
[Bryant et al.][] (PLoS Pathog. 2007 May 18;3(5):e75).

* Build a tree using samples from the `ingest` output, with the following
  sampling criteria:
  * Force-include the following samples:
    * genotype reference strains from the 2 papers cited above
* Assign genotypes to each sample and internal nodes of the tree with
  `augur clades`, using clade-defining mutations in `defaults/clades.tsv`
* Provide the following coloring options on the tree:
  * Genotype assignment from `augur clades`

## How to create a new tree

* Run the workflow: `nextstrain build .`
* Inspect the output tree by comparing genotype assignments from the following sources:
  * `augur clades` output
* If unwanted samples are present in the tree, add them to
  `defaults/dropped_strains.tsv` and re-run the workflow
* If any changes are needed to the clade-defining mutations, add
  changes to `defaults/clades.tsv` and re-run the workflow
* Repeat as needed

[Mutebi et al.]: https://pubmed.ncbi.nlm.nih.gov/11435580/
[Bryant et al.]: https://pubmed.ncbi.nlm.nih.gov/17511518/
30 changes: 30 additions & 0 deletions nextclade/Snakefile
@@ -0,0 +1,30 @@
configfile: "defaults/config.yaml"

rule all:
    input:
        auspice_json = config["files"]["auspice_json"],
        nextclade_dataset = "dataset/tree.json",
        test_dataset = "test_output",

include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
include: "rules/assemble_dataset.smk"

rule clean:
    params:
        targets = [
            ".snakemake",
            "auspice",
            "benchmarks",
            "data",
            "dataset",
            "logs",
            "results",
            "test_output",
        ]
    shell:
        """
        rm -rfv {params.targets}
        """
46 changes: 46 additions & 0 deletions nextclade/defaults/auspice_config.json
@@ -0,0 +1,46 @@
{
  "title": "Real-time tracking of yellow fever virus full genome virus evolution",
  "maintainers": [
    {"name": "John SJ Anderson", "url": "https://bedford.io/team/john-sj-anderson/"},
    {"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
  ],
  "data_provenance": [
    {
      "name": "GenBank",
      "url": "https://www.ncbi.nlm.nih.gov/genbank/"
    }
  ],
  "build_url": "https://github.com/nextstrain/yellow-fever",
  "colorings": [
    {
      "key": "gt",
      "title": "Genotype",
      "type": "categorical"
    },
    {
      "key": "region",
      "title": "Region",
      "type": "categorical"
    },
    {
      "key": "country",
      "title": "Country",
      "type": "categorical"
    }
  ],
  "geo_resolutions": [
    "country",
    "region"
  ],
  "display_defaults": {
    "map_triplicate": true,
    "color_by": "clade_membership"
  },
  "filters": [
    "clade_membership",
    "region",
    "country",
    "author",
    "host"
  ]
}
40 changes: 40 additions & 0 deletions nextclade/defaults/clades.tsv
@@ -0,0 +1,40 @@
clade gene site alt
Angola nuc 111 G
Angola nuc 219 T
Angola nuc 240 C
Angola nuc 246 A
Angola nuc 252 A
Angola nuc 255 A
Angola nuc 291 G
Angola nuc 294 A
Angola nuc 300 A
Angola nuc 315 G
Angola nuc 327 G
Angola nuc 372 A
Angola nuc 420 A
Angola nuc 432 A
Angola nuc 453 T
Angola nuc 492 G
Angola nuc 651 T
Angola nuc 72 A
Angola nuc 81 G
Angola nuc 88 C
Angola nuc 90 A
Angola nuc 99 T
East Africa nuc 171 G
East Africa nuc 438 G
East Africa nuc 45 A
East Africa nuc 468 T
East/Central Africa nuc 228 G
South America I nuc 219 A
South America I nuc 532 A
South America II nuc 114 C
South America II nuc 193 T
South America II nuc 249 A
South America II nuc 639 G
West Africa I nuc 183 G
West Africa I nuc 255 C
West Africa II nuc 270 A
West Africa II nuc 321 T
West Africa II nuc 477 A
West Africa II nuc 93 T
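To make the table format concrete, here is a minimal Python sketch that reads rows in the `clade / gene / site / alt` layout and checks a single aligned sequence against them. It is an illustration only: `load_clade_definitions` and `matching_clades` are hypothetical helpers, the toy sequence is invented, and the flat per-sequence check is a simplification rather than the method `augur clades` uses (per the README, that command annotates samples and internal nodes of the tree). The sketch also assumes the real file is tab-separated and that sites are 1-based positions in the 672-nt prM-E reference.

```python
import csv
import io
from collections import defaultdict

# Four rows copied from defaults/clades.tsv; tab separation is an assumption.
CLADES_TSV = (
    "clade\tgene\tsite\talt\n"
    "South America I\tnuc\t219\tA\n"
    "South America I\tnuc\t532\tA\n"
    "West Africa I\tnuc\t183\tG\n"
    "West Africa I\tnuc\t255\tC\n"
)


def load_clade_definitions(handle):
    """Group (site, alt) pairs by clade name; sites are treated as 1-based."""
    definitions = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        definitions[row["clade"]].append((int(row["site"]), row["alt"]))
    return dict(definitions)


def matching_clades(sequence, definitions):
    """Return clades for which every listed site carries the listed base.

    Hypothetical helper: a flat check on one sequence, not the tree-based
    assignment performed by `augur clades`.
    """
    hits = []
    for clade, mutations in definitions.items():
        if all(
            len(sequence) >= site and sequence[site - 1] == alt
            for site, alt in mutations
        ):
            hits.append(clade)
    return hits


definitions = load_clade_definitions(io.StringIO(CLADES_TSV))

# Invented 672-nt toy sequence: all "N" except the two South America I sites.
toy = ["N"] * 672
toy[219 - 1] = "A"
toy[532 - 1] = "A"
toy_sequence = "".join(toy)

print(matching_clades(toy_sequence, definitions))
```

Because several genotypes share sites with different alleles (for example site 219 is `T` for Angola and `A` for South America I), a check like this is mainly useful for spotting typos or conflicting rows when editing the table.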
8 changes: 8 additions & 0 deletions nextclade/defaults/colors.tsv
@@ -0,0 +1,8 @@
# genotypes assigned by augur clades
clade_membership Angola #3F63CF
clade_membership East Africa #529AB6
clade_membership East/Central Africa #75B681
clade_membership South America I #A6BE55
clade_membership South America II #D4B13F
clade_membership West Africa I #E68133
clade_membership West Africa II #DC2F24
16 changes: 16 additions & 0 deletions nextclade/defaults/config.yaml
@@ -0,0 +1,16 @@
files:
  auspice_config: "defaults/auspice_config.json"
  auspice_json: "auspice/tree.json"
  clades: "defaults/clades.tsv"
  colors: "defaults/colors.tsv"
  include: "defaults/include_strains.txt"
  reference_prM-E_fasta: "defaults/reference.fasta"
  reference_prM-E_gff: "defaults/genome_annotation.gff3"
strain_id_field: "accession"
align_and_extract_prM-E:
  min_length: 500
  min_seed_cover: 0.01
ancestral:
  inference: "joint"
export:
  metadata_columns: "strain"
5 changes: 5 additions & 0 deletions nextclade/defaults/genome_annotation.gff3
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
##sequence-region prM-E 1 672
NC_002031.1 feature source 1 672 . + . gene=nuc
NC_002031.1 feature gene 1 333 . + . gene_name=prM
NC_002031.1 feature gene 109 333 . + . gene_name=M
NC_002031.1 feature gene 334 672 . + . gene_name=E
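The gene coordinates in this annotation can be sanity-checked with a short Python sketch. It parses the feature rows, computes each gene's length, and confirms that the length is a whole number of codons. The helper name `codon_counts` is hypothetical, the rows are assumed to be tab-separated as in standard GFF3, and the in-frame test assumes the reading frame starts at position 1 of the 672-nt region.

```python
# Feature rows copied from defaults/genome_annotation.gff3 (tab separation assumed).
GFF3_ROWS = (
    "NC_002031.1\tfeature\tsource\t1\t672\t.\t+\t.\tgene=nuc\n"
    "NC_002031.1\tfeature\tgene\t1\t333\t.\t+\t.\tgene_name=prM\n"
    "NC_002031.1\tfeature\tgene\t109\t333\t.\t+\t.\tgene_name=M\n"
    "NC_002031.1\tfeature\tgene\t334\t672\t.\t+\t.\tgene_name=E\n"
)


def codon_counts(gff3_text):
    """Return {gene_name: number of codons} for every `gene` feature.

    Hypothetical helper. Raises ValueError if a gene is not a whole number
    of codons or does not start in frame with position 1 (an assumption).
    """
    counts = {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        fields = line.split("\t")
        feature_type, start, end, attrs = fields[2], int(fields[3]), int(fields[4]), fields[8]
        if feature_type != "gene":
            continue
        attributes = dict(item.split("=", 1) for item in attrs.split(";"))
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length % 3 != 0 or (start - 1) % 3 != 0:
            raise ValueError(f"{attributes['gene_name']} is not codon-aligned")
        counts[attributes["gene_name"]] = length // 3
    return counts


print(codon_counts(GFF3_ROWS))
```

Under those assumptions the three genes come out at 111, 75, and 113 codons, and the prM and E features together cover the full 672-nt region.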
136 changes: 136 additions & 0 deletions nextclade/defaults/include_strains.txt
@@ -0,0 +1,136 @@
# Extracted from tables and figures in Mutebi et al. (J Virol. 2001
# Aug;75(15):6999-7008) and Bryant et al. (PLoS Pathog. 2007 May
# 18;3(5):e75)
AF369669
AF369670
AF369671
AY540431
AY540432
AY540433
AY540434
AY540435
U52390
AY540437
AY540438
AY540439
AY540440
AY540441
AY540442
AY540443
AY540444
AY540445
AY540446
AY540447
AY540448
AY540449
AY540450
AY540451
AY540452
AY540453
U23570
AY540454
AY540455
AY540456
AY540457
AY540458
AY540459
AY540460
AY540461
AY540462
AY540463
AY540464
AY540465
AY540466
AY540467
AY540468
AY540469
AY540470
AY540471
AY540472
AY540473
AY540436
U52392
U52395
AF369672
AF369673
AY540475
AY540476
AY540474
U52399
AY540477
AY540478
AF369674
AF369675
AY572535
AY640589
AF369686
U54798
AY603338
AF369676
U52403
AF369677
AF369678
AF368679
AF369680
AF369681
AF369682
AF369683
AF369684
AF369685
AY540479
AY540480
AY161927
AY161928
AY161929
AY161930
AY161931
U52411
AY161933
AY161934
AY161935
U52405
U52407
AY161938
AY161939
AY161940
AY161941
AY161942
AY161943
AY161944
AY161945
AY161946
AY161947
AY161948
AY161949
AY161950
AY161951
GI694115
U89338
AF369687
AF369688
U52413
AF369689
AF369690
AF369691
AF369692
AF369693
AY690831
AY690832
AY690833
DQ872411
DQ872412
AY540481
AY540482
AY540483
AY540484
AY540485
AY540486
AF369694
U52422
AF369695
AF369696
AY540487
AY540488
AY540489
AY540490
AF369697
3 changes: 3 additions & 0 deletions nextclade/defaults/nextclade-dataset/CHANGELOG.md
@@ -0,0 +1,3 @@
## Unreleased

Initial release of yellow fever virus dataset.
47 changes: 47 additions & 0 deletions nextclade/defaults/nextclade-dataset/README.md
@@ -0,0 +1,47 @@
# Yellow fever virus dataset

| Key | Value |
| ----------------- | -----------------------------------------------------------------|
| name | Yellow fever virus (YFV) prM-E region |
| authors | [Nextstrain](https://nextstrain.org) |
| reference | AY640589.1 |
| workflow | <https://github.com/nextstrain/yellow-fever/tree/main/nextclade> |
| path | `nextstrain/yellow-fever/prM-E` |

## Scope of this dataset

This dataset assigns genotypes to yellow fever virus samples based on
strain and genotype information from [Mutebi et al.][] (J Virol. 2001
Aug;75(15):6999-7008) and [Bryant et al.][] (PLoS Pathog. 2007 May 18;3(5):e75).

Together, these two papers define 7 distinct yellow fever virus
genotypes based on a 670-nucleotide region of the yellow fever virus
genome (bases 641-1310), called the prM-E region. This region
comprises the 3' end of the pre-membrane protein (prM) gene, the
entire membrane protein (M) gene, and the 5' end of the envelope
protein (E) gene.

(N.b., the reference sequence used in this dataset is actually 672 nt
long, from bases 641-1312 of the genome reference. The 2 extra bases
make the reference a complete open reading frame.)

This dataset can be used to assign genotypes to any sequence that
includes at least 500 bp of the prM-E region, including whole genome
sequences. Sequence data beyond the prM-E region will be reported as an
insertion in the Nextclade output.

## Features

This dataset supports:

- Assignment of genotypes
- Phylogenetic placement
- Sequence quality control (QC)

## What are Nextclade datasets

Read more about Nextclade datasets in the Nextclade documentation:
<https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html>

[Mutebi et al.]: https://pubmed.ncbi.nlm.nih.gov/11435580/
[Bryant et al.]: https://pubmed.ncbi.nlm.nih.gov/17511518/
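The coordinate arithmetic in the dataset README above can be checked with a few lines of Python. This is a sketch only: the constants are the base ranges quoted in the README, and `span_length` and `genome_to_region` are hypothetical helpers introduced here for illustration. Coordinates are assumed to be 1-based and inclusive.

```python
# Genome coordinates quoted in the dataset README (assumed 1-based, inclusive).
REGION_START = 641
PUBLISHED_END = 1310  # end of the prM-E region as described in the two papers
REFERENCE_END = 1312  # end of the reference sequence used by this dataset


def span_length(start, end):
    """Length of an inclusive 1-based interval."""
    return end - start + 1


def genome_to_region(position, region_start=REGION_START):
    """Map a genome coordinate to a prM-E region coordinate (hypothetical helper)."""
    return position - region_start + 1


published = span_length(REGION_START, PUBLISHED_END)
reference = span_length(REGION_START, REFERENCE_END)

print(published, published % 3)  # region length and leftover bases past the last full codon
print(reference, reference % 3)  # reference length and leftover bases
print(genome_to_region(REGION_START), genome_to_region(REFERENCE_END))
```

The 670-nt published region leaves one base over after the last full codon, while the 672-nt reference divides evenly into 224 codons, which is consistent with the README's note about the 2 extra bases.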