Add nextclade workflow [#2]
genehack committed Aug 1, 2024
1 parent ff682b1 commit 23d5d30
Showing 14 changed files with 607 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
Expand Up @@ -3,6 +3,11 @@ ingest/benchmarks
ingest/data
ingest/logs
ingest/results
nextclade/auspice
nextclade/benchmarks
nextclade/data
nextclade/logs
nextclade/results
phylogenetic/auspice
phylogenetic/benchmarks
phylogenetic/logs
26 changes: 26 additions & 0 deletions nextclade/README.md
@@ -0,0 +1,26 @@
# Yellow Fever Virus Nextclade Dataset Tree

This workflow creates a phylogenetic tree that can be used as part of
a Nextclade dataset to assign genotypes to yellow fever virus samples based on
FIXME reference to those two papers goes here.

* Build a tree using samples from the `ingest` output, with the following
  sampling criteria:
  * Force-include the following samples:
    * genotype reference strains
* Assign genotypes to each sample and internal nodes of the tree with
  `augur clades`, using clade-defining mutations in
  `defaults/clades.tsv`
* Provide the following coloring options on the tree:
  * Genotype assignment from `augur clades`

## How to create a new tree

* Run the workflow: `nextstrain build .`
* Inspect the output tree by comparing genotype assignments from the following sources:
  * `augur clades` output
* If unwanted samples are present in the tree, add them to
  `defaults/dropped_strains.tsv` and re-run the workflow
* If any changes are needed to the clade-defining mutations, add
  changes to `defaults/clades.tsv` and re-run the workflow
* Repeat as needed
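
The inspection step in the README above could be supported by a small script that tallies genotype assignments over the tips of the exported tree. The following is a minimal sketch, not part of this commit. It assumes the export is an Auspice v2 JSON (a top-level `tree` object whose nodes carry `node_attrs` and optional `children`) and that `augur clades` wrote its result under the attribute name `clade_membership`, which is the name used in `defaults/colors.tsv`. The helper names and the example tree are hypothetical.

```python
import json
from collections import Counter


def count_tip_values(tree, attr="clade_membership"):
    """Count the value of node attribute `attr` over every tip of a tree."""
    counts = Counter()
    stack = [tree]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        if children:
            stack.extend(children)
        else:
            # Tips without the attribute are reported as "unassigned".
            value = node.get("node_attrs", {}).get(attr, {}).get("value", "unassigned")
            counts[value] += 1
    return counts


def count_from_file(path, attr="clade_membership"):
    """Load an Auspice JSON from `path` and tally `attr` over its tips."""
    with open(path) as handle:
        return count_tip_values(json.load(handle)["tree"], attr)


# Hand-made tree for illustration only; the tip names are placeholders.
example_tree = {
    "name": "NODE_0000000",
    "children": [
        {"name": "tip_a", "node_attrs": {"clade_membership": {"value": "Angola"}}},
        {
            "name": "NODE_0000001",
            "children": [
                {"name": "tip_b", "node_attrs": {"clade_membership": {"value": "Angola"}}},
                {"name": "tip_c", "node_attrs": {}},
            ],
        },
    ],
}

example_counts = count_tip_values(example_tree)
print(dict(example_counts))
```

After a build, `count_from_file("auspice/yellow-fever-virus_prM-E.json")` would give the per-genotype tip counts for the path configured in `defaults/config.yaml`, provided the attribute name assumption holds.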
25 changes: 25 additions & 0 deletions nextclade/Snakefile
@@ -0,0 +1,25 @@
configfile: "defaults/config.yaml"

rule all:
    input:
        auspice_json = config["files"]["auspice_json"],

include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"

rule clean:
    params:
        targets = [
            ".snakemake",
            "auspice",
            "benchmarks",
            "data",
            "logs",
            "results",
        ]
    shell:
        """
        rm -rfv {params.targets}
        """
59 changes: 59 additions & 0 deletions nextclade/defaults/auspice_config.json
@@ -0,0 +1,59 @@
{
  "title": "Real-time tracking of yellow fever virus full genome virus evolution",
  "maintainers": [
    {"name": "John SJ Anderson", "url": "https://bedford.io/team/john-sj-anderson/"},
    {"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
  ],
  "data_provenance": [
    {
      "name": "GenBank",
      "url": "https://www.ncbi.nlm.nih.gov/genbank/"
    }
  ],
  "build_url": "https://github.com/nextstrain/yellow-fever",
  "colorings": [
    {
      "key": "gt",
      "title": "Genotype",
      "type": "categorical"
    },
    {
      "key": "num_date",
      "title": "Date",
      "type": "continuous"
    },
    {
      "key": "region",
      "title": "Region",
      "type": "categorical"
    },
    {
      "key": "country",
      "title": "Country",
      "type": "categorical"
    },
    {
      "key": "host",
      "title": "Host",
      "type": "categorical"
    }
  ],
  "geo_resolutions": [
    "country",
    "region"
  ],
  "display_defaults": {
    "map_triplicate": true,
    "color_by": "region"
  },
  "filters": [
    "clade",
    "region",
    "country",
    "author",
    "host"
  ],
  "metadata_columns": [
    "author"
  ]
}
40 changes: 40 additions & 0 deletions nextclade/defaults/clades.tsv
@@ -0,0 +1,40 @@
clade	gene	site	alt
Angola	nuc	72	A
Angola	nuc	81	G
Angola	nuc	88	C
Angola	nuc	90	A
Angola	nuc	99	T
Angola	nuc	111	G
Angola	nuc	219	T
Angola	nuc	240	C
Angola	nuc	246	A
Angola	nuc	252	A
Angola	nuc	255	A
Angola	nuc	291	G
Angola	nuc	294	A
Angola	nuc	300	A
Angola	nuc	315	G
Angola	nuc	327	G
Angola	nuc	372	A
Angola	nuc	420	A
Angola	nuc	432	A
Angola	nuc	453	T
Angola	nuc	492	G
Angola	nuc	651	T
East Africa	nuc	45	A
East Africa	nuc	171	G
East Africa	nuc	438	G
East Africa	nuc	468	T
East/Central Africa	nuc	228	G
West Africa I	nuc	183	G
West Africa I	nuc	255	C
West Africa II	nuc	93	T
West Africa II	nuc	270	A
West Africa II	nuc	321	T
West Africa II	nuc	477	A
South America I	nuc	219	A
South America I	nuc	532	A
South America II	nuc	114	C
South America II	nuc	193	T
South America II	nuc	249	A
South America II	nuc	639	G
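
Because the README asks maintainers to edit `defaults/clades.tsv` by hand, a quick sanity check on the table may be useful before re-running the workflow. The sketch below is an illustration rather than part of this commit. It assumes the file is tab-separated with the header `clade gene site alt` (as the `.tsv` extension suggests) and that `nuc` sites are 1-based positions on the 672 nt prM-E reference. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

REFERENCE_LENGTH = 672  # length of the prM-E reference added in this commit


def load_definitions(handle):
    """Return {clade: {site: alt}} for the nucleotide rows of a clades TSV."""
    definitions = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["gene"] == "nuc":
            definitions[row["clade"]][int(row["site"])] = row["alt"]
    return dict(definitions)


def out_of_range_sites(definitions, length=REFERENCE_LENGTH):
    """List (clade, site) pairs whose site falls outside 1..length."""
    return [
        (clade, site)
        for clade, sites in definitions.items()
        for site in sites
        if not 1 <= site <= length
    ]


def shared_sites(definitions):
    """Return {site: {clade: alt}} for sites used by more than one clade."""
    by_site = defaultdict(dict)
    for clade, sites in definitions.items():
        for site, alt in sites.items():
            by_site[site][clade] = alt
    return {site: clades for site, clades in by_site.items() if len(clades) > 1}


# A few rows copied from the file above, joined with tabs.
EXCERPT = "\n".join(
    "\t".join(fields)
    for fields in [
        ("clade", "gene", "site", "alt"),
        ("Angola", "nuc", "219", "T"),
        ("Angola", "nuc", "255", "A"),
        ("West Africa I", "nuc", "255", "C"),
        ("South America I", "nuc", "219", "A"),
        ("South America I", "nuc", "532", "A"),
    ]
)

definitions = load_definitions(io.StringIO(EXCERPT))
print(out_of_range_sites(definitions))
print(shared_sites(definitions))
```

A site shared between clades is not an error in itself (sites 219 and 255 carry different alleles in different genotypes), but listing them makes it easier to spot an accidental duplicate with the same allele.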
8 changes: 8 additions & 0 deletions nextclade/defaults/colors.tsv
@@ -0,0 +1,8 @@
# genotypes assigned by augur clades
clade_membership	Angola	#FCF007
clade_membership	East Africa	#4B26B1
clade_membership	East/Central Africa	#E307FC
clade_membership	West Africa I	#2CFC07
clade_membership	West Africa II	#9EFC07
clade_membership	South America I	#996633
clade_membership	South America II	#FC0740
25 changes: 25 additions & 0 deletions nextclade/defaults/config.yaml
@@ -0,0 +1,25 @@
files:
  auspice_config: "defaults/auspice_config.json"
  auspice_json: "auspice/yellow-fever-virus_prM-E.json"
  clades: "defaults/clades.tsv"
  colors: "defaults/colors.tsv"
  include: "defaults/include_strains.txt"
  reference_prM-E_fasta: "defaults/yellow-fever-virus-reference_prM-E.fasta"
  reference_prM-E_gff: "defaults/yellow-fever-virus-reference_prM-E.gff"
strain_id_field: "accession"
align_and_extract_prM-E:
  min_length: 650
  min_seed_cover: 0.01
filter:
  group_by: "region year"
  subsample_max_sequences: 500
  min_date: 1927
  min_length: 650
refine:
  coalescent: "opt"
  date_inference: "marginal"
  clock_filter_iqd: 4
ancestral:
  inference: "joint"
export:
  metadata_columns: "strain division location region year host"
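
The rule files that consume this configuration are not shown in this section, so the following is only a guess at how the `filter` block might map onto an `augur filter` invocation. It is a minimal sketch under that assumption: the flags used (`--metadata-id-columns`, `--include`, `--group-by`, `--subsample-max-sequences`, `--min-date`, `--min-length`) are real `augur filter` options, but the mapping, the nesting of the config keys, and the `build_filter_args` helper are hypothetical. Input and output paths are deliberately omitted because the commit excerpt does not give them.

```python
import shlex

# Config values copied from defaults/config.yaml above (only the keys used here).
config = {
    "files": {"include": "defaults/include_strains.txt"},
    "strain_id_field": "accession",
    "filter": {
        "group_by": "region year",
        "subsample_max_sequences": 500,
        "min_date": 1927,
        "min_length": 650,
    },
}


def build_filter_args(config):
    """Assemble a partial `augur filter` argument list from the config dict."""
    params = config["filter"]
    return [
        "augur", "filter",
        "--metadata-id-columns", config["strain_id_field"],
        "--include", config["files"]["include"],
        # group_by is a space-separated string; augur expects separate values.
        "--group-by", *params["group_by"].split(),
        "--subsample-max-sequences", str(params["subsample_max_sequences"]),
        "--min-date", str(params["min_date"]),
        "--min-length", str(params["min_length"]),
    ]


filter_args = build_filter_args(config)
print(shlex.join(filter_args))
```

Splitting `group_by` on whitespace keeps the YAML value readable while still passing `region` and `year` as separate arguments.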
134 changes: 134 additions & 0 deletions nextclade/defaults/include_strains.txt
@@ -0,0 +1,134 @@
# FIXME provide reference
AF369669
AF369670
AF369671
AY540431
AY540432
AY540433
AY540434
AY540435
U52390
AY540437
AY540438
AY540439
AY540440
AY540441
AY540442
AY540443
AY540444
AY540445
AY540446
AY540447
AY540448
AY540449
AY540450
AY540451
AY540452
AY540453
U23570
AY540454
AY540455
AY540456
AY540457
AY540458
AY540459
AY540460
AY540461
AY540462
AY540463
AY540464
AY540465
AY540466
AY540467
AY540468
AY540469
AY540470
AY540471
AY540472
AY540473
AY540436
U52392
U52395
AF369672
AF369673
AY540475
AY540476
AY540474
U52399
AY540477
AY540478
AF369674
AF369675
AY572535
AY640589
AF369686
U54798
AY603338
AF369676
U52403
AF369677
AF369678
AF368679
AF369680
AF369681
AF369682
AF369683
AF369684
AF369685
AY540479
AY540480
AY161927
AY161928
AY161929
AY161930
AY161931
U52411
AY161933
AY161934
AY161935
U52405
U52407
AY161938
AY161939
AY161940
AY161941
AY161942
AY161943
AY161944
AY161945
AY161946
AY161947
AY161948
AY161949
AY161950
AY161951
GI694115
U89338
AF369687
AF369688
U52413
AF369689
AF369690
AF369691
AF369692
AF369693
AY690831
AY690832
AY690833
DQ872411
DQ872412
AY540481
AY540482
AY540483
AY540484
AY540485
AY540486
AF369694
U52422
AF369695
AF369696
AY540487
AY540488
AY540489
AY540490
AF369697
13 changes: 13 additions & 0 deletions nextclade/defaults/yellow-fever-virus-reference_prM-E.fasta
@@ -0,0 +1,13 @@
> prM-E region (genome 641-1312, 672 nt)
CCAAGAGAGGAGCCAGATGACATTGATTGCTGGTGCTATGGGGTGGAAAACGTTAGAGTC
GCATATGGTAAGTGTGACTCAGCAGGCAGGTCTAGGAGGTCAAGAAGGGCCATTGACTTG
CCTACGCATGAAAACCATGGTTTGAAGACCCGGCAAGAAAAATGGATGACTGGAAGAATG
GGTGAAAGGCAACTCCAAAAGATTGAGAGATGGTTCGTGAGGAACCCCTTTTTTGCAGTG
ACGGCTCTGACCATTGCCTACCTTGTGGGAAGCAACATGACGCAACGAGTCGTGATTGCC
CTACTGGTCTTGGCTGTTGGTCCGGCCTACTCAGCTCACTGCATTGGAATTACTGACAGG
GATTTCATTGAGGGGGTGCATGGAGGAACTTGGGTTTCAGCTACCCTGGAGCAAGACAAG
TGTGTCACTGTTATGGCCCCTGACAAGCCTTCATTGGACATCTCACTAGAGACAGTAGCC
ATTGATAGACCTGCTGAGGTGAGGAAAGTGTGTTACAATGCAGTTCTCACTCATGTGAAG
ATTAATGACAAGTGCCCCAGCACTGGAGAGGCCCACCTAGCTGAAGAGAACGAAGGGGAC
AATGCGTGCAAGCGCACTTATTCTGATAGAGGCTGGGGCAATGGCTGTGGCCTATTTGGG
AAAGGGAGCATT
5 changes: 5 additions & 0 deletions nextclade/defaults/yellow-fever-virus-reference_prM-E.gff
@@ -0,0 +1,5 @@
##sequence-region prM-E 1 672
NC_002031.1	feature	source	1	672	.	+	.	gene=nuc
NC_002031.1	feature	gene	1	334	.	+	.	gene_name=prM
NC_002031.1	feature	gene	110	334	.	+	.	gene_name=M
NC_002031.1	feature	gene	335	672	.	+	.	gene_name=E
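
The FASTA header states that the prM-E region spans genome positions 641-1312 (672 nt), and the GFF annotates features in coordinates local to that region. The sketch below checks the arithmetic (1312 - 641 + 1 = 672) and that every feature fits inside the region. It is illustrative only: it assumes the GFF columns are tab-separated, and the `parse_gff_features` and `to_genome` helpers are hypothetical names, not part of the workflow.

```python
GENOME_START, GENOME_END = 641, 1312  # from the FASTA header in this commit
REGION_LENGTH = GENOME_END - GENOME_START + 1


def parse_gff_features(text):
    """Parse non-comment GFF lines into dicts with type, start, end, attrs."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)
        features.append(
            {"type": cols[2], "start": int(cols[3]), "end": int(cols[4]), "attrs": attrs}
        )
    return features


def to_genome(position):
    """Convert a 1-based region coordinate to a 1-based genome coordinate."""
    return position + GENOME_START - 1


# The feature lines from the GFF above, rebuilt with tab separators.
GFF_TEXT = "\n".join(
    ["##sequence-region prM-E 1 672"]
    + [
        "\t".join(["NC_002031.1", "feature", kind, str(start), str(end), ".", "+", ".", attr])
        for kind, start, end, attr in [
            ("source", 1, 672, "gene=nuc"),
            ("gene", 1, 334, "gene_name=prM"),
            ("gene", 110, 334, "gene_name=M"),
            ("gene", 335, 672, "gene_name=E"),
        ]
    ]
)

features = parse_gff_features(GFF_TEXT)
for feature in features:
    inside = 1 <= feature["start"] <= feature["end"] <= REGION_LENGTH
    print(feature["attrs"], to_genome(feature["start"]), to_genome(feature["end"]), inside)
```

The same conversion is handy when relating the `site` values in `defaults/clades.tsv` back to positions on the full-length genome.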