Skip to content

Commit

Permalink
Segment builds for H5N1 cattle flu outbreak
Browse files Browse the repository at this point in the history
Individual datasets for all 8 segments are added using NCBI data
(consistent with the whole-genome view) and using the strains included
in the genome tree to choose the appropriate monophyletic clade in each
segment tree to include.

There is an added Auspice colouring "genome_tree" which indicates
whether the strain exists in the whole genome tree.

See top-level README for instructions on how to run.

There are a number of important caveats with the current implementation,
all of which are fixable but I think better done in future work.

* The input source must be NCBI for this build to work as expected and
thus the `s3_src` config argument is essential. Snakemake will not reuse
existing files in `./data` if they originated from a different source
(e.g. GISAID), despite their filenames being identical as the params of
the originating rule are different. I've added a note about this in the
README.

* The filenames produced include "_all-time" as the current Snakemake
workflow requires a "time" wildcard, however we want to remove this
prior to uploading. E.g.
`auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json` should be
uploaded to /avian-flu/h5n1-cattle-outbreak/ha to match the URL
structure for the genome build. We can use the following bash one-liner
prior to uploading while this remains unfixed:
```
for i in auspice/avian-flu_h5n1-cattle-outbreak_*_all-time.json; do mv $i ${i%_all-time.json}.json; done
```

* The segment builds may include basal strains which aren't part of the
  cattle-flu outbreak but are included because there aren't mutations
  which distinguish them from others which should be. The "genome_tree"
  colouring is helpful here. (See the note in
  `restrict-via-common-ancestor.py` -- we may want to include more basal
  strains here).
  • Loading branch information
jameshadfield committed Jul 12, 2024
1 parent 75e4139 commit 3d65a9b
Show file tree
Hide file tree
Showing 5 changed files with 234 additions and 37 deletions.
76 changes: 52 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,13 @@
# nextstrain.org/avian-flu

This is the Nextstrain build for avian influenza subtypes A/H5N1, A/H5NX, A/H7N9, and A/H9N2.
This is the Nextstrain build for avian influenza subtypes A/H5N1, A/H5NX, A/H7N9, and A/H9N2 as well as analysis of the 2024 A/H5N1 cattle-flu outbreak.
The most up-to-date builds of avian influenza can be found [on nextstrain.org](https://nextstrain.org/avian-flu).
Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about augur and pathogen builds.

## Building
## Segment-level GISAID builds

All 32 builds (4 subtypes x 8 segments) can be built by running `snakemake`.
The default Snakemake pipeline builds 32 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2)),
and can be run via `snakemake` with no config overrides.
This pipeline starts by downloading data from a private S3 bucket and the appropriate credentials are required; see below for how to use locally ingested files.
For rapid AWS rebuild run as:

Expand All @@ -25,10 +26,57 @@ to nextstrain.org by running:
nextstrain build . deploy_all
```

## H5N1 Cattle Outbreak (2024)

We produce per-segment and whole-genome (concatenated segments) builds for the ongoing H5N1 cattle-flu outbreak.
These use NCBI data including consensus genomes and SRA data assembled via the Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).

> Running this build will overwrite GISAID files in `./data` and thus you can't maintain or run GISAID & NCBI builds in parallel. In most cases this isn't an issue and [we are working on improving this](https://github.com/nextstrain/avian-flu/issues/70). You may want to proactively remove the `./data` directory yourself to make sure everything works as expected.
#### Whole genome build

An up-to-date version of this build is available at [nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome](https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome).

``` bash
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
--snakefile Snakefile.genome \
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1
```

The build is restricted to a set of strains where we think there's no reassortment, with outgroups excluded (`config/dropped_strains_h5n1-cattle-outbreak.txt`).
Output files will be placed in `results/h5n1-cattle-outbreak/genome`.
See `Snakefile.genome` for more details.

#### Segment-level builds

Strains for each segment are chosen by first constructing a general tree for the segment with all strains from 2024 onwards and then taking the clade which contains all strains in the genome build.
This should allow any reassortments to be highlighted and will also include outbreak strains which are missing from the genome build (because they don't have all 8 segments sequenced).


``` bash
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1 \
-pf \
auspice/avian-flu_h5n1-cattle-outbreak_pb2_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_pb1_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_pa_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_np_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_na_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_mp_all-time.json \
auspice/avian-flu_h5n1-cattle-outbreak_ns_all-time.json
```

## Creating a custom build
The easiest way to generate your own, custom avian-flu build is to use the quickstart-build as a starting template. Simply clone the quickstart-build, run with the example data, and edit the Snakefile to customize. This build includes example data and a simplified, heavily annotated Snakefile that goes over the structure of Snakefiles and annotates rules and inputs/outputs that can be modified. This build, with it's own readme, is available [here](https://github.com/nextstrain/avian-flu/tree/master/quickstart-build).

### Features unique to avian flu builds
## Features unique to avian flu builds

#### cleavage site annotations
Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2.
Expand Down Expand Up @@ -73,26 +121,6 @@ Run the pipeline with `--config 'local_ingest=True'` to use the locally availabl
Specifically, the files needed are `ingest/results/metadata.tsv` and `ingest/results/sequences_{SEGMENT}.fasta`.


#### Running full genome builds

Run full genome builds with the following command.

``` bash
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
--snakefile Snakefile.genome \
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1
```

Currently this is only set up for the "h5n1-cattle-outbreak" build using NCBI data,
and the build is restricted to a set of strains where we think there's no reassortment, with outgroups
excluded in (`config/dropped_strains_h5n1-cattle-outbreak.txt`).
Output files will be placed in `results/h5n1-cattle-outbreak/genome`.
See `Snakefile.genome` for more details.


### To modify this build to work with your own data
Although the simplest way to generate a custom build is via the quickstart build, you are welcome to clone this repo and use it as a starting point for running your own, local builds if you'd like. The [Nextstrain docs](https://docs.nextstrain.org/en/latest/index.html) are a fantastic resource for getting started with the Nextstrain pipeline, and include some [great tutorials](https://docs.nextstrain.org/en/latest/install.html) to get you started. This build is slightly more complicated than other builds, and has a few custom functions in it to accommodate all the features on [nextstrain.org](https://nextstrain.org/avian-flu), and makes use of wildcards for both subtypes and gene segments. If you'd like to adapt this full, non-simplified pipeline here to your own data (which you may want to do if you also want to annotate clades), you would need to make a few changes and additions:

Expand Down
73 changes: 60 additions & 13 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
include: "rules/common.smk"
include: "rules/cattle-flu.smk"

SUBTYPES = config.get('subtypes', ["h5nx", "h5n1", "h7n9", "h9n2"])
SEGMENTS = config.get('segments', ["pb2", "pb1", "pa", "ha","np", "na", "mp", "ns"])
Expand All @@ -11,6 +12,11 @@ TARGET_SEQUENCES_PER_TREE = config.get('n_seqs', 3000)
# (2) Filter the other segments by simply force-including the same strains as (1)
SAME_STRAINS = bool(config.get('same_strains_per_segment', False))

# constrain the wildcards to not include `_` which we use to separate "parts" of filenames (where a part may be a wildcard itself)
wildcard_constraints:
subtype = "[^_]+",
segment = "[^_]+",
time = "[^_]+",

def all_targets():
return [
Expand All @@ -33,19 +39,23 @@ rule test_target:

rule files:
params:
dropped_strains = "config/dropped_strains_{subtype}.txt",
include_strains = "config/include_strains_{subtype}_{time}.txt",
reference = "config/reference_{subtype}_{segment}.gb",
colors = "config/colors_{subtype}.tsv",
lat_longs = "config/lat_longs_{subtype}.tsv",
dropped_strains = lambda w: "config/dropped_strains_{subtype}.txt" if not w.subtype=='h5n1-cattle-outbreak' else "config/dropped_strains_h5n1.txt",
include_strains = lambda w: "config/include_strains_{subtype}_{time}.txt" if not w.subtype=='h5n1-cattle-outbreak' else "config/include_strains_{subtype}.txt",
reference = lambda w: "config/reference_{subtype}_{segment}.gb" if not w.subtype=='h5n1-cattle-outbreak' else "config/reference_h5n1_{segment}.gb",
colors = lambda w: "config/colors_{subtype}.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "config/colors_h5n1.tsv",
# TODO - Augur 24.4.0 includes extensive lat-longs by default - can we drop the following avian-flu specific ones?
lat_longs = lambda w: "config/lat_longs_{subtype}.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "config/lat_longs_h5n1.tsv",
auspice_config = "config/auspice_config_{subtype}.json",
clades_file = "clade-labeling/{subtype}-clades.tsv",
description = "config/description.md"
clades_file = lambda w: "clade-labeling/{subtype}-clades.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "clade-labeling/h5n1-clades.tsv",
description = lambda w: "config/description.md" if not w.subtype=='h5n1-cattle-outbreak' else "config/description_{subtype}.md",

files = rules.files.params

def metadata_by_wildcards(wildcards):
if wildcards.subtype in ("h5n1", "h5nx"):
# H5 builds have extra clade-level metadata added to the metadata TSV.
# We may move this to a node-data JSON which would simplify the snakemake logic
# a bit -- see <https://github.com/nextstrain/avian-flu/issues/25>
if wildcards.subtype in ("h5n1", "h5nx", "h5n1-cattle-outbreak"):
return "results/metadata-with-clade_{subtype}.tsv"
else:
return "data/metadata_{subtype}.tsv"
Expand All @@ -57,6 +67,8 @@ def group_by(w):
'h7n9': {'all-time': 'division year'},
'h9n2': {'all-time': 'country year'}
}
if w.subtype == 'h5n1-cattle-outbreak':
return ''
return gb[w.subtype][w.time]

def min_length(w):
Expand All @@ -65,6 +77,8 @@ def min_length(w):
return(length)

def min_date(w):
if w.subtype == 'h5n1-cattle-outbreak':
return "2024"
date = {
'h5nx': {'all-time': '1996', '2y': '2Y'},
'h5n1': {'all-time': '1996', '2y': '2Y'},
Expand All @@ -75,6 +89,7 @@ def min_date(w):

def traits_columns(w):
traits = {'h5nx':'region','h5n1': 'region country', 'h7n9': 'country division', 'h9n2': 'region country'}
traits['h5n1-cattle-outbreak'] = traits['h5n1']
return traits[w.subtype]

def clock_rate(w):
Expand Down Expand Up @@ -104,7 +119,8 @@ def clock_rate(w):
'h5nx': {'all-time':'', '2y': clock_rates_h5nx[w.segment]},
'h5n1': {'all-time':'', '2y': clock_rates_h5n1[w.segment]},
'h7n9': {'all-time':''},
'h9n2': {'all-time':''}
'h9n2': {'all-time':''},
'h5n1-cattle-outbreak': {'all-time': clock_rates_h5n1[w.segment]}
}

return clock_rate[w.subtype][w.time]
Expand All @@ -115,7 +131,8 @@ def clock_rate_std_dev(w):
'h5nx': {'all-time': '', '2y': '--clock-std-dev 0.00211'},
'h5n1': {'all-time': '', '2y': '--clock-std-dev 0.00211'},
'h7n9': {'all-time': ''},
'h9n2': {'all-time': ''}
'h9n2': {'all-time': ''},
'h5n1-cattle-outbreak': {'all-time': '--clock-std-dev 0.00211'}
}

return clock_rate_std_dev[w.subtype][w.time]
Expand Down Expand Up @@ -159,12 +176,18 @@ def _filter_params(wildcards, input, output, threads, resources):
# strains may not have all segments, but that's preferable to filtering them out.
restrict_n_segments = f"n_segments!={len(SEGMENTS)}" if SAME_STRAINS else ''

## By default we filter out unknown country/region, but skip this for the cattle-flu outbreak just
## in case there's something interesting with limited metadata
require_location = ' country=? region=? ' if wildcards.subtype!='h5n1-cattle-outbreak' else ''

grouping = f" --group-by {group_by(wildcards)}" if group_by(wildcards) else ""

# formulate our typical filtering parameters
cmd = f" --group-by {group_by(wildcards)}"
cmd = grouping
cmd += f" --subsample-max-sequences {TARGET_SEQUENCES_PER_TREE}"
cmd += f" --min-date {min_date(wildcards)}"
cmd += f" --include {input.include}"
cmd += f" --exclude-where host=laboratoryderived host=ferret host=unknown host=other host=host country=? region=? gisaid_clade=3C.2 {restrict_n_segments}"
cmd += f" --exclude-where host=laboratoryderived host=ferret host=unknown host=other host=host {require_location} gisaid_clade=3C.2 {restrict_n_segments}"
cmd += f" --min-length {min_length(wildcards)}"
cmd += f" --non-nucleotide"
return cmd
Expand Down Expand Up @@ -274,6 +297,13 @@ rule refine:
"""

def refined_tree(w):
"""
Return the refined tree to be used for export, traits, ancestry reconstruction etc
The cattle-flu build introduces an additional step beyond `augur refine`, which is
why this function exists.
"""
if w.subtype=='h5n1-cattle-outbreak':
return "results/tree_{subtype}_{segment}_{time}_outbreak-clade.nwk"
return "results/tree_{subtype}_{segment}_{time}.nwk"

rule ancestral:
Expand Down Expand Up @@ -347,14 +377,28 @@ rule cleavage_site:
"""

def export_node_data_files(wildcards):
return [
nd = [
rules.refine.output.node_data,
rules.traits.output.node_data,
rules.ancestral.output.node_data,
rules.translate.output.node_data,
rules.cleavage_site.output.cleavage_site_annotations,
rules.cleavage_site.output.cleavage_site_sequences,
]
if wildcards.subtype=="h5n1-cattle-outbreak":
nd.append("results/tree_{subtype}_{segment}_{time}_outbreak-clade.json")
return nd


def additional_export_config(wildcards):
args = ""

if wildcards.subtype == "h5n1-cattle-outbreak":
# The auspice-config is for the whole genome analysis, so override the title
segment = wildcards.segment.upper()
args += f"--title 'Ongoing influenza A/H5N1 cattle outbreak in North America ({segment} segment)'"

return args

rule export:
message: "Exporting data files for for auspice"
Expand All @@ -368,6 +412,8 @@ rule export:
description = files.description
output:
auspice_json = "auspice/avian-flu_{subtype}_{segment}_{time}.json"
params:
additional_config = additional_export_config
shell:
"""
augur export v2 \
Expand All @@ -379,6 +425,7 @@ rule export:
--auspice-config {input.auspice_config} \
--description {input.description} \
--include-root-sequence-inline \
{params.additional_config} \
--output {output.auspice_json}
"""

Expand Down
38 changes: 38 additions & 0 deletions rules/cattle-flu.smk
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@

rule download_tree:
"""
Downloads the tree behind nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome
so that the segment-level builds can use it to restrict the strains used.
TODO: if the whole-genome analysis has been run locally we should (optionally) use that tree.
"""
output:
tree = "results/tree_{subtype}_genome.json",
params:
dataset="https://data.nextstrain.org/avian-flu_h5n1-cattle-outbreak_genome.json"
wildcard_constraints:
subtype="h5n1-cattle-outbreak",
time="all-time",
shell:
"""
curl --compressed {params.dataset} -o {output.tree}
"""


rule prune_tree:
input:
tree = "results/tree_{subtype}_{segment}_{time}.nwk",
strains = "results/tree_{subtype}_genome.json",
output:
tree = "results/tree_{subtype}_{segment}_{time}_outbreak-clade.nwk",
node_data = "results/tree_{subtype}_{segment}_{time}_outbreak-clade.json",
wildcard_constraints:
subtype="h5n1-cattle-outbreak",
time="all-time",
shell:
"""
python3 scripts/restrict-via-common-ancestor.py \
--tree {input.tree} \
--strains {input.strains} \
--output-tree {output.tree} \
--output-metadata {output.node_data}
"""
1 change: 1 addition & 0 deletions rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ def subtypes_by_subtype_wildcard(wildcards):
'h7n9': ['h7n9'],
'h9n2': ['h9n2'],
}
db['h5n1-cattle-outbreak'] = [*db['h5nx']]
return(db[wildcards.subtype])

if LOCAL_INGEST:
Expand Down
Loading

0 comments on commit 3d65a9b

Please sign in to comment.