Segment builds for H5N1 cattle flu outbreak

Individual datasets for all 8 segments are added using NCBI data (consistent with the whole-genome view) and using the strains included in the genome tree to choose the appropriate monophyletic clade in each segment tree to include. There is an added Auspice colouring "genome_tree" which indicates whether the strain exists in the whole genome tree. See top-level README for instructions on how to run. There are a number of important caveats with the current implementation, all of which are fixable but I think better done in future work. * The input source must be NCBI for this build to work as expected and thus the `s3_src` config argument is essential. Snakemake will not reuse existing files in `./data` if they originated from a different source (e.g. GISAID), despite their filenames being identical as the params of the originating rule are different. I've added a note about this in the README. * The filenames produced include "_all-time" as the current Snakemake workflow requires a "time" wildcard, however we want to remove this prior to uploading. E.g. `auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json` should be uploaded to /avian-flu/h5n1-cattle-outbreak/ha to match the URL structure for the genome build. We can use the following bash one-liner prior to uploading while this remains unfixed: ``` for i in auspice/avian-flu_h5n1-cattle-outbreak_*_all-time.json; do mv $i ${i%_all-time.json}.json; done ``` * The segment builds may include basal strains which aren't part of the cattle-flu outbreak but are included because there aren't mutations which distinguish them from others which should be. The "genome_tree" colouring is helpful here. (See the note in `restrict-via-common-ancestor.py` -- we may want to include more basal strains here).
nextstrain · Jul 12, 2024 · 3d65a9b · 3d65a9b
1 parent 75e4139
commit 3d65a9b
Show file tree

Hide file tree

Showing 5 changed files with 234 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -1,12 +1,13 @@
 # nextstrain.org/avian-flu
 
-This is the Nextstrain build for avian influenza subtypes A/H5N1, A/H5NX, A/H7N9, and A/H9N2.
+This is the Nextstrain build for avian influenza subtypes A/H5N1, A/H5NX, A/H7N9, and A/H9N2 as well as analysis of the 2024 A/H5N1 cattle-flu outbreak.
 The most up-to-date builds of avian influenza can be found [on nextstrain.org](https://nextstrain.org/avian-flu).
 Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about augur and pathogen builds.
 
-## Building
+## Segment-level GISAID builds
 
-All 32 builds (4 subtypes x 8 segments) can be built by running `snakemake`.
+The default Snakemake pipeline builds 32 Auspice datasets (8 segments x 4 subtypes (A/H5N1, A/H5NX, A/H7N9, A/H9N2)),
+and can be run via `snakemake` with no config overrides. 
 This pipeline starts by downloading data from a private S3 bucket and the appropriate credentials are required; see below for how to use locally ingested files.
 For rapid AWS rebuild run as:
 
@@ -25,10 +26,57 @@ to nextstrain.org by running:
 nextstrain build . deploy_all
 ```
 
+## H5N1 Cattle Outbreak (2024)
+
+We produce per-segment and whole-genome (concatenated segments) builds for the ongoing H5N1 cattle-flu outbreak.
+These use NCBI data including consensus genomes and SRA data assembled via the Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
+
+> Running this build will overwrite GISAID files in `./data` and thus you can't maintain or run GISAID & NCBI builds in parallel. In most cases this isn't an issue and [we are working on improving this](https://github.com/nextstrain/avian-flu/issues/70). You may want to proactively remove the `./data` directory yourself to make sure everything works as expected.
+
+#### Whole genome build
+
+An up-to-date version of this build is available at [nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome](https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome).
+
+``` bash
+nextstrain build \
+    --env AWS_ACCESS_KEY_ID \
+    --env AWS_SECRET_ACCESS_KEY \
+    . \
+        --snakefile Snakefile.genome \
+        --config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1
+```
+
+The build is restricted to a set of strains where we think there's no reassortment, with outgroups excluded (`config/dropped_strains_h5n1-cattle-outbreak.txt`).
+Output files will be placed in `results/h5n1-cattle-outbreak/genome`.
+See `Snakefile.genome` for more details.
+
+#### Segment-level builds
+
+Strains for each segment are chosen by first constructing a general tree for the segment with all strains from 2024 onwards and then taking the clade which contains all strains in the genome build.
+This should allow any reassortments to be highlighted and will also include outbreak strains which are missing from the genome build (because they don't have all 8 segments sequenced).
+
+
+``` bash
+nextstrain build \
+    --env AWS_ACCESS_KEY_ID \
+    --env AWS_SECRET_ACCESS_KEY \
+    . \
+        --config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1 \
+        -pf \
+        auspice/avian-flu_h5n1-cattle-outbreak_pb2_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_pb1_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_pa_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_np_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_na_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_mp_all-time.json \
+        auspice/avian-flu_h5n1-cattle-outbreak_ns_all-time.json
+```
+
 ## Creating a custom build
 The easiest way to generate your own, custom avian-flu build is to use the quickstart-build as a starting template. Simply clone the quickstart-build, run with the example data, and edit the Snakefile to customize. This build includes example data and a simplified, heavily annotated Snakefile that goes over the structure of Snakefiles and annotates rules and inputs/outputs that can be modified. This build, with it's own readme, is available [here](https://github.com/nextstrain/avian-flu/tree/master/quickstart-build).
 
-### Features unique to avian flu builds
+## Features unique to avian flu builds
 
 #### cleavage site annotations
 Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2.
@@ -73,26 +121,6 @@ Run the pipeline with `--config 'local_ingest=True'` to use the locally availabl
 Specifically, the files needed are `ingest/results/metadata.tsv` and `ingest/results/sequences_{SEGMENT}.fasta`.
 
 
-#### Running full genome builds
-
-Run full genome builds with the following command.
-
-``` bash
-nextstrain build \
-    --env AWS_ACCESS_KEY_ID \
-    --env AWS_SECRET_ACCESS_KEY \
-    . \
-        --snakefile Snakefile.genome \
-        --config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1
-```
-
-Currently this is only set up for the "h5n1-cattle-outbreak" build using NCBI data,
-and the build is restricted to a set of strains where we think there's no reassortment, with outgroups
-excluded in (`config/dropped_strains_h5n1-cattle-outbreak.txt`).
-Output files will be placed in `results/h5n1-cattle-outbreak/genome`.
-See `Snakefile.genome` for more details.
-
-
 ### To modify this build to work with your own data
 Although the simplest way to generate a custom build is via the quickstart build, you are welcome to clone this repo and use it as a starting point for running your own, local builds if you'd like. The [Nextstrain docs](https://docs.nextstrain.org/en/latest/index.html) are a fantastic resource for getting started with the Nextstrain pipeline, and include some [great tutorials](https://docs.nextstrain.org/en/latest/install.html) to get you started. This build is slightly more complicated than other builds, and has a few custom functions in it to accommodate all the features on [nextstrain.org](https://nextstrain.org/avian-flu), and makes use of wildcards for both subtypes and gene segments. If you'd like to adapt this full, non-simplified pipeline here to your own data (which you may want to do if you also want to annotate clades), you would need to make a few changes and additions:
 

diff --git a/Snakefile b/Snakefile
@@ -1,4 +1,5 @@
 include: "rules/common.smk"
+include: "rules/cattle-flu.smk"
 
 SUBTYPES = config.get('subtypes', ["h5nx", "h5n1", "h7n9", "h9n2"])
 SEGMENTS = config.get('segments', ["pb2", "pb1", "pa", "ha","np", "na", "mp", "ns"])
@@ -11,6 +12,11 @@ TARGET_SEQUENCES_PER_TREE = config.get('n_seqs', 3000)
 # (2) Filter the other segments by simply force-including the same strains as (1)
 SAME_STRAINS = bool(config.get('same_strains_per_segment', False))
 
+# constrain the wildcards to not include `_` which we use to separate "parts" of filenames (where a part may be a wildcard itself)
+wildcard_constraints:
+    subtype = "[^_]+",
+    segment = "[^_]+",
+    time = "[^_]+",
 
 def all_targets():
     return [
@@ -33,19 +39,23 @@ rule test_target:
 
 rule files:
     params:
-        dropped_strains = "config/dropped_strains_{subtype}.txt",
-        include_strains = "config/include_strains_{subtype}_{time}.txt",
-        reference = "config/reference_{subtype}_{segment}.gb",
-        colors = "config/colors_{subtype}.tsv",
-        lat_longs = "config/lat_longs_{subtype}.tsv",
+        dropped_strains = lambda w: "config/dropped_strains_{subtype}.txt" if not w.subtype=='h5n1-cattle-outbreak' else "config/dropped_strains_h5n1.txt",
+        include_strains = lambda w: "config/include_strains_{subtype}_{time}.txt" if not w.subtype=='h5n1-cattle-outbreak' else "config/include_strains_{subtype}.txt",
+        reference = lambda w: "config/reference_{subtype}_{segment}.gb" if not w.subtype=='h5n1-cattle-outbreak' else "config/reference_h5n1_{segment}.gb",
+        colors = lambda w: "config/colors_{subtype}.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "config/colors_h5n1.tsv",
+        # TODO - Augur 24.4.0 includes extensive lat-longs by default - can we drop the following avian-flu specific ones?
+        lat_longs = lambda w: "config/lat_longs_{subtype}.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "config/lat_longs_h5n1.tsv",
         auspice_config = "config/auspice_config_{subtype}.json",
-        clades_file = "clade-labeling/{subtype}-clades.tsv",
-        description = "config/description.md"
+        clades_file = lambda w: "clade-labeling/{subtype}-clades.tsv" if not w.subtype=='h5n1-cattle-outbreak' else "clade-labeling/h5n1-clades.tsv",
+        description = lambda w: "config/description.md" if not w.subtype=='h5n1-cattle-outbreak' else "config/description_{subtype}.md",
 
 files = rules.files.params
 
 def metadata_by_wildcards(wildcards):
-    if wildcards.subtype in ("h5n1", "h5nx"):
+    # H5 builds have extra clade-level metadata added to the metadata TSV.
+    # We may move this to a node-data JSON which would simplify the snakemake logic
+    # a bit -- see <https://github.com/nextstrain/avian-flu/issues/25>
+    if wildcards.subtype in ("h5n1", "h5nx", "h5n1-cattle-outbreak"):
         return "results/metadata-with-clade_{subtype}.tsv"
     else:
         return "data/metadata_{subtype}.tsv"
@@ -57,6 +67,8 @@ def group_by(w):
         'h7n9': {'all-time': 'division year'},
         'h9n2': {'all-time': 'country year'}
         }
+    if w.subtype == 'h5n1-cattle-outbreak':
+        return ''
     return gb[w.subtype][w.time]
 
 def min_length(w):
@@ -65,6 +77,8 @@ def min_length(w):
     return(length)
 
 def min_date(w):
+    if w.subtype == 'h5n1-cattle-outbreak':
+        return "2024"
     date = {
         'h5nx': {'all-time': '1996', '2y': '2Y'},
         'h5n1': {'all-time': '1996', '2y': '2Y'},
@@ -75,6 +89,7 @@ def min_date(w):
 
 def traits_columns(w):
     traits = {'h5nx':'region','h5n1': 'region country', 'h7n9': 'country division', 'h9n2': 'region country'}
+    traits['h5n1-cattle-outbreak'] = traits['h5n1']
     return traits[w.subtype]
 
 def clock_rate(w):
@@ -104,7 +119,8 @@ def clock_rate(w):
         'h5nx': {'all-time':'', '2y': clock_rates_h5nx[w.segment]},
         'h5n1': {'all-time':'', '2y': clock_rates_h5n1[w.segment]},
         'h7n9': {'all-time':''},
-        'h9n2': {'all-time':''}
+        'h9n2': {'all-time':''},
+        'h5n1-cattle-outbreak': {'all-time': clock_rates_h5n1[w.segment]}
         }
 
     return clock_rate[w.subtype][w.time]
@@ -115,7 +131,8 @@ def clock_rate_std_dev(w):
         'h5nx': {'all-time': '', '2y': '--clock-std-dev 0.00211'},
         'h5n1': {'all-time': '', '2y': '--clock-std-dev 0.00211'},
         'h7n9': {'all-time': ''},
-        'h9n2': {'all-time': ''}
+        'h9n2': {'all-time': ''},
+        'h5n1-cattle-outbreak': {'all-time': '--clock-std-dev 0.00211'}
         }
 
     return clock_rate_std_dev[w.subtype][w.time]
@@ -159,12 +176,18 @@ def _filter_params(wildcards, input, output, threads, resources):
     # strains may not have all segments, but that's preferable to filtering them out.
     restrict_n_segments = f"n_segments!={len(SEGMENTS)}" if SAME_STRAINS else ''
 
+    ## By default we filter out unknown country/region, but skip this for the cattle-flu outbreak just
+    ## in case there's something interesting with limited metadata
+    require_location = ' country=? region=? ' if wildcards.subtype!='h5n1-cattle-outbreak' else ''
+
+    grouping = f" --group-by {group_by(wildcards)}" if group_by(wildcards) else ""
+
     # formulate our typical filtering parameters
-    cmd  = f" --group-by {group_by(wildcards)}"
+    cmd  = grouping
     cmd += f" --subsample-max-sequences {TARGET_SEQUENCES_PER_TREE}"
     cmd += f" --min-date {min_date(wildcards)}"
     cmd += f" --include {input.include}"
-    cmd += f" --exclude-where host=laboratoryderived host=ferret host=unknown host=other host=host country=? region=? gisaid_clade=3C.2 {restrict_n_segments}"
+    cmd += f" --exclude-where host=laboratoryderived host=ferret host=unknown host=other host=host {require_location} gisaid_clade=3C.2 {restrict_n_segments}"
     cmd += f" --min-length {min_length(wildcards)}"
     cmd += f" --non-nucleotide"
     return cmd
@@ -274,6 +297,13 @@ rule refine:
         """
 
 def refined_tree(w):
+    """
+    Return the refined tree to be used for export, traits, ancestry reconstruction etc
+    The cattle-flu build introduces an additional step beyond `augur refine`, which is
+    why this function exists.
+    """
+    if w.subtype=='h5n1-cattle-outbreak':
+        return "results/tree_{subtype}_{segment}_{time}_outbreak-clade.nwk"
     return "results/tree_{subtype}_{segment}_{time}.nwk"
 
 rule ancestral:
@@ -347,14 +377,28 @@ rule cleavage_site:
         """
 
 def export_node_data_files(wildcards):
-    return [
+    nd = [
         rules.refine.output.node_data,
         rules.traits.output.node_data,
         rules.ancestral.output.node_data,
         rules.translate.output.node_data,
         rules.cleavage_site.output.cleavage_site_annotations,
         rules.cleavage_site.output.cleavage_site_sequences,
     ]
+    if wildcards.subtype=="h5n1-cattle-outbreak":
+        nd.append("results/tree_{subtype}_{segment}_{time}_outbreak-clade.json")
+    return nd
+
+
+def additional_export_config(wildcards):
+    args = ""
+
+    if wildcards.subtype == "h5n1-cattle-outbreak":
+        # The auspice-config is for the whole genome analysis, so override the title
+        segment = wildcards.segment.upper()
+        args += f"--title 'Ongoing influenza A/H5N1 cattle outbreak in North America ({segment} segment)'"
+
+    return args
 
 rule export:
     message: "Exporting data files for for auspice"
@@ -368,6 +412,8 @@ rule export:
         description = files.description
     output:
         auspice_json = "auspice/avian-flu_{subtype}_{segment}_{time}.json"
+    params:
+        additional_config = additional_export_config
     shell:
         """
         augur export v2 \
@@ -379,6 +425,7 @@ rule export:
             --auspice-config {input.auspice_config} \
             --description {input.description} \
             --include-root-sequence-inline \
+            {params.additional_config} \
             --output {output.auspice_json}
         """
 

diff --git a/rules/cattle-flu.smk b/rules/cattle-flu.smk
@@ -0,0 +1,38 @@
+
+rule download_tree:
+    """
+    Downloads the tree behind nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome
+    so that the segment-level builds can use it to restrict the strains used.
+    TODO: if the whole-genome analysis has been run locally we should (optionally) use that tree.
+    """
+    output:
+        tree = "results/tree_{subtype}_genome.json",
+    params:
+        dataset="https://data.nextstrain.org/avian-flu_h5n1-cattle-outbreak_genome.json"
+    wildcard_constraints:
+        subtype="h5n1-cattle-outbreak",
+        time="all-time",
+    shell:
+        """
+        curl --compressed {params.dataset} -o {output.tree}
+        """
+
+
+rule prune_tree:
+    input:
+        tree = "results/tree_{subtype}_{segment}_{time}.nwk",
+        strains = "results/tree_{subtype}_genome.json",
+    output:
+        tree = "results/tree_{subtype}_{segment}_{time}_outbreak-clade.nwk",
+        node_data = "results/tree_{subtype}_{segment}_{time}_outbreak-clade.json",
+    wildcard_constraints:
+        subtype="h5n1-cattle-outbreak",
+        time="all-time",
+    shell:
+        """
+        python3 scripts/restrict-via-common-ancestor.py \
+            --tree {input.tree} \
+            --strains {input.strains} \
+            --output-tree {output.tree} \
+            --output-metadata {output.node_data}
+        """
diff --git a/rules/common.smk b/rules/common.smk
@@ -10,6 +10,7 @@ def subtypes_by_subtype_wildcard(wildcards):
         'h7n9': ['h7n9'],
         'h9n2': ['h9n2'],
     }
+    db['h5n1-cattle-outbreak'] = [*db['h5nx']]
     return(db[wildcards.subtype])
 
 if LOCAL_INGEST: