Pull ingest and phylogenetic workflows from guide repo [#2]
Also apply pre-commit cleanup
genehack committed May 23, 2024
1 parent df1a6dd commit 3a2acd2
Showing 25 changed files with 1,122 additions and 0 deletions.
105 changes: 105 additions & 0 deletions ingest/README.md
@@ -0,0 +1,105 @@
# Ingest

This workflow ingests public data from NCBI and outputs curated
metadata and sequences that can be used as input for the phylogenetic
workflow.

## Workflow Usage

The workflow can be run from the top level pathogen repo directory:

```bash
nextstrain build ingest
```

Alternatively, the workflow can be run from within the ingest
directory:

```bash
cd ingest
nextstrain build .
```

This produces the default outputs of the ingest workflow:

- metadata = `results/metadata.tsv`
- sequences = `results/sequences.fasta`

### Dumping the full raw metadata from NCBI Datasets

The workflow has a target for dumping the full raw metadata from NCBI
Datasets.

```bash
nextstrain build ingest dump_ncbi_dataset_report
```

This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
which you can inspect to determine what fields and data to use if you
want to configure the workflow for your pathogen.
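
A quick way to scan the available fields is to print the header one
column per line. This is just a sketch: the inline sample header below
stands in for the real `ingest/data/ncbi_dataset_report_raw.tsv`, and
the column names shown are illustrative.

```bash
# Stand-in for the raw report header produced by the target above
# (the real file is ingest/data/ncbi_dataset_report_raw.tsv).
printf 'Accession\tOrganism Name\tIsolate Collection date\n' > report_header.tsv

# Print one column name per line for easy scanning.
head -n 1 report_header.tsv | tr '\t' '\n'
```

Run the same `head | tr` pipeline against the real dump to list its
actual columns.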

## Defaults

The defaults directory contains all of the default configurations for
the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the
default configuration parameters used for the ingest workflow. Use
Snakemake's `--configfile`/`--config` options to override these
default values.
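
One way to keep overrides out of the defaults is a small extra config
file passed via `--configfile`. A minimal sketch — the `ncbi_taxon_id`
key is hypothetical; check `defaults/config.yaml` for the real
parameter names:

```bash
# Create an override file; `ncbi_taxon_id` is an illustrative key only.
cat > my_overrides.yaml <<'EOF'
ncbi_taxon_id: 11137
EOF

# Then layer it on top of the defaults (run from the pathogen repo root):
#   nextstrain build ingest --configfile my_overrides.yaml
grep -q 'ncbi_taxon_id' my_overrides.yaml && echo "override file ready"
```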

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules
of the core ingest workflow. The modules of the workflow are in
separate files to keep the main ingest [Snakefile](Snakefile) succinct
and organized.

The `workdir` is hardcoded to be the ingest directory, so all
filepaths for inputs and outputs should be relative to the ingest
directory.

Modules are all
[included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

### Nextclade

Nextstrain is pushing to standardize ingest workflows with Nextclade
runs to include Nextclade outputs in our publicly hosted data.
However, if a Nextclade dataset does not already exist, creating one
requires curated data as input, so the Nextclade steps are optional
here.

If Nextclade config values are included, the Nextclade rules will
create the final metadata TSV by joining the Nextclade output with the
metadata. If Nextclade configs are not included, we rename the subset
metadata TSV to the final metadata TSV.
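
Conceptually, that join looks like the sketch below. The tiny inline
files stand in for the real subset metadata and Nextclade output, and
the `awk` one-liner only illustrates the idea — the actual rules in
`rules/nextclade.smk` use the workflow's own tooling.

```bash
# Stand-ins for data/subset_metadata.tsv and the Nextclade output TSV.
printf 'accession\tdate\nA1\t2024-01-02\n' > subset_metadata.tsv
printf 'accession\tclade\nA1\t24A\n' > nextclade.tsv

# Left-join the Nextclade column onto the metadata by accession.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { clade[$1] = $2; next }
     { print $0, clade[$1] }' nextclade.tsv subset_metadata.tsv > metadata.tsv

cat metadata.tsv
```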

To run Nextclade rules, include the `defaults/nextclade_config.yaml`
config file with:

```bash
nextstrain build ingest --configfile defaults/nextclade_config.yaml
```

> [!TIP]
> If the Nextclade dataset is stable and you always want to run the
> Nextclade rules as part of ingest, we recommend moving the Nextclade
> related config parameters from the `defaults/nextclade_config.yaml`
> file to the default config file `defaults/config.yaml`.

## Build configs

The build-configs directory contains custom configs and rules that
override and/or extend the default workflow.

- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds.

## Vendored

This repository uses
[`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies
of ingest scripts in [vendored](vendored), from
[nextstrain/ingest](https://github.com/nextstrain/ingest).

See [vendored/README.md](vendored/README.md#vendoring) for
instructions on how to update the vendored scripts.
86 changes: 86 additions & 0 deletions ingest/Snakefile
@@ -0,0 +1,86 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""


# The workflow filepaths are written relative to this Snakefile's base
# directory
workdir: workflow.current_basedir


# Use default configuration values. Override with Snakemake's
# --configfile/--config options.
configfile: "defaults/config.yaml"


# This is the default rule that Snakemake will run when there are no
# specified targets. The default output of the ingest workflow is
# usually the curated metadata and sequences. Nextstrain-maintained
# ingest workflows will produce metadata files with the standard
# Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well
# to minimize the customizations you will need for the downstream
# phylogenetic workflow.


# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
    input:
        "results/sequences.fasta",
        "results/metadata.tsv",


# Note that only PATHOGEN-level customizations should be added to
# these core steps, meaning they are custom rules necessary for all
# builds of the pathogen. If there are build-specific customizations,
# they should be added with the custom_rules imported below to ensure
# that the core workflow is not complicated by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# We are pushing to standardize ingest workflows with Nextclade runs
# to include Nextclade outputs in our publicly hosted data. However,
# if a Nextclade dataset does not already exist, creating one requires
# curated data as input, so we are making Nextclade steps optional
# here.
#
# If Nextclade config values are included, the nextclade rules will
# create the final metadata TSV by joining the Nextclade output with
# the metadata. If Nextclade configs are not included, we rename the
# subset metadata TSV to the final metadata TSV. To run nextclade.smk
# rules, include the `defaults/nextclade_config.yaml` config file with
# `nextstrain build ingest --configfile
# defaults/nextclade_config.yaml`.
if "nextclade" in config:

    include: "rules/nextclade.smk"

else:

    rule create_final_metadata:
        input:
            metadata="data/subset_metadata.tsv",
        output:
            metadata="results/metadata.tsv",
        shell:
            """
            mv {input.metadata} {output.metadata}
            """


# Allow users to import custom rules provided via the config.
# This allows users to run custom rules that can extend or override
# the workflow. A concrete example of using custom rules is the
# extension of the workflow with rules to support the Nextstrain
# automation that uploads files and sends internal Slack
# notifications. For extensions, the user will have to specify the
# custom rule targets when running the workflow. For overrides, the
# custom Snakefile will have to use the `ruleorder` directive to allow
# Snakemake to handle ambiguous rules:
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file
38 changes: 38 additions & 0 deletions ingest/build-configs/nextstrain-automation/README.md
@@ -0,0 +1,38 @@
# Nextstrain automation

> [!NOTE]
> External users can ignore this directory!
> This build config/customization is tailored for the internal Nextstrain team
> to extend the core ingest workflow for automated workflows.

## Update the config

Update the [config.yaml](config.yaml) for your pathogen:

1. Edit the `s3_dst` param to add the pathogen repository name.
2. Edit the `files_to_upload` param to a mapping of files you need to upload for your pathogen.
The default includes suggested files for uploading curated data and Nextclade outputs.

## Run the workflow

Provide the additional config file via the Snakemake options to
include the custom rules from [upload.smk](upload.smk) in the
workflow, and specify the `upload_all` target to run the additional
upload rules.

The upload rules will require AWS credentials for a user that has permissions
to upload to the Nextstrain data bucket.

The customized workflow can be run from the top-level pathogen repo
directory with:

```bash
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
ingest \
upload_all \
--configfile build-configs/nextstrain-automation/config.yaml
```

## Automated GitHub Action workflows

Additional instructions on how to use this with the shared
`pathogen-repo-build` GitHub Action workflow are forthcoming!
24 changes: 24 additions & 0 deletions ingest/build-configs/nextstrain-automation/config.yaml
@@ -0,0 +1,24 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run with additional Nextstrain automation rules.

# Custom rules to run as part of the Nextstrain automated workflow.
# The paths should be relative to the ingest directory.
custom_rules:
- build-configs/nextstrain-automation/upload.smk

# Nextstrain CloudFront domain, to ensure that we invalidate
# CloudFront after the S3 uploads. This is required as long as we are
# using the AWS CLI for uploads.
cloudfront_domain: "data.nextstrain.org"

# Nextstrain AWS S3 Bucket with pathogen prefix
s3_dst: "s3://nextstrain-data/files/workflows/seasonal-cov"

# Mapping of files to upload
## TODO verify
files_to_upload:
  ncbi.ndjson.zst: data/ncbi.ndjson
  metadata.tsv.zst: results/metadata.tsv
  sequences.fasta.zst: results/sequences.fasta
  alignments.fasta.zst: results/alignment.fasta
  translations.zip: results/translations.zip
43 changes: 43 additions & 0 deletions ingest/build-configs/nextstrain-automation/upload.smk
@@ -0,0 +1,43 @@
"""
This part of the workflow handles uploading files to AWS S3.
Files to upload must be defined in the `files_to_upload` config param, where
the keys are the remote files and the values are the local filepaths
relative to the ingest directory.
Produces a single file for each uploaded file:
"results/upload/{remote_file}.upload"
The rule `upload_all` can be used as a target to upload all files.
"""

import os

slack_envvars_defined = "SLACK_CHANNELS" in os.environ and "SLACK_TOKEN" in os.environ
send_notifications = config.get("send_slack_notifications", False) and slack_envvars_defined


rule upload_to_s3:
    input:
        file_to_upload=lambda wildcards: config["files_to_upload"][wildcards.remote_file],
    output:
        "results/upload/{remote_file}.upload",
    params:
        quiet="" if send_notifications else "--quiet",
        s3_dst=config["s3_dst"],
        cloudfront_domain=config["cloudfront_domain"],
    shell:
        """
        ./vendored/upload-to-s3 \
            {params.quiet} \
            {input.file_to_upload:q} \
            {params.s3_dst:q}/{wildcards.remote_file:q} \
            {params.cloudfront_domain} 2>&1 | tee {output}
        """


rule upload_all:
    input:
        uploads=[f"results/upload/{remote_file}.upload" for remote_file in config["files_to_upload"].keys()],
    output:
        touch("results/upload_all.done"),
6 changes: 6 additions & 0 deletions ingest/defaults/annotations.tsv
@@ -0,0 +1,6 @@
# Manually curated annotations TSV file
# The TSV should not have a header and should have exactly three columns:
# id to match existing metadata, field name, and field value
# If there are multiple annotations for the same id and field, then the last value is used
# Lines starting with '#' are treated as comments
# Any '#' after the field value is treated as a comment.