Pull ingest and phylogenetic workflows from guide repo [#2]
Also apply pre-commit cleanup
genehack committed May 23, 2024
1 parent df1a6dd commit 3a2acd2
Showing 25 changed files with 1,122 additions and 0 deletions.
105 changes: 105 additions & 0 deletions ingest/README.md
@@ -0,0 +1,105 @@
# Ingest

This workflow ingests public data from NCBI and outputs curated
metadata and sequences that can be used as input for the phylogenetic
workflow.

## Workflow Usage

The workflow can be run from the top level pathogen repo directory:

```bash
nextstrain build ingest
```

Alternatively, the workflow can be run from within the ingest
directory:

```bash
cd ingest
nextstrain build .
```

This produces the default outputs of the ingest workflow:

- metadata = `results/metadata.tsv`
- sequences = `results/sequences.fasta`

### Dumping the full raw metadata from NCBI Datasets

The workflow has a target for dumping the full raw metadata from NCBI
Datasets.

```bash
nextstrain build ingest dump_ncbi_dataset_report
```

This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
which you can inspect to determine what fields and data to use if you
want to configure the workflow for your pathogen.
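
A quick way to scan the available fields is to print the header one
column per line. This is just a sketch: the inline sample header below
stands in for the real `ingest/data/ncbi_dataset_report_raw.tsv`, and
the column names shown are illustrative.

```bash
# Stand-in for the raw report header produced by the target above
# (the real file is ingest/data/ncbi_dataset_report_raw.tsv).
printf 'Accession\tOrganism Name\tIsolate Collection date\n' > report_header.tsv

# Print one column name per line for easy scanning.
head -n 1 report_header.tsv | tr '\t' '\n'
```

Run the same `head | tr` pipeline against the real dump to list its
actual columns.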

## Defaults

The defaults directory contains all of the default configurations for
the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the
default configuration parameters used for the ingest workflow. Use
Snakemake's `--configfile`/`--config` options to override these
default values.
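
One way to keep overrides out of the defaults is a small extra config
file passed via `--configfile`. A minimal sketch — the `ncbi_taxon_id`
key is hypothetical; check `defaults/config.yaml` for the real
parameter names:

```bash
# Create an override file; `ncbi_taxon_id` is an illustrative key only.
cat > my_overrides.yaml <<'EOF'
ncbi_taxon_id: 11137
EOF

# Then layer it on top of the defaults (run from the pathogen repo root):
#   nextstrain build ingest --configfile my_overrides.yaml
grep -q 'ncbi_taxon_id' my_overrides.yaml && echo "override file ready"
```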

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules
of the core ingest workflow. The modules of the workflow are in
separate files to keep the main ingest [Snakefile](Snakefile) succinct
and organized.

The `workdir` is hardcoded to be the ingest directory, so all
filepaths for inputs and outputs should be relative to the ingest
directory.

Modules are all
[included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

### Nextclade

Nextstrain is pushing to standardize ingest workflows with Nextclade
runs to include Nextclade outputs in our publicly hosted data.
However, if a Nextclade dataset does not already exist, creating one
requires curated data as input, so the Nextclade steps are optional
here.

If Nextclade config values are included, the Nextclade rules will
create the final metadata TSV by joining the Nextclade output with the
metadata. If Nextclade configs are not included, we rename the subset
metadata TSV to the final metadata TSV.
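
Conceptually, that join looks like the sketch below. The tiny inline
files stand in for the real subset metadata and Nextclade output, and
the `awk` one-liner only illustrates the idea — the actual rules in
`rules/nextclade.smk` use the workflow's own tooling.

```bash
# Stand-ins for data/subset_metadata.tsv and the Nextclade output TSV.
printf 'accession\tdate\nA1\t2024-01-02\n' > subset_metadata.tsv
printf 'accession\tclade\nA1\t24A\n' > nextclade.tsv

# Left-join the Nextclade column onto the metadata by accession.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { clade[$1] = $2; next }
     { print $0, clade[$1] }' nextclade.tsv subset_metadata.tsv > metadata.tsv

cat metadata.tsv
```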

To run Nextclade rules, include the `defaults/nextclade_config.yaml`
config file with:

```bash
nextstrain build ingest --configfile defaults/nextclade_config.yaml
```

> [!TIP]
> If the Nextclade dataset is stable and you always want to run the
> Nextclade rules as part of ingest, we recommend moving the Nextclade
> related config parameters from the `defaults/nextclade_config.yaml`
> file to the default config file `defaults/config.yaml`.

## Build configs

The build-configs directory contains custom configs and rules that
override and/or extend the default workflow.

- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds.

## Vendored

This repository uses
[`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies
of ingest scripts in [vendored](vendored), from
[nextstrain/ingest](https://github.com/nextstrain/ingest).

See [vendored/README.md](vendored/README.md#vendoring) for
instructions on how to update the vendored scripts.
86 changes: 86 additions & 0 deletions ingest/Snakefile
@@ -0,0 +1,86 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""


# The workflow filepaths are written relative to this Snakefile's base
# directory
workdir: workflow.current_basedir


# Use default configuration values. Override with Snakemake's
# --configfile/--config options.
configfile: "defaults/config.yaml"


# This is the default rule that Snakemake will run when there are no
# specified targets. The default output of the ingest workflow is
# usually the curated metadata and sequences. Nextstrain-maintained
# ingest workflows will produce metadata files with the standard
# Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well
# to minimize the customizations you will need for the downstream
# phylogenetic workflow.


# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
    input:
        "results/sequences.fasta",
        "results/metadata.tsv",


# Note that only PATHOGEN-level customizations should be added to
# these core steps, meaning they are custom rules necessary for all
# builds of the pathogen. If there are build-specific customizations,
# they should be added with the custom_rules imported below to ensure
# that the core workflow is not complicated by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# We are pushing to standardize ingest workflows with Nextclade runs
# to include Nextclade outputs in our publicly hosted data. However,
# if a Nextclade dataset does not already exist, creating one requires
# curated data as input, so we are making Nextclade steps optional
# here.
#
# If Nextclade config values are included, the nextclade rules will
# create the final metadata TSV by joining the Nextclade output with
# the metadata. If Nextclade configs are not included, we rename the
# subset metadata TSV to the final metadata TSV. To run nextclade.smk
# rules, include the `defaults/nextclade_config.yaml` config file with
# `nextstrain build ingest --configfile
# defaults/nextclade_config.yaml`.
if "nextclade" in config:

    include: "rules/nextclade.smk"

else:

    rule create_final_metadata:
        input:
            metadata="data/subset_metadata.tsv",
        output:
            metadata="results/metadata.tsv",
        shell:
            """
            mv {input.metadata} {output.metadata}
            """


# Allow users to import custom rules provided via the config.
# This allows users to run custom rules that can extend or override
# the workflow. A concrete example of using custom rules is the
# extension of the workflow with rules to support the Nextstrain
# automation that uploads files and sends internal Slack
# notifications. For extensions, the user will have to specify the
# custom rule targets when running the workflow. For overrides, the
# custom Snakefile will have to use the `ruleorder` directive to allow
# Snakemake to handle ambiguous rules:
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file
38 changes: 38 additions & 0 deletions ingest/build-configs/nextstrain-automation/README.md
@@ -0,0 +1,38 @@
# Nextstrain automation

> [!NOTE]
> External users can ignore this directory!
> This build config/customization is tailored for the internal Nextstrain team
> to extend the core ingest workflow for automated workflows.

## Update the config

Update the [config.yaml](config.yaml) for your pathogen:

1. Edit the `s3_dst` param to add the pathogen repository name.
2. Edit the `files_to_upload` param to a mapping of files you need to upload for your pathogen.
The default includes suggested files for uploading curated data and Nextclade outputs.

## Run the workflow

Provide the additional config file via the Snakemake options to
include the custom rules from [upload.smk](upload.smk) in the
workflow, and specify the `upload_all` target to run the additional
upload rules.

The upload rules will require AWS credentials for a user that has permissions
to upload to the Nextstrain data bucket.

The customized workflow can be run from the top-level pathogen repo
directory with:

```bash
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
ingest \
upload_all \
--configfile build-configs/nextstrain-automation/config.yaml
```

## Automated GitHub Action workflows

Additional instructions on how to use this with the shared
`pathogen-repo-build` GitHub Action workflow are forthcoming!
24 changes: 24 additions & 0 deletions ingest/build-configs/nextstrain-automation/config.yaml
@@ -0,0 +1,24 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run with additional Nextstrain automation rules.

# Custom rules to run as part of the Nextstrain automated workflow.
# The paths should be relative to the ingest directory.
custom_rules:
- build-configs/nextstrain-automation/upload.smk

# Nextstrain CloudFront domain, to ensure that we invalidate
# CloudFront after the S3 uploads. This is required as long as we are
# using the AWS CLI for uploads.
cloudfront_domain: "data.nextstrain.org"

# Nextstrain AWS S3 Bucket with pathogen prefix
s3_dst: "s3://nextstrain-data/files/workflows/seasonal-cov"

# Mapping of files to upload
## TODO verify
files_to_upload:
  ncbi.ndjson.zst: data/ncbi.ndjson
  metadata.tsv.zst: results/metadata.tsv
  sequences.fasta.zst: results/sequences.fasta
  alignments.fasta.zst: results/alignment.fasta
  translations.zip: results/translations.zip
43 changes: 43 additions & 0 deletions ingest/build-configs/nextstrain-automation/upload.smk
@@ -0,0 +1,43 @@
"""
This part of the workflow handles uploading files to AWS S3.
Files to upload must be defined in the `files_to_upload` config param, where
the keys are the remote files and the values are the local filepaths
relative to the ingest directory.
Produces a single file for each uploaded file:
"results/upload/{remote_file}.upload"
The rule `upload_all` can be used as a target to upload all files.
"""

import os

slack_envvars_defined = "SLACK_CHANNELS" in os.environ and "SLACK_TOKEN" in os.environ
send_notifications = config.get("send_slack_notifications", False) and slack_envvars_defined


rule upload_to_s3:
    input:
        file_to_upload=lambda wildcards: config["files_to_upload"][wildcards.remote_file],
    output:
        "results/upload/{remote_file}.upload",
    params:
        quiet="" if send_notifications else "--quiet",
        s3_dst=config["s3_dst"],
        cloudfront_domain=config["cloudfront_domain"],
    shell:
        """
        ./vendored/upload-to-s3 \
            {params.quiet} \
            {input.file_to_upload:q} \
            {params.s3_dst:q}/{wildcards.remote_file:q} \
            {params.cloudfront_domain} 2>&1 | tee {output}
        """


rule upload_all:
    input:
        uploads=[f"results/upload/{remote_file}.upload" for remote_file in config["files_to_upload"].keys()],
    output:
        touch("results/upload_all.done"),
6 changes: 6 additions & 0 deletions ingest/defaults/annotations.tsv
@@ -0,0 +1,6 @@
# Manually curated annotations TSV file
# The TSV should not have a header and should have exactly three columns:
# id to match existing metadata, field name, and field value
# If there are multiple annotations for the same id and field, then the last value is used
# Lines starting with '#' are treated as comments
# Any '#' after the field value is treated as a comment.