Update ingest and make it work [#2]
* sync `ingest/README` with `seasonal-cov` version
* move from `defaults` to `config` for config files
* strip fetch-from-entrez stuff from config and rules files
* add `ncbi_taxon_id` to config
* strip guidance comments, light reformat of Snakemake and rules files
  for readability
* add benchmarks where missing
* remove unused nextclade bits
genehack committed Jun 24, 2024
1 parent cdbd7f6 commit f0e18fe
Showing 10 changed files with 65 additions and 384 deletions.
111 changes: 24 additions & 87 deletions ingest/README.md
Original file line number Diff line number Diff line change
@@ -1,105 +1,42 @@
# Ingest
# Ingest workflow

This workflow ingests public data from NCBI and outputs curated
metadata and sequences that can be used as input for the phylogenetic
workflow.

## Workflow Usage
If you have another data source or private data that needs to be
formatted for the phylogenetic workflow, then you can use a similar
workflow to curate your own data.

The workflow can be run from the top level pathogen repo directory:
## Config

```bash
nextstrain build ingest
```

Alternatively, the workflow can also be run from within the ingest
directory:

```bash
cd ingest
nextstrain build .
```

This produces the default outputs of the ingest workflow:

- metadata = results/metadata.tsv
- sequences = results/sequences.fasta

### Dumping the full raw metadata from NCBI Datasets

The workflow has a target for dumping the full raw metadata from NCBI
Datasets.

```bash
nextstrain build ingest dump_ncbi_dataset_report
```

This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
which you can inspect to determine what fields and data to use if you
want to configure the workflow for your pathogen.

## Defaults

The defaults directory contains all of the default configurations for
The config directory contains all of the default configurations for
the ingest workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the
default configuration parameters used for the ingest workflow. Use
Snakemake's `--configfile`/`--config` options to override these
default values.
[config/defaults.yaml][] contains all of the default configuration
parameters used for the ingest workflow. Use Snakemake's
`--configfile`/`--config` options to override these default values.

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules
of the core ingest workflow. The modules of the workflow are in
separate files to keep the main ingest [Snakefile](Snakefile) succinct
and organized.

The `workdir` is hardcoded to be the ingest directory so all filepaths
for inputs/outputs should be relative to the ingest directory.

Modules are all
[included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

### Nextclade

Nextstrain is pushing to standardize ingest workflows with Nextclade
runs to include Nextclade outputs in our publicly hosted data.
However, if a Nextclade dataset does not already exist, it requires
curated data as input, so we are making Nextclade steps optional here.

If Nextclade config values are included, the Nextclade rules will
create the final metadata TSV by joining the Nextclade output with the
metadata. If Nextclade configs are not included, we rename the subset
metadata TSV to the final metadata TSV.

To run Nextclade rules, include the `defaults/nextclade_config.yaml`
config file with:

```bash
nextstrain build ingest --configfile defaults/nextclade_config.yaml
```

> [!TIP]
> If the Nextclade dataset is stable and you always want to run the
> Nextclade rules as part of ingest, we recommend moving the Nextclade
> related config parameters from the `defaults/nextclade_config.yaml`
> file to the default config file `defaults/config.yaml`.
## Build configs

The build-configs directory contains custom configs and rules that
override and/or extend the default workflow.

- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds.
separate files to keep the main ingest [Snakefile][] succinct and
organized. Modules are all [included][] in the main Snakefile in the
order that they are expected to run.

## Vendored

This repository uses
[`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies
of ingest scripts in [vendored](vendored), from
[nextstrain/ingest](https://github.com/nextstrain/ingest).
This repository uses [`git subrepo`][] to manage copies of ingest
scripts in [vendored][], from [nextstrain/ingest][]

See [vendored/README.md][] for instructions on how to update the
vendored scripts.

See [vendored/README.md](vendored/README.md#vendoring) for
instructions on how to update the vendored scripts.
[config/defaults.yaml]: ./config/defaults.yaml
[`git subrepo`]: https://github.com/ingydotnet/git-subrepo
[included]: https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes
[nextstrain/ingest]: https://github.com/nextstrain/ingest
[Snakefile]: ./Snakefile
[vendored]: ./vendored
[vendored/README.md]: ./vendored/README.md#vendoring
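
The README's note on `--configfile`/`--config` describes layering user-supplied values over the defaults. The idea can be sketched in Python as below. `merge_config` is a hypothetical helper written for illustration, not Snakemake's own code, and the recursive merge of nested keys such as `curate` is an assumption about how the override behaves; `my_annotations.tsv` is a made-up file name.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` layered on top."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key rather than replaced whole.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A small excerpt of the default config shown in this commit.
defaults = {
    "ncbi_taxon_id": "11089",
    "curate": {
        "annotations": "config/annotations.tsv",
        "annotations_id": "accession",
    },
}
# Hypothetical user override, as might be supplied in an extra config file.
overrides = {"curate": {"annotations": "my_annotations.tsv"}}

config = merge_config(defaults, overrides)
print(config["curate"])
```

Under this sketch only `curate.annotations` changes; sibling keys such as `annotations_id` keep their default values.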
70 changes: 8 additions & 62 deletions ingest/Snakefile
@@ -1,58 +1,20 @@
"""
This is the main ingest Snakefile that orchestrates the full ingest workflow
and defines its default outputs.
"""
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "config/config.yaml"


# The workflow filepaths are written relative to this Snakefile's base
# directory
workdir: workflow.current_basedir


# Use default configuration values. Override with Snakemake's
# --configfile/--config options.
configfile: "defaults/config.yaml"


# This is the default rule that Snakemake will run when there are no
# specified targets. The default output of the ingest workflow is
# usually the curated metadata and sequences. Nextstrain-maintained
# ingest workflows will produce metadata files with the standard
# Nextstrain fields and additional fields that are pathogen specific.
# We recommend using these standard fields in custom ingests as well
# to minimize the customizations you will need for the downstream
# phylogenetic workflow.


# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
input:
"results/sequences.fasta",
"results/metadata.tsv",


# Note that only PATHOGEN-level customizations should be added to
# these core steps, meaning they are custom rules necessary for all
# builds of the pathogen. If there are build-specific customizations,
# they should be added with the custom_rules imported below to ensure
# that the core workflow is not complicated by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"


# We are pushing to standardize ingest workflows with Nextclade runs
# to include Nextclade outputs in our publicly hosted data. However,
# if a Nextclade dataset does not already exist, creating one requires
# curated data as input, so we are making Nextclade steps optional
# here.
#
# If Nextclade config values are included, the nextclade rules will
# create the final metadata TSV by joining the Nextclade output with
# the metadata. If Nextclade configs are not included, we rename the
# subset metadata TSV to the final metadata TSV. To run nextclade.smk
# rules, include the `defaults/nextclade_config.yaml` config file with
# `nextstrain build ingest --configfile
# defaults/nextclade_config.yaml`.
# If included, the nextclade rules will create the final metadata TSV
# by joining the Nextclade output with the metadata. However, if not
# including nextclade, we have to rename the subset metadata TSV to
# the final metadata TSV.
if "nextclade" in config:

include: "rules/nextclade.smk"
@@ -61,26 +23,10 @@ else:

rule create_final_metadata:
input:
metadata="data/subset_metadata.tsv",
metadata="results/subset_metadata.tsv",
output:
metadata="results/metadata.tsv",
shell:
"""
mv {input.metadata} {output.metadata}
mv {input.metadata:q} {output.metadata:q}
"""


# Allow users to import custom rules provided via the config.
# This allows users to run custom rules that can extend or override
# the workflow. A concrete example of using custom rules is the
# extension of the workflow with rules to support the Nextstrain
# automation that uploads files and sends internal Slack
# notifications. For extensions, the user will have to specify the
# custom rule targets when running the workflow. For overrides, the
# custom Snakefile will have to use the `ruleorder` directive to allow
# Snakemake to handle ambiguous rules
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules
if "custom_rules" in config:
for rule_file in config["custom_rules"]:

include: rule_file
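
The `create_final_metadata` change above swaps `{input.metadata}` for `{input.metadata:q}`. The sketch below illustrates why quoting matters once a path contains a space. It uses Python's `shlex.quote` as a stand-in; treating `:q` as equivalent to `shlex.quote` is an assumption made for illustration, and the file name with a space is hypothetical.

```python
import shlex

src = "results/subset metadata.tsv"  # hypothetical path containing a space
dst = "results/metadata.tsv"

unquoted = f"mv {src} {dst}"
quoted = f"mv {shlex.quote(src)} {shlex.quote(dst)}"

# shlex.split tokenizes a command line the way a POSIX shell would.
print(shlex.split(unquoted))  # the source path is split into two arguments
print(shlex.split(quoted))    # the source path stays a single argument
```

With quoting, `mv` receives exactly two path arguments regardless of what characters the paths contain.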
File renamed without changes.
34 changes: 14 additions & 20 deletions ingest/defaults/config.yaml → ingest/config/config.yaml
@@ -1,14 +1,5 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run to completion.
#
# Define optional config parameters with their default values here so that users
# do not have to dig through the workflows to figure out the default values

# Required to fetch from Entrez
entrez_search_term: ""

# Required to fetch from NCBI Datasets
ncbi_taxon_id: ""
# taxon for `yellow fever virus`
ncbi_taxon_id: "11089"

# The list of NCBI Datasets fields to include from NCBI Datasets output
# These need to be the "mnemonics" of the NCBI Datasets fields, see docs for full list of fields
@@ -34,16 +25,18 @@ ncbi_datasets_fields:

# Config parameters related to the curate pipeline
curate:
# URL pointed to public generalized geolocation rules
# URL pointed to public generalized geolocation rules.
# For the Nextstrain team, this is currently
# "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv"
# "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv".
geolocation_rules_url: "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv"
# The path to the local geolocation rules within the pathogen repo
# The path should be relative to the ingest directory.
local_geolocation_rules: "defaults/geolocation_rules.tsv"
# List of field names to change where the key is the original field name and the value is the new field name
# The original field names should match the ncbi_datasets_fields provided above.
# This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
local_geolocation_rules: "config/geolocation_rules.tsv"
# List of field names to change where the key is the original field
# name and the value is the new field name. The original field names
# should match the ncbi_datasets_fields provided above. This is the
# first step in the pipeline, so any references to field names in
# the configs below should use the new field names.
field_map:
accession: accession
accession_version: accession_version
@@ -69,8 +62,9 @@ curate:
strain_backup_fields: ["accession"]
# List of date fields to standardize to ISO format YYYY-MM-DD
date_fields: ["date", "date_released", "date_updated"]
# List of expected date formats that are present in the date fields provided above
# These date formats should use directives expected by datetime
# List of expected date formats that are present in the date fields
# provided above. These date formats should use directives expected
# by datetime.
# See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
expected_date_formats: ["%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"]
titlecase:
@@ -107,7 +101,7 @@ curate:
abbr_authors_field: "abbr_authors"
# Path to the manual annotations file
# The path should be relative to the ingest directory
annotations: "defaults/annotations.tsv"
annotations: "config/annotations.tsv"
# The ID field in the metadata to use to merge the manual annotations
annotations_id: "accession"
# The ID field in the metadata to use as the sequence id in the output FASTA file
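
The `date_fields` and `expected_date_formats` settings above describe trying a list of `datetime` format strings against each date value and standardizing the result to `YYYY-MM-DD`. A minimal sketch of that idea follows. `standardize_date` is a hypothetical function, not the workflow's actual curate script, and masking missing month or day parts with `XX` is an assumption about the desired output for incomplete dates.

```python
from datetime import datetime

# Same list as `expected_date_formats` in the config above.
EXPECTED_DATE_FORMATS = ["%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"]


def standardize_date(value: str, formats: list[str] = EXPECTED_DATE_FORMATS) -> str:
    """Return `value` as YYYY-MM-DD, masking parts the format lacks with XX."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not match; try the next one
        # Assumption: a format without %m or %d means that part is unknown.
        month = f"{parsed.month:02d}" if "%m" in fmt else "XX"
        day = f"{parsed.day:02d}" if "%d" in fmt else "XX"
        return f"{parsed.year:04d}-{month}-{day}"
    raise ValueError(f"{value!r} matches none of {formats}")


for raw in ["2020", "2020-05", "2020-05-17T08:30:00Z"]:
    print(raw, "->", standardize_date(raw))
```

Each value is matched against the first format that parses it completely, so the order of the list only matters when two formats could both accept the same string.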
File renamed without changes.
12 changes: 0 additions & 12 deletions ingest/defaults/nextclade_config.yaml

This file was deleted.

18 changes: 0 additions & 18 deletions ingest/defaults/nextclade_field_map.tsv

This file was deleted.

25 changes: 10 additions & 15 deletions ingest/rules/curate.smk
@@ -13,15 +13,13 @@ OUTPUTS:
"""


# The following two rules can be ignored if you choose not to use the
# generalized geolocation rules that are shared across pathogens.
# The Nextstrain team will try to maintain a generalized set of geolocation
# rules that can then be overridden by local geolocation rules per pathogen repo.
rule fetch_general_geolocation_rules:
output:
general_geolocation_rules="data/general-geolocation-rules.tsv",
params:
geolocation_rules_url=config["curate"]["geolocation_rules_url"],
benchmark:
"benchmarks/fetch_general_geolocation_rules.txt"
shell:
"""
curl {params.geolocation_rules_url} > {output.general_geolocation_rules}
@@ -34,10 +32,12 @@ rule concat_geolocation_rules:
local_geolocation_rules=config["curate"]["local_geolocation_rules"],
output:
all_geolocation_rules="data/all-geolocation-rules.tsv",
benchmark:
"benchmarks/concat_geolocation_rules.txt"
shell:
# why is this `>>` and not `>`
"""
cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules}
cat {input.general_geolocation_rules} {input.local_geolocation_rules} \
> {output.all_geolocation_rules}
"""
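
The `concat_geolocation_rules` hunk above replaces `>>` with `>`. The sketch below shows the difference between the two redirections when the command runs twice against the same target. It assumes a stale target file is still present on the second run, which may not happen under Snakemake itself; `concat` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def concat(sources: list[Path], target: Path, mode: str) -> None:
    """Write every source into target; mode "a" mimics `>>`, "w" mimics `>`."""
    with target.open(mode) as out:
        for source in sources:
            out.write(source.read_text())


with tempfile.TemporaryDirectory() as tmp:
    general = Path(tmp, "general.tsv")
    local = Path(tmp, "local.tsv")
    general.write_text("a\tb\n")
    local.write_text("c\td\n")

    appended = Path(tmp, "appended.tsv")
    truncated = Path(tmp, "truncated.tsv")
    for _ in range(2):  # simulate running the command twice
        concat([general, local], appended, "a")
        concat([general, local], truncated, "w")

    appended_lines = len(appended.read_text().splitlines())
    truncated_lines = len(truncated.read_text().splitlines())
    print(appended_lines, truncated_lines)
```

Appending accumulates duplicate rules on every run, while truncating always yields one copy of each input.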


@@ -48,17 +48,9 @@ def format_field_map(field_map: dict[str, str]) -> str:
return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])


# This curate pipeline is based on existing pipelines for pathogen repos using NCBI data.
# You may want to add and/or remove steps from the pipeline for custom metadata
# curation for your pathogen. Note that the curate pipeline is streaming NDJSON
# records between scripts, so any custom scripts added to the pipeline should expect
# the input as NDJSON records from stdin and output NDJSON records to stdout.
# The final step of the pipeline should convert the NDJSON records to two
# separate files: a metadata TSV and a sequences FASTA.
rule curate:
input:
sequences_ndjson="data/ncbi.ndjson",
# Change the geolocation_rules input path if you are removing the above two rules
all_geolocation_rules="data/all-geolocation-rules.tsv",
annotations=config["curate"]["annotations"],
output:
@@ -124,8 +116,11 @@ rule subset_metadata:
subset_metadata="data/subset_metadata.tsv",
params:
metadata_fields=",".join(config["curate"]["metadata_columns"]),
benchmark:
"benchmarks/subset_metadata.txt"
shell:
"""
tsv-select -H -f {params.metadata_fields} \
{input.metadata} > {output.subset_metadata}
{input.metadata} \
> {output.subset_metadata}
"""
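
The `subset_metadata` rule above joins `metadata_columns` with commas and passes the result to `tsv-select -H -f`. A rough Python equivalent of that column selection is sketched below for readers without `tsv-select` at hand. `select_columns` is a hypothetical helper, the sample rows and column list are made up, and the `csv` module is only a stand-in, so its quoting behaviour may differ from the real tool.

```python
import csv
import io


def select_columns(tsv_text: str, fields: list[str]) -> str:
    """Keep only `fields`, in the given order, from a TSV with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=fields,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


metadata = "accession\tstrain\tdate\nAB1\tX/1\t2020-05-17\n"  # made-up sample
metadata_columns = ["accession", "date"]  # hypothetical column list
metadata_fields = ",".join(metadata_columns)  # mirrors the rule's `params`
subset = select_columns(metadata, metadata_fields.split(","))
print(subset)
```

The output keeps the header row and emits columns in the order given by the field list, not the order in the input file.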