nextstrain.org/dengue/ingest

This is the ingest pipeline for dengue virus sequences.

Software requirements

Follow the standard installation instructions for Nextstrain's suite of software tools.

Usage

All workflows are expected to be run from the top-level pathogen repo directory.

Fetch sequences with

nextstrain build ingest data/sequences.ndjson

Run the complete ingest pipeline with

nextstrain build ingest

This will produce 10 files (within the ingest directory):

A pair of files with all the dengue sequences:

ingest/results/metadata_all.tsv
ingest/results/sequences_all.fasta

A pair of files for each dengue serotype (denv1 - denv4):

ingest/results/metadata_denv1.tsv
ingest/results/sequences_denv1.fasta
ingest/results/metadata_denv2.tsv
ingest/results/sequences_denv2.fasta
ingest/results/metadata_denv3.tsv
ingest/results/sequences_denv3.fasta
ingest/results/metadata_denv4.tsv
ingest/results/sequences_denv4.fasta
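
To confirm that a run produced all ten files, a small shell check along these lines may help. This is only a sketch: the function name `check_ingest_outputs` is hypothetical and not part of this repository, and it assumes the results sit under `ingest/results` relative to the current directory unless another directory is passed as the first argument.

```sh
# Hypothetical helper (not part of this repo): report any expected
# ingest output that is missing or empty.
check_ingest_outputs() {
    results_dir="${1:-ingest/results}"
    missing=0
    # "all" plus the four serotypes gives the five metadata/sequence pairs.
    for name in all denv1 denv2 denv3 denv4; do
        for file in "metadata_${name}.tsv" "sequences_${name}.fasta"; do
            # -s is true only if the file exists and has a size greater than zero.
            if [ ! -s "${results_dir}/${file}" ]; then
                echo "missing or empty: ${results_dir}/${file}" >&2
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only when every expected file was found.
    [ "$missing" -eq 0 ]
}
```

Calling `check_ingest_outputs` from the top-level repo directory after `nextstrain build ingest` returns a non-zero status and lists the affected paths on stderr if anything is missing.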

Run the complete ingest pipeline and upload results to AWS S3 with

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    ingest \
        upload_all \
        --configfile build-configs/nextstrain-automation/config.yaml

Adding new sequences not from GenBank

Static Files

Do the following to include sequences from static FASTA files.

Convert the FASTA files to NDJSON files with:

./ingest/scripts/fasta-to-ndjson \
    --fasta {path-to-fasta-file} \
    --fields {fasta-header-field-names} \
    --separator {field-separator-in-header} \
    --exclude {fields-to-exclude-in-output} \
    > ingest/data/{file-name}.ndjson

Add the following to the .gitignore to allow the file to be included in the repo:
```
!ingest/data/{file-name}.ndjson
```
Add the file-name (without the .ndjson extension) as a source to ingest/defaults/config.yaml. This will tell the ingest pipeline to concatenate the records to the GenBank sequences and run them through the same transform pipeline.
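
Before adding the converted file as a source, it can be worth checking that the conversion step above kept every record. The sketch below relies on two general format properties rather than on anything specific to this repository: each FASTA record starts with a `>` header line, and NDJSON holds one JSON record per line. The function name `check_record_counts` is hypothetical.

```sh
# Hypothetical helper (not part of this repo): compare the number of
# FASTA records with the number of NDJSON records.
check_record_counts() {
    fasta="$1"
    ndjson="$2"
    # Each FASTA record begins with a ">" header line.
    # grep -c exits non-zero when the count is 0, so guard with "|| true".
    fasta_count=$(grep -c '^>' "$fasta" || true)
    # NDJSON stores one record per line; "grep -c ." skips blank lines.
    ndjson_count=$(grep -c . "$ndjson" || true)
    echo "fasta=${fasta_count} ndjson=${ndjson_count}"
    # Succeed only when both counts agree.
    [ "$fasta_count" -eq "$ndjson_count" ]
}
```

For example, `check_record_counts {path-to-fasta-file} ingest/data/{file-name}.ndjson` prints both counts and returns a non-zero status if they differ.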

Configuration

Configuration takes place in defaults/config.yaml by default. Optional configs for uploading files are in build-configs/nextstrain-automation/config.yaml. Both paths are relative to the ingest directory.

Environment Variables

The complete ingest pipeline with AWS S3 uploads uses the following environment variables:

Required

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

Optional

These are optional environment variables used in our automated pipeline.

GITHUB_RUN_ID - provided via github.run_id in a GitHub Action workflow
AWS_BATCH_JOB_ID - provided via AWS Batch Job environment variables
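
Since the upload run depends on the two required variables, a pre-flight check like the sketch below can fail early with a clear message instead of partway through the pipeline. The function name `require_env` is hypothetical and not part of this repository; it takes the variable names to check as arguments.

```sh
# Hypothetical helper (not part of this repo): fail early if any of the
# named environment variables is unset or empty.
require_env() {
    status=0
    for var in "$@"; do
        # Look up the variable whose name is stored in $var (POSIX sh has
        # no ${!var}, so eval is used); ":-" keeps this safe under "set -u".
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "required environment variable not set: $var" >&2
            status=1
        fi
    done
    return "$status"
}
```

A typical use is to run `require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY` before the `nextstrain build` upload command and stop if it returns a non-zero status.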

Input data

GenBank data

GenBank sequences and metadata are fetched via NCBI datasets.

`ingest/vendored`

This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest.

See vendored/README.md for instructions on how to update the vendored scripts.
