TOSTADAS → Toolkit for Open Sequence Triage, Annotation and DAtabase Submission 🧬 💻

PATHOGEN ANNOTATION AND SUBMISSION PIPELINE

Nextflow | run with conda | run with docker | run with singularity

Overview

The Pathogen Annotation and Submission pipeline facilitates the running of several Python scripts, which validate metadata (QC), annotate assembled genomes, and submit to NCBI. The current implementation was tested using MPOX, but future testing will seek to make the pipeline pathogen-agnostic.

Pipeline Summary

Metadata Validation

The validation workflow checks whether the metadata conforms to NCBI standards and matches the input fasta file. The script also splits a multi-sample xlsx file into a separate .tsv file for each individual sample.
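As a rough illustration of the matching and splitting described above, here is a minimal Python sketch. It is not the pipeline's validation script: it takes metadata rows that are already in memory instead of reading an .xlsx file, it does not check NCBI field rules, and the column name sample_name and the sample values are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first word after '>') in a fasta string."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def split_metadata(rows, fasta_text, out_dir, id_column="sample_name"):
    """Check every metadata row has a fasta record, then write one .tsv per sample."""
    ids = fasta_ids(fasta_text)
    missing = sorted(row[id_column] for row in rows if row[id_column] not in ids)
    if missing:
        raise ValueError(f"metadata samples missing from fasta: {missing}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        path = out_dir / f"{row[id_column]}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(row), delimiter="\t")
            writer.writeheader()
            writer.writerow(row)
        written.append(path)
    return written


# Hypothetical two-sample input
rows = [
    {"sample_name": "sampleA", "collection_date": "2022-07-01"},
    {"sample_name": "sampleB", "collection_date": "2022-07-02"},
]
fasta_text = ">sampleA\nACGT\n>sampleB\nGGCC\n"

with tempfile.TemporaryDirectory() as tmp:
    written = split_metadata(rows, fasta_text, tmp)
    print([path.name for path in written])  # ['sampleA.tsv', 'sampleB.tsv']
```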

Liftoff

The liftoff workflow annotates input fasta-formatted genomes and produces accompanying gff and genbank tbl files. The inputs are the reference genome fasta, the reference gff, and your multi-sample fasta and metadata in .xlsx format. The workflow integrates the Liftoff tool, which is responsible for accurately mapping annotations onto assembled genomes.

Submission

The submission workflow generates the necessary files for Genbank submission, generates a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from the SeqSender public database submission pipeline.

Setup

Environment Setup

The environment setup needs to occur within a terminal, or can optionally be handled by the Nextflow pipeline according to the conda block of the nextflow.config file.

  • Note: With mamba and nextflow installed, running nextflow will create the environment from the provided environment.yml.
  • If you prefer to create a personalized environment, you can do so as long as its name matches the environment name given in the environment.yml file.

(1) First install mamba:

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

(2) Add mamba to PATH:

export PATH="$HOME/mambaforge/bin:$PATH"

(3) Now you can create the conda environment and install the dependencies set in your environment.yml:

mamba create -n tostadas -f environment.yml   

(4) After the environment is created activate the environment. Always make sure to activate the environment with each new session.

source activate tostadas

(5) To check which environment is active, run conda env list. The active environment is marked with an asterisk (*).

(6) The final piece to the environment set up is to install nextflow (optionally with conda):

  • First make sure your path is set correctly and you are active in your tostadas environment. Then run the following command to install nextflow with Conda:
mamba install -c bioconda nextflow

For help with installing nextflow, see the Nextflow documentation listed under Helpful Links below.

Repository Setup

To clone the code from the repo to your local machine:

git clone https://github.com/CDCgov/tostadas.git

If the following applies to you:

Then, follow the cloning instructions outlined here: cdc_configs_access

Quick Start

The configs are set up to run the default params with the test option.

(1) Ensure nextflow was installed successfully by running nextflow -v

* Version of nextflow should be >=22.10.0

(2) Check that you are in the project directory (Tostadas).

This is the default directory set in the nextflow.config file to allow for running the nextflow pipeline with the provided test input files.

(3) Change the submission_config parameter within test_params.config to the location of your personal submission config file.

(4) Run the following nextflow command to execute the scripts with default parameters and with local run environment:

nextflow run main.nf -profile test,conda

The outputs of the pipeline will appear in the "nf_test_results" folder within the project directory (update this in the standard params set for a different output path).
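Step (1) above asks for Nextflow >=22.10.0. The following Python sketch shows one way to compare a version string against that minimum. The example strings only imitate the style of nextflow -v output and are assumptions; the function names are hypothetical and not part of the pipeline.

```python
import re

MIN_NEXTFLOW = (22, 10, 0)  # minimum version stated in step (1)


def parse_version(text):
    """Extract the first X.Y.Z version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output, minimum=MIN_NEXTFLOW):
    """Return True if the reported version is at least the required minimum."""
    return parse_version(version_output) >= minimum


print(meets_minimum("nextflow version 22.10.0.5826"))  # True
print(meets_minimum("nextflow version 21.04.3.5560"))  # False
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "22.9.0" would sort after "22.10.0".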

Running the Pipeline

How to Run:

The typical command to run the pipeline with your custom parameters saved in standard_params.config (more information about profiles and parameter sets below) and the conda environment you created is as follows:

nextflow run main.nf -profile standard,conda

OR with individual parameters passed on the command line (parameters stored in .json/.yaml files are instead passed with -params-file, described under Customizing Parameters):

nextflow run main.nf -profile standard,conda --<param name> <param value>

Other options for the run environment include docker and singularity. These options can be used simply by replacing the second profile option:

nextflow run main.nf -profile standard,<docker or singularity>

Any of the above commands will launch the nextflow pipeline and show the progress of each subworkflow:process and check. The output will look similar to the following, depending on the entrypoint specified.

N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [festering_spence] DSL2 - revision: 3441f714f2
executor >  local (7)
[e5/9dbcbc] process > VALIDATE_PARAMS                                  [100%] 1 of 1 ✔
[53/a833be] process > CLEANUP_FILES                                    [100%] 1 of 1 ✔
[e4/a50c97] process > with_submission:METADATA_VALIDATION (1)          [100%] 1 of 1 ✔
[81/badd3b] process > with_submission:LIFTOFF (1)                      [100%] 1 of 1 ✔
[d7/16d16a] process > with_submission:RUN_SUBMISSION:SUBMISSION (1)    [100%] 1 of 1 ✔
[3c/8c7ba4] process > with_submission:RUN_SUBMISSION:GET_WAIT_TIME (1) [100%] 1 of 1 ✔
[13/85f6f3] process > with_submission:RUN_SUBMISSION:WAIT (1)          [  0%] 0 of 1
[-        ] process > with_submission:RUN_SUBMISSION:UPDATE_SUBMISSION -
USING CONDA

** NOTE: The default wait time between the initial submission and updating the submitted samples is three minutes (180 seconds) per sample. To override this default calculation, set the submission_wait_time parameter (in seconds) within your config or on the command line:

nextflow run main.nf -profile <param set>,<env> --submission_wait_time 360
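The default calculation described in the note (3 * 60 seconds * number of samples) can be written out as a small Python sketch. The function name and the override argument are hypothetical and only mirror the behaviour described here, not the pipeline's own GET_WAIT_TIME process.

```python
SECONDS_PER_SAMPLE = 3 * 60  # three minutes per sample, as stated in the note


def submission_wait_time(sample_count, override_seconds=None):
    """Return the wait in seconds: the override if given, else 180 s per sample."""
    if override_seconds is not None:
        return int(override_seconds)
    return SECONDS_PER_SAMPLE * sample_count


print(submission_wait_time(4))                        # 720
print(submission_wait_time(4, override_seconds=360))  # 360
```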

Outputs will be generated in the nf_test_results folder (if running the test parameter set) unless otherwise specified in your standard_params.config file as output_dir param.

Profile Options & Input Files

This section walks through the available parameters to customize your workflow.

Input Files Required:

(A) This table lists the required files to run metadata validation and liftoff annotation:

| Input files | File type | Description |
|---|---|---|
| fasta | .fasta | Multi-sample fasta file with your input sequences |
| metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
| ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
| ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |

(B) This table lists the required files to run with submission:

| Input files | File type | Description |
|---|---|---|
| fasta | .fasta | Multi-sample fasta file with your input sequences |
| metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
| ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
| ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |
| submission_config | .yaml | Configuration file for submitting to NCBI; sample versions can be found in the repo |

Customizing Parameters:

The standard_params.config file found within the conf directory is where parameters can be adjusted based on preference for running the pipeline. First you will want to ensure the file paths are correctly set for the params listed above depending on your preference for submitting your results.

  • Adjust your file inputs within standard_params.config ensuring accurate file paths for the inputs listed above.
  • The params can be changed within the standard_params.config or you can change the standard.yml/standard.json file inside the nf_params directory and pass it in with: -params-file <standard_params.yml or standard_params.json>
  • Note: DO NOT EDIT the main.nf file or other paths in the nextflow.config unless familiar with editing nextflow workflows
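As one possible way to prepare a file for -params-file, the Python sketch below writes a JSON parameter file. The parameter names are taken from the Parameters tables in this README; every path and value is a hypothetical placeholder, and the helper name is not part of the pipeline.

```python
import json
import tempfile
from pathlib import Path


def write_params_file(path, params):
    """Write the given parameters as JSON, a format Nextflow's -params-file accepts."""
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


# Hypothetical values; replace them with your own paths.
params = {
    "fasta_path": "input_files/my_samples.fasta",
    "ref_fasta_path": "input_files/my_reference.fasta",
    "meta_path": "input_files/my_metadata.xlsx",
    "ref_gff_path": "input_files/my_reference.gff",
    "output_dir": "my_results",
    "run_submission": False,
}

with tempfile.TemporaryDirectory() as tmp:
    params_file = write_params_file(Path(tmp) / "my_params.json", params)
    print(params_file.read_text())
```

The resulting file would then be passed as: nextflow run main.nf -profile standard,conda -params-file my_params.json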

Understanding Profiles and Environments:

The -profile option is a required input to the nextflow pipeline. The available parameter-set profiles are test and standard, both listed in the nextflow.config file. The test params should remain unchanged for testing purposes, but the standard profile can be changed to fit user preferences. The second profile input selects the run environment: conda, docker, or singularity. Nextflow expects at least one option for both of these configurations to be passed in: -profile <test/standard>,<conda/docker/singularity>
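To make the two-part -profile value concrete, here is a small Python sketch that checks a profile string for at least one parameter set and at least one run environment. It is only an illustration of the rule stated above; the pipeline's own parameter validation may work differently, and the function name is hypothetical.

```python
PARAM_SETS = {"test", "standard"}
RUN_ENVS = {"conda", "docker", "singularity"}


def check_profile(profile):
    """Split a -profile value and confirm it names a param set and a run environment."""
    parts = [part.strip() for part in profile.split(",") if part.strip()]
    unknown = [part for part in parts if part not in PARAM_SETS | RUN_ENVS]
    if unknown:
        raise ValueError(f"unknown profile option(s): {unknown}")
    param_sets = [part for part in parts if part in PARAM_SETS]
    run_envs = [part for part in parts if part in RUN_ENVS]
    if not param_sets or not run_envs:
        raise ValueError("need <test/standard>,<conda/docker/singularity>")
    return param_sets, run_envs


print(check_profile("test,conda"))  # (['test'], ['conda'])
```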

Toggling Submission:

Now that your file paths are set within your standard.yml, standard.json, or standard_params.config file, you will want to define whether to run the full pipeline with or without submission. This is defined within the standard_params.config file, underneath the subworkflow section, as the run_submission parameter: run_submission = true/false

  • Apart from this main bifurcation, there are entrypoints that you can use to access specific processes. More information is listed in the table below.

More Information on Submission:

The submission piece of the pipeline uses processes that are directly integrated from the SeqSender public database submission pipeline. It lets the user create a config file to select which databases to upload to, and it supports any metadata fields by using a YAML file to pair the database's metadata fields with your personal metadata columns. The requirements for this portion of the pipeline to run are listed below.

(A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS:

  • NCBI: If uploading to NCBI, an account is required along with a center account approved for submitting via FTP. Contact the following for account creation: [email protected].
  • GISAID: A GISAID account is required for submission to GISAID; you can register for an account at https://www.gisaid.org/. Test submissions are required before a final submission can be made. When your first test submission is complete, contact GISAID at [email protected] to receive a personal CID. GISAID support is not yet implemented, but it may be added in the future.

(B) Config File Set-up:

  • The template for the submission config file can be found in bin/default_config_files within the repo. This is where you can edit the various parameters you want to include in your submission.

Entrypoints:

Table of entrypoints available for the nextflow pipeline:

| Workflow | Description |
|---|---|
| only_validate_params | Validates parameters utilizing the validate params process within the utility sub-workflow |
| only_cleanup_files | Cleans up files utilizing the clean-up process within the utility sub-workflow |
| only_validation | Runs the metadata validation process only |
| only_liftoff | Runs the liftoff annotation process only |
| only_submission | Runs submission sub-workflow only |
| only_initial_submission | Runs the initial submission process but not follow-up within the submission sub-workflow |
| only_update_submission | Updates NCBI submissions |

The following command can be used to specify entrypoints for the workflow:

nextflow run main.nf -profile <param set>,<env> -entry <insert option from table above>
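If the pipeline is launched from a script, the command above could be assembled as in the following Python sketch. The entrypoint names come from the table above; the helper function is hypothetical and only builds the argument list, it does not run nextflow.

```python
import shlex

ENTRYPOINTS = {
    "only_validate_params",
    "only_cleanup_files",
    "only_validation",
    "only_liftoff",
    "only_submission",
    "only_initial_submission",
    "only_update_submission",
}


def build_command(param_set, env, entry=None):
    """Return the nextflow argument list, adding -entry only for a known entrypoint."""
    command = ["nextflow", "run", "main.nf", "-profile", f"{param_set},{env}"]
    if entry is not None:
        if entry not in ENTRYPOINTS:
            raise ValueError(f"unknown entrypoint: {entry}")
        command += ["-entry", entry]
    return command


print(shlex.join(build_command("test", "conda", "only_validation")))
# nextflow run main.nf -profile test,conda -entry only_validation
```

The returned list can be handed to subprocess.run without shell quoting concerns.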

Outputs

The following section walks through the outputs from the pipeline.

Pipeline Overview:

The workflow will generate outputs in the following order:

  • Validation
    • Responsible for QC of metadata
    • Aligns sample metadata .xlsx to sample .fasta
    • Formats metadata into .tsv format
  • Annotation
    • Extracts features from .gff
    • Aligns features
    • Annotates sample genomes outputting .gff
  • Submission
    • Formats for database submission
    • This section runs twice, with the second run occurring after a wait time to allow for all samples to be uploaded to NCBI. Entrypoint only_update_submission can be run as many times as necessary until all files are fully uploaded.

Output Directory Formatting:

The outputs are recorded in the directory specified within the nextflow.config file and will contain the following:

  • validation_outputs (**name configurable with val_output_dir)
    • sample_metadata_run
      • errors
      • tsv_per_sample
  • liftoff_outputs (**name configurable with final_liftoff_output_dir)
    • final_sample_metadata_file
      • errors
      • fasta
      • liftoff
      • tbl
  • submission_outputs (**name and path configurable with submission_output_dir)
    • individual_sample_batch_info
      • biosample_sra
      • genbank
      • accessions.csv
    • terminal_outputs
    • commands_used
  • liftoffCommand.txt
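After a run, the top-level layout listed above can be checked with a short Python sketch like the one below. It assumes the default folder names shown in the list (they are configurable through val_output_dir, final_liftoff_output_dir and submission_output_dir), and submission_outputs would only be expected when submission was run. The function name is hypothetical.

```python
import tempfile
from pathlib import Path

# Default top-level folder names from the list above (assumed unchanged)
EXPECTED_DIRS = ("validation_outputs", "liftoff_outputs", "submission_outputs")


def missing_output_dirs(output_dir, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Demonstration on a temporary directory standing in for the output directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "validation_outputs").mkdir()
    (Path(tmp) / "liftoff_outputs").mkdir()
    print(missing_output_dirs(tmp))  # ['submission_outputs']
```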

Understanding Pipeline Outputs:

The pipeline outputs include:

  • metadata.tsv files for each sample
  • separate fasta files for each sample
  • separate gff files for each sample
  • separate tbl files containing feature information for each sample
  • submission log file
    • This output is found in the submission_outputs folder in your specified output directory
    • If the file cannot be found, you can run the only_update_submission entrypoint for the pipeline

Parameters:

Default parameters are given in the nextflow.config file. The tables below list the parameters that can be set to a value, a path, or true/false. When changing these parameters, pay attention to the required inputs and make sure that paths line up and values are within range. To change a parameter, either pass it as a flag after the nextflow command or edit it within your standard_params.config or standard.yaml file.

  • Please note the correct formatting and the default calculation of submission_wait_time at the bottom of the params table.

Input Files

| Param | Description | Input Required |
|---|---|---|
| --fasta_path | Path to fasta file | Yes (path as string) |
| --ref_fasta_path | Reference Sequence file path | Yes (path as string) |
| --meta_path | Meta-data file path for samples | Yes (path as string) |
| --ref_gff_path | Reference gff file path for annotation | Yes (path as string) |
| --env_yml | Path to environment.yml file | Yes (path as string) |

Run Environment

| Param | Description | Input Required |
|---|---|---|
| --scicomp | Flag for whether running on Scicomp or not | Yes (true/false as bool) |
| --docker_container | Name of the Docker container | Yes, if running with docker profile (name as string) |

General Subworkflow

| Param | Description | Input Required |
|---|---|---|
| --run_submission | Toggle for running submission | Yes (true/false as bool) |
| --cleanup | Toggle for running cleanup subworkflows | Yes (true/false as bool) |

Cleanup Subworkflow

| Param | Description | Input Required |
|---|---|---|
| --clear_nextflow_log | Clears nextflow work log | Yes (true/false as bool) |
| --clear_nextflow_dir | Clears nextflow working directory | Yes (true/false as bool) |
| --clear_work_dir | Param to clear work directory created during workflow | Yes (true/false as bool) |
| --clear_conda_env | Clears conda environment | Yes (true/false as bool) |
| --clear_nf_results | Remove results from nextflow outputs | Yes (true/false as bool) |

General Output

| Param | Description | Input Required |
|---|---|---|
| --output_dir | File path to submit outputs from pipeline | Yes (path as string) |
| --overwrite_output | Toggle to overwriting output files in directory | Yes (true/false as bool) |

Metadata Validation

| Param | Description | Input Required |
|---|---|---|
| --val_output_dir | File path for outputs specific to validate sub-workflow | Yes (folder name as string) |
| --val_date_format_flag | Flag to change date output | Yes (-s, -o, or -v as string) |
| --val_keep_pi | Flag to keep personal identifying info, if provided otherwise it will return an error | Yes (true/false as bool) |

Liftoff

| Param | Description | Input Required |
|---|---|---|
| --final_liftoff_output_dir | File path to liftoff specific sub-workflow outputs | Yes (folder name as string) |
| --lift_print_version_exit | Print version and exit the program | Yes (true/false as bool) |
| --lift_print_help_exit | Print help and exit the program | Yes (true/false as bool) |
| --lift_parallel_processes | # of parallel processes to use for liftoff | Yes (integer) |
| --lift_delete_temp_files | Deletes the temporary files after finishing transfer | Yes (true/false as string) |
| --lift_child_feature_align_threshold | Designate a feature mapped only if its child features (usually exons/CDS) align with sequence identity ≥ S | |
| --lift_unmapped_feature_file_name | Name of unmapped features file name | Yes (path as string) |
| --lift_copy_threshold | Minimum sequence identity in exons/CDS for which a gene is considered a copy; must be greater than -s; default is 1.0 | Yes (float) |
| --lift_distance_scaling_factor | Distance scaling factor; by default D=2.0 | Yes (float) |
| --lift_flank | Amount of flanking sequence to align as a fraction of gene length | Yes (float between [0.0-1.0]) |
| --lift_overlap | Maximum fraction of overlap allowed by 2 features | Yes (float between [0.0-1.0]) |
| --lift_mismatch | Mismatch penalty in exons when finding best mapping; by default M=2 | Yes (integer) |
| --lift_gap_open | Gap open penalty in exons when finding best mapping; by default GO=2 | Yes (integer) |
| --lift_gap_extend | Gap extend penalty in exons when finding best mapping; by default GE=1 | Yes (integer) |
| --lift_infer_transcripts | Use if annotation file only includes exon/CDS features and does not include transcripts/mRNA | Yes (True/False as string) |
| --lift_copies | Look for extra gene copies in the target genome | Yes (True/False as string) |
| --lift_minimap_path | Path to minimap if you did not use conda or pip | Yes (N/A or path as string) |
| --lift_feature_database_name | Name of the feature database, if none, then will use ref gff path to construct one | Yes (N/A or name as string) |

Submission

| Param | Description | Input Required |
|---|---|---|
| --submission_output_dir | Either name or relative/absolute path for the outputs from submission | Yes (name or path as string) |
| --submission_prod_or_test | Whether to submit samples for test or actual production | Yes (prod or test as string) |
| --submission_only_meta | Full path directly to the dirs containing validate metadata files | Yes (path as string) |
| --submission_only_gff | Full path directly to the directory with reformatted GFFs | Yes (path as string) |
| --submission_only_fasta | Full path directly to the directory with split fastas for each sample | Yes (path as string) |
| --submission_config | Configuration file for submission to public repos | Yes (path as string) |
| --submission_wait_time | Calculated based on sample number (3 * 60 secs * sample_num) | integer (seconds) |
| --batch_name | Name of the batch to prefix samples with during submission | Yes (name as string) |
| --send_submission_email | Toggle email notification on/off | Yes (true/false as bool) |
| --req_col_config | Path to the required_columns.yaml file | Yes (path as string) |
| --processed_samples | Path to the directory containing processed samples for update only submission entrypoint (containing <batch_name>.<sample_name> dirs) | Yes (path as string) |

** Important note about send_submission_email: An email is only triggered if Genbank is being submitted to AND table2asn is the genbank_submission_type. The recipient must be specified within your submission config file under 'general' as 'notif_email_recipient'.

Helpful Links for Resources and Software Integrated with TOSTADAS:

🔗 Anaconda Install: https://docs.anaconda.com/anaconda/install/

🔗 Nextflow Documentation: https://www.nextflow.io/docs/latest/getstarted.html

🔗 SeqSender Documentation: https://github.com/CDCgov/seqsender

🔗 Liftoff Documentation: https://github.com/agshumate/Liftoff

🔗 VADR Documentation: https://github.com/ncbi/vadr.git

🔗 table2asn Documentation: https://github.com/svn2github/NCBI_toolkit/blob/master/src/app/table2asn/table2asn.cpp

Acknowledgements

Michael Desch | Ethan Hetrick | Nick Johnson | Kristen Knipe | Shatavia Morrison
Yuanyuan Wang | Michael Weigand | Dhwani Batra | Jason Caravas | Ankush Gupta
Kyle O'Connell | Yesh Kulasekarapandian | Cole Tindall | Lynsey Kovar | Hunter Seabolt
Crystal Gigante | Christina Hutson | Brent Jenkins | Yu Li | Ana Litvintseva
Matt Mauldin | Dakota Howard | Ben Rambo-Martin | James Heuser | Justin Lee | Mili Sheth