TOSTADAS → Toolkit for Open Sequence Triage, Annotation and DAtabase Submission 🧬 💻

PATHOGEN ANNOTATION AND SUBMISSION PIPELINE

Overview

The Pathogen Annotation and Submission pipeline runs several Python scripts that validate metadata (QC), annotate assembled genomes, and submit the results to NCBI. The current implementation was tested using MPOX, but future testing will seek to make the pipeline pathogen-agnostic.

Pipeline Summary

Metadata Validation

The validation workflow checks whether the metadata conforms to NCBI standards and matches the input fasta file. The script also splits a multi-sample xlsx file into a separate .tsv file for each sample.
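
As a rough illustration of the two checks described above, the sketch below compares sample names in a metadata table against FASTA headers and writes one .tsv per sample. It is not the pipeline's validation script: the `sample_name` column, the helper names, and the in-memory rows (standing in for the parsed .xlsx sheet) are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def fasta_ids(fasta_text):
    """Return the first word of every non-empty '>' header line in a FASTA string."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def check_and_split(rows, fasta_text, out_dir, id_column="sample_name"):
    """Verify metadata and FASTA sample names agree, then write one TSV per sample.

    `rows` is a list of dicts standing in for the parsed .xlsx sheet;
    `id_column` is a hypothetical column name.
    """
    meta_ids = {row[id_column] for row in rows}
    seq_ids = set(fasta_ids(fasta_text))
    if meta_ids != seq_ids:
        raise ValueError(f"metadata/FASTA mismatch: {sorted(meta_ids ^ seq_ids)}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        path = out_dir / f"{row[id_column]}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(row), delimiter="\t")
            writer.writeheader()
            writer.writerow(row)
        written.append(path)
    return written


# Hypothetical two-sample input.
rows = [
    {"sample_name": "sampleA", "collection_date": "2022-07-01"},
    {"sample_name": "sampleB", "collection_date": "2022-07-02"},
]
fasta = ">sampleA\nACGT\n>sampleB\nTTGA\n"
tsv_files = check_and_split(rows, fasta, tempfile.mkdtemp())
print([path.name for path in tsv_files])
```

A mismatch between the two inputs raises a `ValueError` before any file is written, which mirrors the idea of failing early on inconsistent metadata.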

Liftoff

The liftoff workflow annotates input fasta-formatted genomes and produces accompanying gff and genbank tbl files. The inputs are the reference genome fasta, the reference gff, and your multi-sample fasta and metadata in .xlsx format. The workflow integrates the Liftoff tool, which maps annotations from the reference onto the assembled genomes.

Submission

The submission workflow generates the necessary files for Genbank submission, generates a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from the SeqSender public database submission pipeline.

Setup

Environment Setup

The environment setup needs to occur within a terminal, or can optionally be handled by the Nextflow pipeline according to the conda block of the nextflow.config file.

Note: With mamba and nextflow installed, running nextflow creates the environment from the provided environment.yml.
If you prefer to create the environment yourself, you can do so as long as its name matches the environment name given in the environment.yml file.

(1) First install mamba:

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

(2) Add mamba to PATH:

export PATH="$HOME/mambaforge/bin:$PATH"

(3) Now you can create the conda environment and install the dependencies set in your environment.yml:

mamba create -n tostadas -f environment.yml

(4) After the environment is created, activate it. Make sure to activate the environment in each new session.

source activate tostadas

(5) To check which environment is active, run `conda env list`; the active environment is marked with an asterisk (*).

(6) The final piece to the environment set up is to install nextflow (optionally with conda):

First make sure your path is set correctly and your tostadas environment is active. Then run the following command to install nextflow with Conda:

mamba install -c bioconda nextflow

For help with installing nextflow, see the Nextflow Documentation link under Helpful Links.

Repository Setup

To clone the code from the repo to your local machine:

git clone https://github.com/CDCgov/tostadas.git

If the following applies to you:

- CDC user with access to the Monkeypox group on Gitlab (https://git.biotech.cdc.gov/monkeypox)
- Require access to available submission config files

Then, follow the cloning instructions outlined here: cdc_configs_access

Quick Start

The configs are set up to run the default params with the test option.

(1) Ensure nextflow was installed successfully by running `nextflow -v`

* Version of nextflow should be >=22.10.0

(2) Check that you are in the project directory (Tostadas).

This is the default directory set in the nextflow.config file to allow for running the nextflow pipeline with the provided test input files.

(3) Change the `submission_config` parameter within `test_params.config` to the location of your personal submission config file.

(4) Run the following nextflow command to execute the scripts with default parameters and with local run environment:

nextflow run main.nf -profile test,conda

The outputs of the pipeline will appear in the "nf_test_results" folder within the project directory (update this in the standard params set for a different output path).
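
The version requirement from step (1) can also be checked programmatically. The sketch below parses a version string and compares it against 22.10.0; the sample strings assume the version command prints something like `nextflow version 22.10.0.5826`, and both helper names are hypothetical.

```python
import re


def parse_nextflow_version(text):
    """Extract (major, minor, patch) from text such as 'nextflow version 22.10.0.5826'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=(22, 10, 0)):
    """Return True when the reported version is at least `minimum`."""
    return parse_nextflow_version(text) >= minimum


# Assumed output formats, used only for illustration.
print(meets_minimum("nextflow version 22.10.0.5826"))  # True
print(meets_minimum("nextflow version 21.04.3.5560"))  # False
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "22.9.0" would sort after "22.10.0".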

Running the Pipeline

How to Run:

The typical command to run the pipeline, using the custom parameters saved in standard_params.config (more information about profiles and parameter sets below) and the conda environment you created, is:

nextflow run main.nf -profile standard,conda

OR, overriding individual parameters on the command line (a .json/.yaml parameter file can instead be passed with -params-file, as described under Customizing Parameters):

nextflow run main.nf -profile standard,conda --<param name> <param value>

Other options for the run environment include docker and singularity. These options can be used simply by replacing the second profile option:

nextflow run main.nf -profile standard,<docker or singularity>

Any of the above commands will launch the nextflow pipeline and show the progress of each subworkflow:process and check. Depending on the entrypoint specified, the output will look similar to the following:

N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [festering_spence] DSL2 - revision: 3441f714f2
executor >  local (7)
[e5/9dbcbc] process > VALIDATE_PARAMS                                  [100%] 1 of 1 ✔
[53/a833be] process > CLEANUP_FILES                                    [100%] 1 of 1 ✔
[e4/a50c97] process > with_submission:METADATA_VALIDATION (1)          [100%] 1 of 1 ✔
[81/badd3b] process > with_submission:LIFTOFF (1)                      [100%] 1 of 1 ✔
[d7/16d16a] process > with_submission:RUN_SUBMISSION:SUBMISSION (1)    [100%] 1 of 1 ✔
[3c/8c7ba4] process > with_submission:RUN_SUBMISSION:GET_WAIT_TIME (1) [100%] 1 of 1 ✔
[13/85f6f3] process > with_submission:RUN_SUBMISSION:WAIT (1)          [  0%] 0 of 1
[-        ] process > with_submission:RUN_SUBMISSION:UPDATE_SUBMISSION -
USING CONDA

** NOTE: The default wait time between the initial submission and updating the submitted samples is three minutes (180 seconds) per sample. To override this default calculation, set the submission_wait_time parameter (in seconds) within your config or on the command line:

nextflow run main.nf -profile <param set>,<env> --submission_wait_time 360
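
The default calculation described in the note above amounts to a single multiplication. A minimal sketch follows; the function name and the `override_seconds` argument (standing in for a user-supplied submission_wait_time) are hypothetical, and the pipeline's own implementation may differ.

```python
def submission_wait_time(sample_num, override_seconds=None):
    """Seconds to wait between the initial submission and the update step.

    The default follows the documented rule of 3 minutes (180 s) per sample.
    `override_seconds` stands in for a user-supplied submission_wait_time.
    """
    if sample_num < 0:
        raise ValueError("sample_num must be non-negative")
    if override_seconds is not None:
        return int(override_seconds)
    return 3 * 60 * sample_num


print(submission_wait_time(5))       # 900
print(submission_wait_time(5, 360))  # 360
```

For a batch of five samples the default wait is therefore 900 seconds (15 minutes), unless an explicit value such as 360 is passed.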

Outputs will be generated in the nf_test_results folder (if running the test parameter set) unless otherwise specified in your standard_params.config file as output_dir param.

Profile Options & Input Files

This section walks through the available parameters to customize your workflow.

Input Files Required:

(A) This table lists the required files to run metadata validation and liftoff annotation:

Input files	File type	Description
fasta	.fasta	Multi-sample fasta file with your input sequences
metadata	.xlsx	Multi-sample metadata matching metadata spreadsheets provided in input_files
ref_fasta	.fasta	Reference genome to use for the liftoff_submission branch of the pipeline
ref_gff	.gff	Reference GFF3 file to use for the liftoff_submission branch of the pipeline

(B) This table lists the required files to run with submission:

Input files	File type	Description
fasta	.fasta	Multi-sample fasta file with your input sequences
metadata	.xlsx	Multi-sample metadata matching metadata spreadsheets provided in input_files
ref_fasta	.fasta	Reference genome to use for the liftoff_submission branch of the pipeline
ref_gff	.gff	Reference GFF3 file to use for the liftoff_submission branch of the pipeline
submission_config	.yaml	configuration file for submitting to NCBI, sample versions can be found in repo

Customizing Parameters:

The standard_params.config file in the conf directory is where parameters can be adjusted for running the pipeline. First, make sure the file paths for the params listed above are set correctly, depending on whether you are submitting your results.

- Adjust your file inputs within standard_params.config, ensuring accurate file paths for the inputs listed above.
- The params can be changed within standard_params.config, or you can change the standard.yml/standard.json file inside the nf_params directory and pass it in with: -params-file <standard_params.yml or standard_params.json>
- Note: DO NOT EDIT the main.nf file or other paths in the nextflow.config unless you are familiar with editing nextflow workflows
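
To make the -params-file route concrete, the sketch below writes a small JSON parameter file and prints the command that would consume it. The parameter names come from the Parameters tables in this README; every path, the file name `my_params.json`, and the chosen values are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path

# Parameter names are taken from the Parameters tables; all values are placeholders.
params = {
    "fasta_path": "input_files/my_samples.fasta",
    "ref_fasta_path": "input_files/my_reference.fasta",
    "meta_path": "input_files/my_metadata.xlsx",
    "ref_gff_path": "input_files/my_reference.gff",
    "output_dir": "my_results",
    "run_submission": False,
}

# Write the file to a throwaway directory for the purposes of the example.
params_file = Path(tempfile.mkdtemp()) / "my_params.json"
params_file.write_text(json.dumps(params, indent=2))

command = f"nextflow run main.nf -profile standard,conda -params-file {params_file}"
print(command)
```

Keeping parameters in a file rather than on the command line makes a run easier to repeat and to share with collaborators.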

Understanding Profiles and Environments:

The -profile option is required when running the pipeline, and it takes two comma-separated values. The first selects the parameter set, either test or standard; both are listed in the nextflow.config file. The test params should remain unchanged for testing purposes, but the standard profile can be changed to fit user preferences. The second selects the run environment: conda, docker, or singularity. Nextflow expects at least one option for each of these to be passed in: -profile <test/standard>,<conda/docker/singularity>
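
The rule above (one parameter set plus one run environment) can be expressed as a small check. This is only a sketch: the function name is hypothetical, and requiring exactly one of each is a stricter reading than "at least one", chosen here to keep the example simple.

```python
PARAM_SETS = {"test", "standard"}
RUN_ENVS = {"conda", "docker", "singularity"}


def check_profile(profile):
    """Split a -profile value and return (param_set, run_env), or raise ValueError."""
    names = [name.strip() for name in profile.split(",") if name.strip()]
    unknown = [name for name in names if name not in PARAM_SETS | RUN_ENVS]
    if unknown:
        raise ValueError(f"unknown profile name(s): {unknown}")
    param_sets = [name for name in names if name in PARAM_SETS]
    run_envs = [name for name in names if name in RUN_ENVS]
    if len(param_sets) != 1 or len(run_envs) != 1:
        raise ValueError("expected one of test/standard and one of conda/docker/singularity")
    return param_sets[0], run_envs[0]


print(check_profile("test,conda"))            # ('test', 'conda')
print(check_profile("singularity,standard"))  # ('standard', 'singularity')
```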

Toggling Submission:

Now that your file paths are set within your standard.yml, standard.json, or standard_params.config file, define whether to run the full pipeline with or without submission. This is set within the standard_params.config file, underneath the subworkflow section, as run_submission = true/false

Apart from this main bifurcation, there are entrypoints that you can use to access specific processes. More information is listed in the table under Entrypoints.

More Information on Submission:

The submission piece of the pipeline uses processes that are directly integrated from the SeqSender public database submission pipeline. It lets the user create a config file to select which databases to upload to, and supports arbitrary metadata fields by using a YAML file to pair each database's metadata fields with your personal metadata columns. The requirements for this portion of the pipeline are listed below.

(A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS:

NCBI: If uploading to NCBI, an account is required along with a center account approved for submitting via FTP. Contact the following for account creation: gb-admin@ncbi.nlm.nih.gov.
GISAID: A GISAID account is required for submission to GISAID; you can register for an account at https://www.gisaid.org/. Test submissions are required before a final submission can be made. When your first test submission is complete, contact GISAID at hcov-19@gisaid.org to receive a personal CID. Note that GISAID support is not yet implemented in this pipeline, but it may be added in the future.

(B) Config File Set-up:

The template for the submission config file can be found in bin/default_config_files within the repo. This is where you can edit the various parameters you want to include in your submission.
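
The idea of pairing database metadata fields with your own column names can be pictured as a simple renaming step. The sketch below is illustrative only: the field names, column names, and `pair_fields` helper are hypothetical, and a plain dict stands in for the mapping that would normally be read from the YAML config.

```python
def pair_fields(record, field_map):
    """Rename personal metadata columns to database field names.

    `field_map` maps each database field to the personal column that feeds it.
    """
    missing = [column for column in field_map.values() if column not in record]
    if missing:
        raise KeyError(f"metadata is missing column(s): {missing}")
    return {db_field: record[column] for db_field, column in field_map.items()}


# Hypothetical mapping and metadata row.
field_map = {"organism": "species_name", "collection_date": "date_collected"}
record = {"species_name": "Monkeypox virus", "date_collected": "2022-07-01", "lab_note": "n/a"}

print(pair_fields(record, field_map))
```

Columns that are not named in the mapping (here `lab_note`) are simply left out of the submission record.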

Entrypoints:

Table of entrypoints available for the nextflow pipeline:

Workflow	Description
only_validate_params	Validates parameters utilizing the validate params process within the utility sub-workflow
only_cleanup_files	Cleans-up files utilizing the clean-up process within the utility sub-workflow
only_validation	Runs the metadata validation process only
only_liftoff	Runs the liftoff annotation process only
only_submission	Runs submission sub-workflow only
only_initial_submission	Runs the initial submission process but not follow-up within the submission sub-workflow
only_update_submission	Updates NCBI submissions

Documentation for using entrypoints with NF can be found at Nextflow_Entrypoints under section 5.

The following command can be used to specify entrypoints for the workflow:

nextflow run main.nf -profile <param set>,<env> -entry <insert option from table above>
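
When the pipeline is launched from a script, assembling the command from validated pieces can catch a mistyped entrypoint before Nextflow starts. The sketch below uses the entrypoint names from the table above; the `build_command` helper itself is hypothetical.

```python
import shlex

# Entrypoint names copied from the table above.
ENTRYPOINTS = (
    "only_validate_params",
    "only_cleanup_files",
    "only_validation",
    "only_liftoff",
    "only_submission",
    "only_initial_submission",
    "only_update_submission",
)


def build_command(param_set, env, entry=None, **params):
    """Return the nextflow argument list for a profile, optional entrypoint, and params."""
    if entry is not None and entry not in ENTRYPOINTS:
        raise ValueError(f"unknown entrypoint: {entry}")
    argv = ["nextflow", "run", "main.nf", "-profile", f"{param_set},{env}"]
    if entry is not None:
        argv += ["-entry", entry]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


print(shlex.join(build_command("standard", "conda", entry="only_liftoff")))
print(shlex.join(build_command("test", "docker", submission_wait_time=360)))
```

Returning a list rather than a single string means the result can be handed directly to `subprocess.run` without shell quoting concerns.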

Outputs

The following section walks through the outputs from the pipeline.

Pipeline Overview:

The workflow will generate outputs in the following order:

Validation
- Responsible for QC of metadata
- Aligns sample metadata .xlsx to sample .fasta
- Formats metadata into .tsv format
Annotation
- Extracts features from .gff
- Aligns features
- Annotates sample genomes outputting .gff
Submission
- Formats for database submission
- This section runs twice, with the second run occurring after a wait time to allow for all samples to be uploaded to NCBI. The only_update_submission entrypoint can be run as many times as necessary until all files are fully uploaded.

Output Directory Formatting:

The outputs are recorded in the directory specified within the nextflow.config file and will contain the following:

validation_outputs (**name configurable with val_output_dir)
- sample_metadata_run
  - errors
  - tsv_per_sample
liftoff_outputs (**name configurable with final_liftoff_output_dir)
- final_sample_metadata_file
  - errors
  - fasta
  - liftoff
  - tbl
submission_outputs (**name and path configurable with submission_output_dir)
- individual_sample_batch_info
  - biosample_sra
  - genbank
  - accessions.csv
- terminal_outputs
- commands_used
liftoffCommand.txt
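
A quick way to confirm that a run produced the top-level folders shown above is sketched below. The folder names are the ones listed in the layout and are configurable, so treat them as assumed defaults; the `missing_output_dirs` helper is hypothetical.

```python
import tempfile
from pathlib import Path

# Top-level folder names from the layout above; each one is configurable.
EXPECTED_DIRS = ("validation_outputs", "liftoff_outputs", "submission_outputs")


def missing_output_dirs(output_dir, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that are absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Demo on a throwaway directory that mimics a run without submission.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "validation_outputs").mkdir()
(demo_root / "liftoff_outputs").mkdir()
print(missing_output_dirs(demo_root))  # ['submission_outputs']
```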

Understanding Pipeline Outputs:

The pipeline outputs include:

metadata.tsv files for each sample
separate fasta files for each sample
separate gff files for each sample
separate tbl files containing feature information for each sample
submission log file
- This output is found in the submission_outputs folder in your specified output_dir
- If the file cannot be found, you can run the only_update_submission entrypoint for the pipeline

Parameters:

Default parameters are given in the nextflow.config file. This table lists the parameters that can be changed to a value, path, or true/false. When changing these parameters, pay attention to the required inputs and make sure that paths line up and values are within range. To change a parameter, either pass it as a flag after the nextflow command or edit it within your standard_params.config or standard.yaml file.

Please note the correct formatting and the default calculation of submission_wait_time at the bottom of the params table.

Input Files

Param	Description	Input Required
--fasta_path	Path to fasta file	Yes (path as string)
--ref_fasta_path	Reference Sequence file path	Yes (path as string)
--meta_path	Meta-data file path for samples	Yes (path as string)
--ref_gff_path	Reference gff file path for annotation	Yes (path as string)
--env_yml	Path to environment.yml file	Yes (path as string)

Run Environment

Param	Description	Input Required
--scicomp	Flag for whether running on Scicomp or not	Yes (true/false as bool)
--docker_container	Name of the Docker container	Yes, if running with docker profile (name as string)

General Subworkflow

Param	Description	Input Required
--run_submission	Toggle for running submission	Yes (true/false as bool)
--cleanup	Toggle for running cleanup subworkflows	Yes (true/false as bool)

Cleanup Subworkflow

Param	Description	Input Required
--clear_nextflow_log	Clears nextflow work log	Yes (true/false as bool)
--clear_nextflow_dir	Clears nextflow working directory	Yes (true/false as bool)
--clear_work_dir	Param to clear work directory created during workflow	Yes (true/false as bool)
--clear_conda_env	Clears conda environment	Yes (true/false as bool)
--clear_nf_results	Remove results from nextflow outputs	Yes (true/false as bool)

General Output

Param	Description	Input Required
--output_dir	File path to submit outputs from pipeline	Yes (path as string)
--overwrite_output	Toggle to overwriting output files in directory	Yes (true/false as bool)

Metadata Validation

Param	Description	Input Required
--val_output_dir	File path for outputs specific to validate sub-workflow	Yes (folder name as string)
--val_date_format_flag	Flag to change date output	Yes (-s, -o, or -v as string)
--val_keep_pi	Flag to keep personal identifying info if it is provided; otherwise it will return an error	Yes (true/false as bool)

Liftoff

Param	Description	Input Required
--final_liftoff_output_dir	File path to liftoff specific sub-workflow outputs	Yes (folder name as string)
--lift_print_version_exit	Print version and exit the program	Yes (true/false as bool)
--lift_print_help_exit	Print help and exit the program	Yes (true/false as bool)
--lift_parallel_processes	# of parallel processes to use for liftoff	Yes (integer)
--lift_delete_temp_files	Deletes the temporary files after finishing transfer	Yes (true/false as string)
--lift_child_feature_align_threshold	Designate a feature mapped only if its child features (usually exons/CDS) align with sequence identity ≥S	Yes (float)
--lift_unmapped_feature_file_name	Name of unmapped features file name	Yes (path as string)
--lift_copy_threshold	Minimum sequence identity in exons/CDS for which a gene is considered a copy; must be greater than -s; default is 1.0	Yes (float)
--lift_distance_scaling_factor	Distance scaling factor; by default D =2.0	Yes (float)
--lift_flank	Amount of flanking sequence to align as a fraction of gene length	Yes (float between [0.0-1.0])
--lift_overlap	Maximum fraction of overlap allowed by 2 features	Yes (float between [0.0-1.0])
--lift_mismatch	Mismatch penalty in exons when finding best mapping; by default M=2	Yes (integer)
--lift_gap_open	Gap open penalty in exons when finding best mapping; by default GO=2	Yes (integer)
--lift_gap_extend	Gap extend penalty in exons when finding best mapping; by default GE=1	Yes (integer)
--lift_infer_transcripts	Use if annotation file only includes exon/CDS features and does not include transcripts/mRNA	Yes (True/False as string)
--lift_copies	Look for extra gene copies in the target genome	Yes (True/False as string)
--lift_minimap_path	Path to minimap if you did not use conda or pip	Yes (N/A or path as string)
--lift_feature_database_name	Name of the feature database, if none, then will use ref gff path to construct one	Yes (N/A or name as string)

Submission

Param	Description	Input Required
--submission_output_dir	Either name or relative/absolute path for the outputs from submission	Yes (name or path as string)
--submission_prod_or_test	Whether to submit samples for test or actual production	Yes (prod or test as string)
--submission_only_meta	Full path directly to the dirs containing validate metadata files	Yes (path as string)
--submission_only_gff	Full path directly to the directory with reformatted GFFs	Yes (path as string)
--submission_only_fasta	Full path directly to the directory with split fastas for each sample	Yes (path as string)
--submission_config	Configuration file for submission to public repos	Yes (path as string)
--submission_wait_time	Calculated based on sample number (3 * 60 secs * sample_num)	integer (seconds)
--batch_name	Name of the batch to prefix samples with during submission	Yes (name as string)
--send_submission_email	Toggle email notification on/off	Yes (true/false as bool)
--req_col_config	Path to the required_columns.yaml file	Yes (path as string)
--processed_samples	Path to the directory containing processed samples for update only submission entrypoint (containing <batch_name>.<sample_name> dirs)	Yes (path as string)

** Important note about send_submission_email: An email is only triggered if Genbank is being submitted to AND table2asn is the genbank_submission_type. As for the recipient, this must be specified within your submission config file under 'general' as 'notif_email_recipient'
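
The condition in the note above can be written as a single boolean expression. In this sketch the function name and the `submits_to_genbank` argument are hypothetical; `send_submission_email`, `genbank_submission_type`, and the value `table2asn` are the names used in this README.

```python
def email_will_be_sent(send_submission_email, submits_to_genbank, genbank_submission_type):
    """Return True only when all three documented conditions for an email hold."""
    return bool(
        send_submission_email
        and submits_to_genbank
        and genbank_submission_type == "table2asn"
    )


print(email_will_be_sent(True, True, "table2asn"))   # True
print(email_will_be_sent(True, False, "table2asn"))  # False
```

Even when this evaluates to true, a recipient still has to be set under 'general' as 'notif_email_recipient' in the submission config.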

Helpful Links for Resources and Software Integrated with TOSTADAS:

🔗 Anaconda Install: https://docs.anaconda.com/anaconda/install/

🔗 Nextflow Documentation: https://www.nextflow.io/docs/latest/getstarted.html

🔗 SeqSender Documentation: https://github.com/CDCgov/seqsender

🔗 Liftoff Documentation: https://github.com/agshumate/Liftoff

🔗 VADR Documentation: https://github.com/ncbi/vadr.git

🔗 table2asn Documentation: https://github.com/svn2github/NCBI_toolkit/blob/master/src/app/table2asn/table2asn.cpp

Acknowledgements
