General disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
Use of this service is limited only to non-sensitive and publicly available data. Users must not use, share, or store any kind of sensitive data like health status, provision or payment of healthcare, Personally Identifiable Information (PII) and/or Protected Health Information (PHI), etc. under ANY circumstance.
Administrators for this service reserve the right to moderate all information used, shared, or stored with this service at any time. Any user that cannot abide by this disclaimer and Code of Conduct may be subject to action, up to and including revoking access to services.
The material embodied in this software is provided to you "as-is" and without warranty of any kind, express, implied or otherwise, including without limitation, any warranty of fitness for a particular purpose. In no event shall the Centers for Disease Control and Prevention (CDC) or the United States (U.S.) government be liable to you or anyone else for any direct, special, incidental, indirect or consequential damages of any kind, or any damages whatsoever, including without limitation, loss of profit, loss of use, savings or revenue, or the claims of third parties, whether or not CDC or the U.S. government has been advised of the possibility of such loss, however caused and on any theory of liability, arising out of or in connection with the possession, use or performance of this software.
mira-nf/mira is a bioinformatics pipeline that assembles Influenza genomes, SARS-CoV-2 genomes, the SARS-CoV-2 spike gene, and RSV genomes when given raw fastq files and a samplesheet. mira-nf/mira can analyze reads from both Illumina and Oxford Nanopore sequencing machines.
MIRA performs these steps for genome assembly and curation:
- Read QC (optional) (FastQC)
- Present QC for raw reads (optional) (MultiQC)
- Checking chemistry in fastq files (optional) (python)
- Subsampling for faster analysis (optional) (bbtools)
- Trimming and Quality Filtering (bbduk)
- Adapter removal (cutadapt)
- Genome Assembly (IRMA)
- Annotation of assembly (DAIS-ribosome)
- Collect results from IRMA and DAIS-Ribosome in json files
- Create html, excel files and amended consensus fasta files
- Reformat tables into parquet files and csv files
MIRA is able to analyze 7 data types:
- Flu-Illumina - Flu whole genome data created with an Illumina machine
- Flu-ONT - Flu whole genome data created with an Oxford Nanopore machine
- SC2-Whole-Genome-Illumina - SARS-CoV-2 whole genome data created with an Illumina machine
- SC2-Whole-Genome-ONT - SARS-CoV-2 whole genome data created with an Oxford Nanopore machine
- SC2-Spike-Only-ONT - SARS-CoV-2 spike protein data created with an Oxford Nanopore machine
- RSV-Illumina - RSV whole genome data created with an Illumina machine
- RSV-ONT - RSV whole genome data created with an Oxford Nanopore machine
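The data type names above follow a "<target>-<platform>" pattern, which can be handy when deciding which samplesheet layout applies. The Python sketch below is illustrative only and is not part of MIRA-NF; the helper name split_experiment_type is hypothetical, and it relies solely on the naming pattern visible in the list.

```python
# The seven data types listed above, copied verbatim.
EXPERIMENT_TYPES = [
    "Flu-Illumina",
    "Flu-ONT",
    "SC2-Whole-Genome-Illumina",
    "SC2-Whole-Genome-ONT",
    "SC2-Spike-Only-ONT",
    "RSV-Illumina",
    "RSV-ONT",
]


def split_experiment_type(experiment_type):
    """Split a data type name into (target, platform).

    Hypothetical helper: it assumes the platform is always the part
    after the last hyphen, as in the list above.
    """
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError(f"Unknown experiment type: {experiment_type!r}")
    target, platform = experiment_type.rsplit("-", 1)
    return target, platform
```

For example, split_experiment_type("SC2-Spike-Only-ONT") separates the ONT platform from the SC2-Spike-Only target.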
To run this pipeline you will need to have these programs installed:
- Nextflow - If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow.
- singularity-ce or docker - Information on how to install singularity-ce can be found here and information to install docker can be found here.
- git - Information about git installation can be found here.
Make sure to test your setup with -profile test,<singularity or docker> to ensure that everything is installed properly before running the workflow on actual data. If you would like to further test the pipeline using our test data, it can be downloaded from these links:
- Tiny test data from ONT Influenza genome and SARS-CoV-2-spike - 40Mb Download.
- Full test data set - the data set from above + full genomes of Influenza and SARS-CoV-2 from Illumina MiSeqs 1 Gb Download.
To run this pipeline with the MIRA setup:
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
Illumina data should be set up as follows:
Sample ID,Sample Type
sample_1,Test
sample_2,Test
sample_3,Test
sample_4,Test
Oxford Nanopore data should be set up as follows:
Barcode #,Sample ID,Sample Type
barcode07,s1,Test
barcode37,s2,Test
barcode41,s3,Test
Each row represents a sample.
Important things to note about the samplesheet:
- Sample names within the "Sample ID" column need to be unique.
- Be sure that no sample name is nested within another sample name (e.g. having both sample_1 and sample_1_1).
- Be sure that there are no empty lines at the end of the samplesheet.
- For Illumina samples, be sure that you have read 1 and read 2 for all samples in the samplesheet.
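Several of the notes above can be checked before launching a run. The Python sketch below is a minimal illustration, not part of MIRA-NF; the function name check_samplesheet is hypothetical, and it assumes that "nested" means one sample ID appearing as a substring of another.

```python
import csv
import io


def check_samplesheet(text):
    """Return a list of problems found in samplesheet CSV text.

    Hypothetical helper covering three of the notes above: unique
    sample IDs, no sample ID nested inside another (assumed to mean
    a substring match), and no empty lines at the end of the file.
    """
    problems = []

    # splitlines() ignores one final newline, so a blank last element
    # means there is a genuinely empty line at the end.
    lines = text.splitlines()
    if lines and not lines[-1].strip():
        problems.append("empty line at end of samplesheet")

    rows = csv.DictReader(io.StringIO(text))
    ids = [(row.get("Sample ID") or "").strip() for row in rows]
    ids = [sample_id for sample_id in ids if sample_id]

    seen = set()
    for sample_id in ids:
        if sample_id in seen:
            problems.append(f"duplicate Sample ID: {sample_id}")
        seen.add(sample_id)

    unique_ids = sorted(seen)
    for inner in unique_ids:
        for outer in unique_ids:
            if inner != outer and inner in outer:
                problems.append(f"Sample ID {inner} is nested within {outer}")

    return problems
```

An empty list means none of these checks failed. The check for paired Illumina reads is left out because it depends on fastq file naming, which is not described here.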
To use the nextflow samplesheet setup please refer to the usage document (../assets/usage.md). USING THE NEXTFLOW SAMPLESHEET SET UP WITH ONT DATA WILL REQUIRE YOU TO COMBINE ONT FASTQS YOURSELF.
Second, move the samplesheet into a run folder with the fastq files:
The Illumina run folder should be set up as follows:
- <RUN_PATH>/fastqs <- all fastqs should be at this level
- <RUN_PATH>/samplesheet.csv
The Oxford Nanopore run folder should be set up as follows:
- <RUN_PATH>/fastq_pass <- fastqs should be within barcode folders as given by the ONT machine
- <RUN_PATH>/samplesheet.csv
Note: The name of the run folder will be used to name the output files.
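The run folder layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of MIRA-NF; the function name check_run_folder and the platform labels "Illumina" and "ONT" are assumptions made for illustration.

```python
from pathlib import Path


def check_run_folder(run_path, platform):
    """Return a list of problems with the run folder layout.

    Hypothetical helper: platform is "Illumina" (expects a fastqs
    folder) or "ONT" (expects a fastq_pass folder). Both layouts
    expect samplesheet.csv directly inside the run folder.
    """
    if platform not in ("Illumina", "ONT"):
        raise ValueError(f"Unknown platform: {platform!r}")

    run_path = Path(run_path)
    problems = []

    if not (run_path / "samplesheet.csv").is_file():
        problems.append("missing samplesheet.csv")

    fastq_dir_name = "fastqs" if platform == "Illumina" else "fastq_pass"
    if not (run_path / fastq_dir_name).is_dir():
        problems.append(f"missing {fastq_dir_name} folder")

    return problems
```

Calling check_run_folder("<RUN_PATH>", "ONT") returns an empty list when both the samplesheet and the fastq_pass folder are present.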
Third, pull the MIRA-NF workflow from GitHub using:
git clone https://github.com/CDCgov/MIRA-NF.git
cd MIRA-NF
Note: the dev branch is being used temporarily.
Now, you can run the pipeline using two methods: locally or within a high-performance computing (HPC) cluster. In both cases you will need to launch the workflow from the MIRA-NF folder.
- profile - Options: singularity, docker, local, sge, slurm. You can use docker or singularity. Use local for running on a local computer.
- input - <RUN_PATH>/samplesheet.csv with the format described above. The full file path is required.
- outdir - The file path to where you would like the output directory to write the files. The full file path is required.
- runpath - The <RUN_PATH> where the samplesheet is located. Your fastq_folder and samplesheet.csv should be in here. The full file path is required.
- e - Experiment type. Options: Flu-ONT, SC2-Spike-Only-ONT, Flu-Illumina, SC2-Whole-Genome-ONT, SC2-Whole-Genome-Illumina, RSV-Illumina, RSV-ONT.
All parameters listed below are optional unless noted; when one is left out of the run command, its default is used.
- p - Provide a built-in primer schema if using experiment type SC2-Whole-Genome-Illumina. SARS-CoV-2 options: articv3, articv4, articv4.1, articv5.3.2, qiagen, swift, swift_211206. RSV options: RSV_CDC_8amplicon_230901. Will be overwritten by the custom_primers flag if both flags are provided.
- custom_primers - Provide a custom primer schema by entering the file path to your own custom primer fasta file. Must be fasta formatted. Trimming will only work with custom primers that are longer than 15 bp.
- read_qc - (optional) Run FastQC and MultiQC. Default: false.
- reformat_tables - (optional) Flag to reformat report tables into parquet files and csv files (boolean). Default: false.
- subsample_reads - (optional) The number of reads to use for subsampling. Paired reads for Illumina data and single reads for ONT data. Default: 0, which skips the subsampling process.
- process_q - (required for hpc profile) Provide the name of the processing queue that jobs will be submitted to.
- email - (optional) Provide an email address if you would like to receive an email with the IRMA summary upon completion.
- irma_module - (optional) Call the flu-sensitive, flu-secondary or flu-utr IRMA module instead of the built-in flu configs. By default these modules are not used, and they can only be invoked for the Flu-Illumina experiment type. Options: sensitive, secondary or utr.
- custom_irma_config - (optional) Provide a custom IRMA config file to be used with IRMA assembly. File path to file needed.
- custom_qc_settings - (optional) Provide custom qc pass/fail settings for constructing the summary files. Default settings can be found in ../bin/irma_config/qc_pass_fail_settings.yaml. File path to file needed.
- amd_platform - (optional) This flag allows the user to skip the "Nextflow samplesheet creation" step. It will require the user to provide a different samplesheet that is described under "Nextflow samplesheet setup" in the usage.md document. Please read the usage.md fully before implementing this flag. Default: false. Options: true or false.
- ecr_registry - (optional) Allows a user to pass their ECR registry for AWS to the workflow.
- sourcepath - (optional) If the sourcepath flag is given, the workflow will use it to point to the reference files, primer fastas and support files in all trimming modules, prepareIRMAjson and staticHTML. This flag is for cases where the entire repo cannot be placed in the working directory.
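The documented interaction between the p and custom_primers flags (custom primers take precedence when both are given) can be sketched in Python. This is an illustration only, not MIRA-NF code; the function name resolve_primers and the returned labels are hypothetical, and the schema names are copied from the parameter description above.

```python
# Built-in primer schema names as listed in the parameter description.
BUILT_IN_PRIMER_SCHEMAS = {
    "articv3",
    "articv4",
    "articv4.1",
    "articv5.3.2",
    "qiagen",
    "swift",
    "swift_211206",
    "RSV_CDC_8amplicon_230901",
}


def resolve_primers(p=None, custom_primers=None):
    """Decide which primer source applies.

    Hypothetical helper mirroring the documented rule: a custom
    primer fasta overrides a built-in schema when both are provided.
    Returns None when neither flag is set.
    """
    if custom_primers:
        return ("custom", custom_primers)
    if p is None:
        return None
    if p not in BUILT_IN_PRIMER_SCHEMAS:
        raise ValueError(f"Unknown built-in primer schema: {p!r}")
    return ("built-in", p)
```

With both flags set, the custom fasta path is returned and the built-in schema name is ignored.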
To run locally, you will need to install Nextflow and singularity-ce or docker on your computer (see links above for details), or you can use an interactive session on an HPC. The command will be run as seen below:
nextflow run ./main.nf \
-profile singularity,local \
--input <RUN_PATH>/samplesheet.csv \
--outdir <OUTDIR> \
--runpath <RUN_PATH> \
--e <EXPERIMENT_TYPE> \
--p <PRIMER_SCHEMA> (optional) \
--custom_primers <FILE_PATH>/custom_primer.fasta (optional) \
--subsample_reads <READ_COUNT> (optional) \
--reformat_tables true (optional) \
--read_qc false (optional)
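Because the "(optional)" annotations in the command above must be removed before running it, it can help to assemble the command programmatically. The Python sketch below is a minimal illustration, not part of MIRA-NF; the helper name build_mira_command and the example paths are hypothetical, while the flag names come from the command above.

```python
import shlex


def build_mira_command(input_csv, outdir, runpath, experiment_type,
                       profile="singularity,local", **optional):
    """Assemble the nextflow run command as an argument list.

    Hypothetical helper: optional keyword arguments (for example p,
    subsample_reads, read_qc) are appended only when given, so any
    omitted parameter falls back to the pipeline default.
    """
    cmd = [
        "nextflow", "run", "./main.nf",
        "-profile", profile,
        "--input", str(input_csv),
        "--outdir", str(outdir),
        "--runpath", str(runpath),
        "--e", experiment_type,
    ]
    for name, value in optional.items():
        if value is None:
            continue
        if isinstance(value, bool):
            # Booleans are written as true/false, as in the command above.
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    return cmd


# Example with made-up paths; replace them with full paths to real data.
cmd = build_mira_command(
    "/data/run1/samplesheet.csv", "/data/out", "/data/run1", "Flu-ONT",
    subsample_reads=10000, read_qc=False,
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell from the MIRA-NF folder, or the list can be passed directly to subprocess.run.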
To run in a high-performance computing cluster, you will need to add sge or slurm to the profile and provide a queue name for the queue that you would like jobs to be submitted to:
nextflow run ./main.nf \
-profile singularity,sge \
--input <RUN_PATH>/samplesheet.csv \
--outdir <OUTDIR> \
--runpath <RUN_PATH> \
--e <EXPERIMENT_TYPE> \
--p <PRIMER_SCHEMA> (optional) \
--custom_primers <FILE_PATH>/custom_primer.fasta (optional) \
--process_q <QUEUE_NAME> \
--reformat_tables true (optional) \
--email <EMAIL_ADDRESS> (optional) \
--read_qc false (optional)
For running MIRA-NF in AWS, example parameter json files for all data types can be found under ../samples/examples.
For in house testing:
qsub MIRA_nextflow.sh \
-d <FILE_PATH_TO_MIRA-NF_DIR> \
-f singularity,sge \
-i <RUN_PATH>/samplesheet.csv \
-o <OUTDIR> \
-r <RUN_PATH> \
-e <EXPERIMENT_TYPE> \
-p <PRIMER_SCHEMA> (optional) \
-g <FILE_PATH>/custom_primer.fasta (optional) \
-q <QUEUE_NAME> \
-a <REFORMAT_TABLES> (optional) \
-c <SUBSAMPLED_READ_COUNTS> (optional) \
-b <OTHER_IRMA_MODULE> (optional) \
-m <EMAIL_ADDRESS> (optional) \
-k <READ_QC> (optional)
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
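One way to follow this advice is to write the parameters to a JSON file and pass it with -params-file. The Python sketch below is a minimal illustration, not part of MIRA-NF; the helper name write_params_file and the file name params.json are hypothetical, and it assumes each key matches a pipeline flag without the leading dashes.

```python
import json


def write_params_file(path, **params):
    """Write pipeline parameters to a JSON file for -params-file.

    Hypothetical helper: parameters set to None are dropped so that
    the pipeline defaults apply to them.
    """
    cleaned = {name: value for name, value in params.items() if value is not None}
    with open(path, "w") as handle:
        json.dump(cleaned, handle, indent=2)
    return path
```

For example, write_params_file("params.json", input="<RUN_PATH>/samplesheet.csv", outdir="<OUTDIR>", runpath="<RUN_PATH>", e="Flu-ONT") produces a file that could then be used as nextflow run ./main.nf -profile singularity,local -params-file params.json.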
mira-nf/mira is developed and maintained by Ben Rambo-Martin, Kristine Lacek, Reina Chau, and Amanda Sullivan.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.