This workflow replicates the QA protocol implemented at JGI for Illumina reads.
This workflow uses the program rqcfilter2 from BBTools to perform quality control on raw Illumina short reads. The workflow performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), and performs human/cat/dog/mouse/microbe removal (using BBMap).
This workflow also performs quality control on long reads from PacBio. For long reads, it performs duplicate removal (using pbmarkdup), inverted repeat filtering (using BBTools icecreamfinder.sh), adapter trimming, and final filtering of reads with residual adapter sequences (using BBDuk). The workflow is designed to handle input files in various formats, including .bam, .fq, or .fq.gz.
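As a rough illustration of the input formats listed above, the sketch below classifies a reads file by its extension before it is handed to the workflow. `detect_format` and `KNOWN_FORMATS` are hypothetical names that are not part of the workflow, and the `.fastq` and `.fastq.gz` entries are an added assumption based on the test file name used later in this document.

```python
from pathlib import Path

# Extensions named in the text, plus ".fastq" variants as a labeled assumption.
KNOWN_FORMATS = {
    ".bam": "bam",
    ".fq": "fastq",
    ".fq.gz": "fastq-gzip",
    ".fastq": "fastq",          # assumption: not listed in the text
    ".fastq.gz": "fastq-gzip",  # assumption: not listed in the text
}


def detect_format(path: str) -> str:
    """Return a format label for a reads file based on its extension.

    Hypothetical helper for illustration; the workflow itself may detect
    formats differently.
    """
    name = Path(path).name.lower()
    # Try the longest extensions first so ".fq.gz" is matched before ".fq".
    for suffix in sorted(KNOWN_FORMATS, key=len, reverse=True):
        if name.endswith(suffix):
            return KNOWN_FORMATS[suffix]
    raise ValueError(f"unsupported reads file: {path}")


if __name__ == "__main__":
    for example in ["movie.bam", "sample.fq", "sample.fq.gz"]:
        print(example, "->", detect_format(example))
```

A check like this only inspects the file name; it does not confirm that the file content matches the extension.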
- RQCFilterData Database: a 106 GB tar file that includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and host genomes.
- Prepare the Database
mkdir -p refdata
wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz
tar xvzf RQCFilterData.tgz -C refdata
rm RQCFilterData.tgz
- the path to the interleaved fastq file (long reads and short reads)
- forward reads fastq file (when input_interleaved is false)
- reverse reads fastq file (when input_interleaved is false)
- project id
- whether the input is interleaved (boolean)
- whether the input is short reads (boolean)
{
"rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
"rqcfilter.input_fq1": [],
"rqcfilter.input_fq2": [],
"rqcfilter.proj": "nmdc:xxxxxxx",
"rqcfilter.interleaved": true,
"rqcfilter.shortRead": true
}
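The input JSON above can also be generated programmatically. The following is a minimal sketch, assuming the six keys shown in the example are the complete set; `build_inputs` is a hypothetical helper, not part of the workflow. Deriving the interleaved flag from the presence of interleaved files and requiring forward and reverse lists of equal length are assumptions made for illustration.

```python
import json


def build_inputs(project_id, interleaved_files=None, fq1=None, fq2=None,
                 short_read=True):
    """Build the workflow input dictionary (hypothetical helper)."""
    interleaved_files = list(interleaved_files or [])
    fq1 = list(fq1 or [])
    fq2 = list(fq2 or [])
    # Assumption: the input counts as interleaved whenever interleaved files are given.
    interleaved = bool(interleaved_files)
    if interleaved and (fq1 or fq2):
        raise ValueError("give interleaved files or forward/reverse files, not both")
    # Assumption: paired input needs one reverse file per forward file.
    if not interleaved and (not fq1 or len(fq1) != len(fq2)):
        raise ValueError("forward and reverse lists must be non-empty and equal in length")
    return {
        "rqcfilter.input_files": interleaved_files,
        "rqcfilter.input_fq1": fq1,
        "rqcfilter.input_fq2": fq2,
        "rqcfilter.proj": project_id,
        "rqcfilter.interleaved": interleaved,
        "rqcfilter.shortRead": short_read,
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "nmdc:xxxxxxx",
        interleaved_files=[
            "https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"
        ],
    )
    # Print the dictionary as JSON, ready to be saved as an input file.
    print(json.dumps(inputs, indent=2))
```

Building the file this way catches a mismatch between the interleaved flag and the file lists before the workflow is launched.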
The output consists of one directory named after the prefix of the input fastq file and a set of output files, including summary statistics, a status log, and a shell script that reproduces the steps.
The main QC fastq output is named prefix.anqdpht.fastq.gz.
# Short Reads
output/
├── nmdc_xxxxxxx_filtered.fastq.gz
├── nmdc_xxxxxxx_filterStats.txt
├── nmdc_xxxxxxx_filterStats2.txt
├── nmdc_xxxxxxx_readsQC.info
└── nmdc_xxxxxxx_qa_stats.json
# Long Reads
output/
├── nmdc_xxxxxxx_pbmarkdupStats.txt
├── nmdc_xxxxxxx_readsQC.info
├── nmdc_xxxxxxx_bbdukEndsStats.json
├── nmdc_xxxxxxx_icecreamStats.json
├── nmdc_xxxxxxx_filtered.fastq.gz
└── nmdc_xxxxxxx_stats.json
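The listings above suggest a simple naming pattern: the project id with `:` replaced by `_`, followed by a fixed suffix. The sketch below reproduces the two listings from a project id so that a run can be checked for missing files. The `:` to `_` mapping is inferred from the example (`nmdc:xxxxxxx` and `nmdc_xxxxxxx_...`) and is not a documented rule; `expected_outputs` is a hypothetical helper.

```python
# File name suffixes copied from the two output listings in this document.
SHORT_READ_SUFFIXES = [
    "filtered.fastq.gz",
    "filterStats.txt",
    "filterStats2.txt",
    "readsQC.info",
    "qa_stats.json",
]
LONG_READ_SUFFIXES = [
    "pbmarkdupStats.txt",
    "readsQC.info",
    "bbdukEndsStats.json",
    "icecreamStats.json",
    "filtered.fastq.gz",
    "stats.json",
]


def expected_outputs(project_id, short_read=True, outdir="output"):
    """List the output paths expected for a project id (hypothetical helper)."""
    # Assumption: the file prefix is the project id with ":" replaced by "_".
    prefix = project_id.replace(":", "_")
    suffixes = SHORT_READ_SUFFIXES if short_read else LONG_READ_SUFFIXES
    return [f"{outdir}/{prefix}_{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    for path in expected_outputs("nmdc:xxxxxxx", short_read=False):
        print(path)
```

Comparing this list against the contents of the output directory gives a quick completeness check after a run.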
Please refer here for more information.