This workflow replicates the QA protocol implemented at JGI for Illumina reads. It uses the program “rqcfilter2” from BBTools (38:96), which implements the protocol as a pipeline.
RQCFilterData Database: a 106G tar file that includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and host genomes.
Prepare the Database
mkdir -p refdata
wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz
tar xvzf RQCFilterData.tgz -C refdata
rm RQCFilterData.tgz
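The commands above remove the tarball whether or not extraction succeeded. As a minimal sketch, the same steps can be wrapped so that the 106G download is deleted only after tar exits cleanly. The function name prepare_db and its two arguments are hypothetical and not part of the workflow.

```shell
# Hypothetical helper (not part of the workflow): unpack a database tarball
# into a target directory and delete the tarball only if extraction succeeds.
prepare_db() {
    tarball="$1"   # e.g. RQCFilterData.tgz
    dest="$2"      # e.g. refdata
    mkdir -p "$dest" || return 1
    # On a truncated or corrupt archive, stop here and leave the tarball in place.
    tar xzf "$tarball" -C "$dest" || return 1
    rm -f "$tarball"
}
```

Assuming the download finished, it could be called as: prepare_db RQCFilterData.tgz refdata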
Description of the files:
- .wdl file: the WDL file for workflow definition
- .json file: the example input for the workflow
- .conf file: the conf file for running Cromwell.
- .sh file: the shell script for running the example workflow
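To show how these four files relate, here is a rough sketch of a Cromwell invocation that takes the WDL, the input JSON, and the conf file. It is an assumption about what a run script might do, not the content of the provided .sh file. The function name cromwell_cmd, the file names, and the cromwell.jar path are all placeholders.

```shell
# Sketch only: assemble a Cromwell command line from the workflow files.
# All file names are supplied by the caller; cromwell.jar is an assumed path.
cromwell_cmd() {
    wdl="$1"                  # the .wdl workflow definition
    inputs="$2"               # the .json input file
    conf="$3"                 # the .conf file for Cromwell
    jar="${4:-cromwell.jar}"  # location of the Cromwell jar (assumption)
    # Print the command instead of running it, so it can be inspected first.
    printf 'java -Dconfig.file=%s -jar %s run -i %s %s\n' \
        "$conf" "$jar" "$inputs" "$wdl"
}
```

The printed line can be reviewed and then pasted into a terminal or a batch script.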
Workflow inputs:
- the path to the database directory
- the path to the fastq file(s) ([R1, R2] if not interleaved)
- input_interleaved (boolean)
- output file prefix
- (optional) parameters for memory
- (optional) number of threads requested
{
"metaTReadsQC.input_files": ["https://portal.nersc.gov/project/m3408//test_data/metaT/SRR11678315.fastq.gz"],
"metaTReadsQC.proj":"SRR11678315-int-0.1",
"metaTReadsQC.rqc_mem": 180,
"metaTReadsQC.rqc_thr": 64,
"metaTReadsQC.database": "/refdata/"
}
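For a different sample, an input file with the same keys can be generated rather than edited by hand. The sketch below assumes only the five keys shown in the example above. The function name write_inputs and its argument order are hypothetical.

```shell
# Hypothetical helper: print a Cromwell input file using the same keys as the
# example above. Values are inserted as-is, so they must not contain double quotes.
write_inputs() {
    fastq="$1"        # path or URL of the fastq file
    proj="$2"         # output file prefix
    db="$3"           # path to the database directory
    mem="${4:-180}"   # memory, defaults to the value in the example
    thr="${5:-64}"    # threads, defaults to the value in the example
    cat <<EOF
{
  "metaTReadsQC.input_files": ["$fastq"],
  "metaTReadsQC.proj": "$proj",
  "metaTReadsQC.rqc_mem": $mem,
  "metaTReadsQC.rqc_thr": $thr,
  "metaTReadsQC.database": "$db"
}
EOF
}
```

For example, write_inputs reads.fastq.gz sampleA /refdata/ > input.json writes a file that can be checked with python3 -m json.tool input.json if Python 3 is available.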
The output is a single directory named after the prefix of the fastq input file. It contains the output files, including statistics, a status log, and a shell script to reproduce the steps.
The main QC fastq output is named prefix.fastq.gz.
|-- nmdc_xxxxxxx_filtered.fastq.gz
|-- nmdc_xxxxxxx_filterStats.txt
|-- nmdc_xxxxxxx_filterStats2.txt
|-- nmdc_xxxxxxx_qa_stats.json
|-- filtered/adaptersDetected.fa
|-- filtered/reproduce.sh
|-- filtered/spikein.fq.gz
|-- filtered/status.log
|-- ...
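After a run, a quick check that the main files are present can catch a failed or partial job. The sketch below assumes the naming pattern shown in the listing above, with nmdc_xxxxxxx standing in for the actual prefix. The function name check_outputs is hypothetical, and only a subset of the listed files is checked.

```shell
# Hypothetical helper: confirm that the main output files from the listing
# exist under an output directory for a given file prefix.
check_outputs() {
    dir="$1"       # output directory
    prefix="$2"    # file prefix, e.g. the nmdc_xxxxxxx part of the listing
    missing=0
    for f in "${prefix}_filtered.fastq.gz" "${prefix}_filterStats.txt" \
             "${prefix}_filterStats2.txt" "${prefix}_qa_stats.json" \
             "filtered/status.log"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 when every file was found, 1 otherwise.
    return "$missing"
}
```

It reports each missing file on standard error, so it can be used in a script as: check_outputs outdir prefix || echo "run incomplete"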