The Data Preprocessing Workflow

Summary

This workflow replicates the QA protocol implemented at JGI for Illumina reads.

For short reads, this workflow uses the program rqcfilter2 from BBTools to perform quality control on raw Illumina reads. It performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), followed by human/cat/dog/mouse/microbe removal (using BBMap).

For long reads, this workflow performs quality control on PacBio reads. It performs duplicate removal (using pbmarkdup), inverted repeat filtering (using BBTools icecreamfinder.sh), adapter trimming, and final filtering of reads with residual adapter sequences (using BBDuk). It accepts input files in several formats, including .bam, .fq, and .fq.gz.

Required Database

RQCFilterData Database: a 106 GB tar file that includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and host genomes.

Prepare the Database

	mkdir -p refdata
	wget https://portal.nersc.gov/cfs/m3408/db/RQCFilterData.tgz
	tar xvzf RQCFilterData.tgz -C refdata
	rm RQCFilterData.tgz
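
Because the commands above keep the downloaded archive on disk until extraction finishes, the archive and the extracted copy coexist for a while. The following is a minimal pre-flight sketch, not part of the workflow. The helper name `has_room` and the 2x headroom factor are assumptions made for illustration; the extracted size of the database is not stated here, so adjust the factor to your own measurements.

```python
import shutil

# Archive size taken from the description above (106 GB).
ARCHIVE_BYTES = 106 * 10**9


def has_room(path, archive_bytes=ARCHIVE_BYTES, factor=2.0):
    """Return True if the filesystem holding `path` has enough free space.

    `factor` is an assumed safety multiplier covering the archive plus
    its extracted contents; it is not a documented requirement.
    """
    free = shutil.disk_usage(path).free
    return free >= archive_bytes * factor


if __name__ == "__main__":
    # Check the current directory, where the download would land.
    if has_room("."):
        print("Enough free space to download and extract RQCFilterData.tgz")
    else:
        print("Not enough free space; free some disk or choose another location")
```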

The Docker image and Dockerfile can be found here

microbiomedata/bbtools:38.96

Input files

Path to the interleaved fastq file (long reads and short reads)
Forward reads fastq file (when input_interleaved is false)
Reverse reads fastq file (when input_interleaved is false)
Project ID
Whether the input is interleaved (boolean)
Whether the input is short reads (boolean)

{
    "rqcfilter.input_files": ["https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"],
    "rqcfilter.input_fq1": [],
    "rqcfilter.input_fq2": [],
    "rqcfilter.proj": "nmdc:xxxxxxx",
    "rqcfilter.interleaved": true,
    "rqcfilter.shortRead": true
}
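
The input JSON above can also be generated programmatically, which helps when switching between interleaved and paired inputs. The sketch below is illustrative only: the helper `build_inputs` is hypothetical, and its validation rule (paired short reads go in `input_fq1`/`input_fq2`, everything else in `input_files`) is an assumption inferred from the list of inputs above, not a rule confirmed by the workflow itself.

```python
import json


def build_inputs(proj, input_files=(), fq1=(), fq2=(),
                 interleaved=True, short_read=True):
    """Build a dict with the same keys as the example input JSON.

    Hypothetical helper; the checks encode an assumed reading of the
    input list, not behaviour verified against the WDL.
    """
    input_files, fq1, fq2 = list(input_files), list(fq1), list(fq2)
    if short_read and not interleaved:
        # Paired short reads: forward and reverse files must match up.
        if not fq1 or len(fq1) != len(fq2):
            raise ValueError("fq1 and fq2 must be non-empty and equal in length")
    elif not input_files:
        raise ValueError("input_files must list at least one file")
    return {
        "rqcfilter.input_files": input_files,
        "rqcfilter.input_fq1": fq1,
        "rqcfilter.input_fq2": fq2,
        "rqcfilter.proj": proj,
        "rqcfilter.interleaved": interleaved,
        "rqcfilter.shortRead": short_read,
    }


if __name__ == "__main__":
    inputs = build_inputs(
        "nmdc:xxxxxxx",
        input_files=[
            "https://portal.nersc.gov/project/m3408//test_data/smalltest.int.fastq.gz"
        ],
    )
    # Write the result in the same layout as input.json.
    print(json.dumps(inputs, indent=4))
```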

Output files

The output is a single directory named after the prefix of the fastq input file. It contains a set of output files, including summary statistics, a status log, and a shell script that reproduces the steps.

The main QC fastq output is named prefix.anqdpht.fastq.gz.

* Short Reads
    output/
    ├── nmdc_xxxxxxx_filtered.fastq.gz
    ├── nmdc_xxxxxxx_filterStats.txt
    ├── nmdc_xxxxxxx_filterStats2.txt
    ├── nmdc_xxxxxxx_readsQC.info
    └── nmdc_xxxxxxx_qa_stats.json
* Long Reads
    output/
    ├── nmdc_xxxxxxx_pbmarkdupStats.txt
    ├── nmdc_xxxxxxx_readsQC.info
    ├── nmdc_xxxxxxx_bbdukEndsStats.json
    ├── nmdc_xxxxxxx_icecreamStats.json
    ├── nmdc_xxxxxxx_filtered.fastq.gz
    └── nmdc_xxxxxxx_stats.json
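
After a run, the listings above can be used to check that every expected file was produced. The sketch below is a hedged illustration: the helpers `expected_outputs` and `missing_outputs` are hypothetical, and the rule that the file prefix is the project ID with ":" replaced by "_" is an assumption inferred from the example names (nmdc:xxxxxxx and nmdc_xxxxxxx).

```python
import os

# File name suffixes copied from the two output listings above.
SHORT_READ_SUFFIXES = [
    "filtered.fastq.gz",
    "filterStats.txt",
    "filterStats2.txt",
    "readsQC.info",
    "qa_stats.json",
]
LONG_READ_SUFFIXES = [
    "pbmarkdupStats.txt",
    "readsQC.info",
    "bbdukEndsStats.json",
    "icecreamStats.json",
    "filtered.fastq.gz",
    "stats.json",
]


def expected_outputs(proj, short_read=True, outdir="output"):
    """List the output paths expected for a project ID (assumed naming rule)."""
    prefix = proj.replace(":", "_")
    suffixes = SHORT_READ_SUFFIXES if short_read else LONG_READ_SUFFIXES
    return [f"{outdir}/{prefix}_{suffix}" for suffix in suffixes]


def missing_outputs(proj, short_read=True, outdir="output"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(proj, short_read, outdir)
            if not os.path.exists(p)]


if __name__ == "__main__":
    for path in missing_outputs("nmdc:xxxxxxx", short_read=True):
        print("missing:", path)
```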

Link to Doc Site

Please refer here for more information.
