-
Notifications
You must be signed in to change notification settings - Fork 10
RRBS Pipeline ReadMe
MikeWLloyd edited this page Apr 11, 2024
·
16 revisions
• Step 1: FASTQC quality statistics
• Step 2: Read trimming with `Trim-galore` and trimmed read FASTQC quality statistics
• Step 3: Bismark alignment with either Bowtie2 or Hisat2
• Step 4: Bismark de-duplication [OPTIONAL]
• Step 5: Bismark methylation calling
• Step 6: Bismark methylation extraction
• Step 7: MultiQC report generation
flowchart TD
p1((Sample))
p2[FASTQC]
p3[TRIM_GALORE]
p4[BISMARK_ALIGNMENT]
p5[SAMTOOLS_SORT]
p6[SAMTOOLS_INDEX]
p7[BISMARK_DEDUPLICATION]
p8[BISMARK_METHYLATION_EXTRACTION]
p9[MULTIQC]
o1([Genomic BAM]):::output
o2([Bismark Output]):::output
o3([MultiQC Report]):::output
p1 --> |Raw Reads| p2
p1 --> |Raw Reads| p3
subgraph calling [ ]
p3 --> p4
p4 --> p5
p5 --> p6
p6 --> o1
o1 -..-> |Deduplication used | p7
p7 -..-> |Deduplication used | p8
o1 --> |No deduplication used: Default| p8
p8 --> o2
end
subgraph qc [ ]
p2 --> p9
p3 --> p9
p4 --> p9
p7 -..-> p9
p8 --> p9
p9 --> o3
end
style qc stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
-
--pubdir
- Default:
/<PATH>
- Comment: The directory that the saved outputs will be stored.
- Default:
-
--organize_by
- Default:
sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
- Default:
-
--cacheDir
- Default:
/projects/omics_share/meta/containers
- Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
- Default:
-
--sample_folder
- Default:
/<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- Default:
-
--extension
- Default:
.fastq.gz
- Comment: The expected extension for the input read files.
- Default:
-
--pattern
- Default:
"*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this
READ_NAME_R1_MoreText.fastq.gz
orREAD_NAME_R1.fastq.gz
- Default:
-
--read_type
- Default:
PE
- Comment: Options:
PE
andSE
. Default:PE
. Type of reads: paired end (PE) or single end (SE).
- Default:
-
--concat_lanes
- Default:
false
- Comment: Options:
false
andtrue
. Default:false
. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- Default:
-
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, *
--download_data
can be specified.
-
--download_data
- Default: null
- Comment: Requires *
--csv_input
. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
-
--gen_org
- Default:
mouse
- Comment: Options:
mouse
andhuman
.
- Default:
-
--genome_build
- Default:
GRCm38
- Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
- Default:
-
--non_directional
- Default:
true
- Comment: Options: true and false. Selecting this option for non-directional RRBS libraries will screen quality-trimmed sequences for CAA or CGA at the start of the read and, if found, removes the first two base pairs.
- Default:
-
--trimLength
- Default:
30
- Comment: Discard reads that became shorter than length 'INT' because of either quality or adapter trimming. A value of 0 effectively disables this behavior.
- Default:
-
--qualThreshold
- Default:
30
- Comment: Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal).
- Default:
-
--adapOverlap
- Default:
1
- Comment: Stringency for overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1, i.e. even a single base pair of overlapping sequence will be trimmed of the 3' end of any read.
- Default:
-
--adaptorSeq
- Default:
'AGATCGGAAGAGC'
- Comment: Adapter sequence to be trimmed. This sequence is the standard Illumina adapter sequence.
- Default:
-
--seedLength
- Default:
20
- Comment: Sets the length of the seed substrings to align during multiseed alignment. Smaller values make alignment slower but more sensitive
- Default:
-
--seedMismatch
- Default:
0
- Comment: Sets the number of mismatches to be allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity.
- Default:
-
--MinInsert
- Default:
0
- Comment: The minimum insert size for valid paired-end alignments. E.g. if -I 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as -X is also satisfied). A 19-bp gap would not be valid in that case.
- Default:
-
--MaxInsert
- Default:
1000
- Comment: The maximum insert size for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case
- Default:
-
--ref_fa_index
- Default:
'/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bismark/bowtie2'
- Comment: Pre-compiled Bismark Bowtie2 index files. Points to human reference when --gen_org human. JAX users should change this parameter only if changing aligner used.
- Default:
-
--aligner
- Default:
'bowtie2'
- Comment: Options: bowtie2 and hisat2. Bismark alignment tool. If hisat2 is specified, change
ref_fa_index
to the appropriate index files.
- Default:
-
--skip_deduplication
- Default:
true
- Comment: Options:
true
andfalse
. Default:true
. Iffalse
Bismark Deduplication will be used. "It is important to note that de-duplication is not recommended for RRBS, amplicon or other target enrichment-type libraries. This step is recommended for only whole-genome bisulfite samples."
- Default:
-
--cytosine_report
- Default:
false
- Comment: After the conversion to bedGraph has completed, the option --cytosine_report produces a genome-wide methylation report for all cytosines in the genome
- Default:
-
--comprehensive
- Default:
true
- Comment: Specifying this option will merge all four possible strand-specific methylation info into context-dependent output files.
- Default:
NOTE: *
Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
rrbs_report.html |
Nextflow autogenerated report |
trace.txt |
Nextflow trace of processes |
multiqc_report.html |
The MultiQC report showing all summary statistics found in the MultiQC directory |
trimmed_fastq |
Directory containing trimmed FASTQ files. Output from Trim-Galore. |
bam |
Directory containing BAM alignment files. If skip_deduplication = false de-duplicated BAM file is also present. |
stats |
Directory containing all summary statistic TXT and HTML files. These files are summarized in the multiqc_report.html . |
bismark_results |
Directory containing results from Bismark. See the Bismark Documentation for additional information. |
The required input header is: sampleID,lane,fastq_1,fastq_2
. Samples can be provided either paired or un-paired.
- The
sampleID
column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID. - The
lane
column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis. - The
fastq_1
andfastq_2
columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz