RRBS Pipeline ReadMe

Reduced-Representation Bisulfite Sequencing (RRBS) Documentation

Reduced Representation Bisulfite Sequencing Pipeline (--workflow rrbs)

•	Step 1: FASTQC quality statistics  
•	Step 2: Read trimming with `Trim-galore` and trimmed read FASTQC quality statistics   
•	Step 3: Bismark alignment with either Bowtie2 or Hisat2  
•	Step 4: Bismark de-duplication [OPTIONAL]  
•	Step 5: Bismark methylation calling  
•	Step 6: Bismark methylation extraction   
•	Step 7: MultiQC report generation

RRBS Flowchart

flowchart TD
	p1((Sample))
    p2[FASTQC]
    p3[TRIM_GALORE]
    p4[BISMARK_ALIGNMENT]
    p5[SAMTOOLS_SORT]
    p6[SAMTOOLS_INDEX]
    p7[BISMARK_DEDUPLICATION]
    p8[BISMARK_METHYLATION_EXTRACTION]
    p9[MULTIQC]
    o1([Genomic BAM]):::output
    o2([Bismark Output]):::output
    o3([MultiQC Report]):::output

    p1 --> |Raw Reads| p2
    p1 --> |Raw Reads| p3
    subgraph calling [  ]

        p3 --> p4
        p4 --> p5
        p5 --> p6
        p6 --> o1
        o1 -..-> |Deduplication used | p7
        p7 -..->  |Deduplication used | p8
        o1 --> |No deduplication used: Default| p8
        p8 --> o2
    end

    subgraph qc [  ]
    p2 --> p9
    p3 --> p9
    p4 --> p9
    p7 -..-> p9
    p8 --> p9
    p9 --> o3
    end

style qc stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for RRBS Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory that the saved outputs will be stored.
--organize_by
- Default: sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
--cacheDir
- Default: /projects/omics_share/meta/containers
- Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
-w
- Default: /<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
--sample_folder
- Default: /<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
--extension
- Default: .fastq.gz
- Comment: The expected extension for the input read files.
--pattern
- Default: "*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
--read_type
- Default: PE
- Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
--concat_lanes
- Default: false
- Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, * --download_data can be specified.
--download_data
- Default: null
- Comment: Requires * --csv_input. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
--gen_org
- Default: mouse
- Comment: Options: mouse and human.
--genome_build
- Default: GRCm38
- Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
--non_directional
- Default: true
- Comment: Options: true and false. Selecting this option for non-directional RRBS libraries will screen quality-trimmed sequences for CAA or CGA at the start of the read and, if found, removes the first two base pairs.
--trimLength
- Default: 30
- Comment: Discard reads that became shorter than length 'INT' because of either quality or adapter trimming. A value of 0 effectively disables this behavior.
--qualThreshold
- Default: 30
- Comment: Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal).
--adapOverlap
- Default: 1
- Comment: Stringency for overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1, i.e. even a single base pair of overlapping sequence will be trimmed of the 3' end of any read.
--adaptorSeq
- Default: 'AGATCGGAAGAGC'
- Comment: Adapter sequence to be trimmed. This sequence is the standard Illumina adapter sequence.
--seedLength
- Default: 20
- Comment: Sets the length of the seed substrings to align during multiseed alignment. Smaller values make alignment slower but more sensitive
--seedMismatch
- Default: 0
- Comment: Sets the number of mismatches to be allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity.
--MinInsert
- Default: 0
- Comment: The minimum insert size for valid paired-end alignments. E.g. if -I 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as -X is also satisfied). A 19-bp gap would not be valid in that case.
--MaxInsert
- Default: 1000
- Comment: The maximum insert size for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case
--ref_fa_index
- Default: '/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bismark/bowtie2'
- Comment: Pre-compiled Bismark Bowtie2 index files. Points to human reference when --gen_org human. JAX users should change this parameter only if changing aligner used.
--aligner
- Default: 'bowtie2'
- Comment: Options: bowtie2 and hisat2. Bismark alignment tool. If hisat2 is specified, change ref_fa_index to the appropriate index files.
--skip_deduplication
- Default: true
- Comment: Options: true and false. Default: true. If false Bismark Deduplication will be used. "It is important to note that de-duplication is not recommended for RRBS, amplicon or other target enrichment-type libraries. This step is recommended for only whole-genome bisulfite samples."
--cytosine_report
- Default: false
- Comment: After the conversion to bedGraph has completed, the option --cytosine_report produces a genome-wide methylation report for all cytosines in the genome
--comprehensive
- Default: true
- Comment: Specifying this option will merge all four possible strand-specific methylation info into context-dependent output files.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention	Description
`rrbs_report.html`	Nextflow autogenerated report
`trace.txt`	Nextflow trace of processes
`multiqc_report.html`	The MultiQC report showing all summary statistics found in the `MultiQC` directory
`trimmed_fastq`	Directory containing trimmed FASTQ files. Output from Trim-Galore.
`bam`	Directory containing BAM alignment files. If `skip_deduplication = false` de-duplicated BAM file is also present.
`stats`	Directory containing all summary statistic TXT and HTML files. These files are summarized in the `multiqc_report.html`.
`bismark_results`	Directory containing results from Bismark. See the Bismark Documentation for additional information.

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RRBS Pipeline ReadMe

Reduced-Representation Bisulfite Sequencing (RRBS) Documentation

Reduced Representation Bisulfite Sequencing Pipeline (--workflow rrbs)

RRBS Flowchart

Parameters for RRBS Pipeline

Pipeline Default Outputs

CSV Input Sample Sheet

Basic examples:

An example PE csv file:

An example SE csv file:

Home

Pipeline Documentation

Benchmarking Documentation

Pipeline development and Release Documentation

Clone this wiki locally