Skip to content

RRBS Pipeline ReadMe

MikeWLloyd edited this page Apr 11, 2024 · 16 revisions

Reduced-Representation Bisulfite Sequencing (RRBS) Documentation

Reduced Representation Bisulfite Sequencing Pipeline (--workflow rrbs)

•	Step 1: FASTQC quality statistics  
•	Step 2: Read trimming with `Trim-galore` and trimmed read FASTQC quality statistics   
•	Step 3: Bismark alignment with either Bowtie2 or Hisat2  
•	Step 4: Bismark de-duplication [OPTIONAL]  
•	Step 5: Bismark methylation calling  
•	Step 6: Bismark methylation extraction   
•	Step 7: MultiQC report generation   

RRBS Flowchart

flowchart TD
	p1((Sample))
    p2[FASTQC]
    p3[TRIM_GALORE]
    p4[BISMARK_ALIGNMENT]
    p5[SAMTOOLS_SORT]
    p6[SAMTOOLS_INDEX]
    p7[BISMARK_DEDUPLICATION]
    p8[BISMARK_METHYLATION_EXTRACTION]
    p9[MULTIQC]
    o1([Genomic BAM]):::output
    o2([Bismark Output]):::output
    o3([MultiQC Report]):::output

    p1 --> |Raw Reads| p2
    p1 --> |Raw Reads| p3
    subgraph calling [  ]

        p3 --> p4
        p4 --> p5
        p5 --> p6
        p6 --> o1
        o1 -..-> |Deduplication used | p7
        p7 -..->  |Deduplication used | p8
        o1 --> |No deduplication used: Default| p8
        p8 --> o2
    end

    subgraph qc [  ]
    p2 --> p9
    p3 --> p9
    p4 --> p9
    p7 -..-> p9
    p8 --> p9
    p9 --> o3
    end

style qc stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
Loading

Parameters for RRBS Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, * --download_data can be specified.
  • --download_data

    • Default: null
    • Comment: Requires * --csv_input. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
  • --gen_org

    • Default: mouse
    • Comment: Options: mouse and human.
  • --genome_build

    • Default: GRCm38
    • Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
  • --non_directional

    • Default: true
    • Comment: Options: true and false. Selecting this option for non-directional RRBS libraries will screen quality-trimmed sequences for CAA or CGA at the start of the read and, if found, removes the first two base pairs.
  • --trimLength

    • Default: 30
    • Comment: Discard reads that became shorter than length 'INT' because of either quality or adapter trimming. A value of 0 effectively disables this behavior.
  • --qualThreshold

    • Default: 30
    • Comment: Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal).
  • --adapOverlap

    • Default: 1
    • Comment: Stringency for overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1, i.e. even a single base pair of overlapping sequence will be trimmed of the 3' end of any read.
  • --adaptorSeq

    • Default: 'AGATCGGAAGAGC'
    • Comment: Adapter sequence to be trimmed. This sequence is the standard Illumina adapter sequence.
  • --seedLength

    • Default: 20
    • Comment: Sets the length of the seed substrings to align during multiseed alignment. Smaller values make alignment slower but more sensitive
  • --seedMismatch

    • Default: 0
    • Comment: Sets the number of mismatches to be allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity.
  • --MinInsert

    • Default: 0
    • Comment: The minimum insert size for valid paired-end alignments. E.g. if -I 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as -X is also satisfied). A 19-bp gap would not be valid in that case.
  • --MaxInsert

    • Default: 1000
    • Comment: The maximum insert size for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case
  • --ref_fa_index

    • Default: '/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bismark/bowtie2'
    • Comment: Pre-compiled Bismark Bowtie2 index files. Points to human reference when --gen_org human. JAX users should change this parameter only if changing aligner used.
  • --aligner

    • Default: 'bowtie2'
    • Comment: Options: bowtie2 and hisat2. Bismark alignment tool. If hisat2 is specified, change ref_fa_index to the appropriate index files.
  • --skip_deduplication

    • Default: true
    • Comment: Options: true and false. Default: true. If false Bismark Deduplication will be used. "It is important to note that de-duplication is not recommended for RRBS, amplicon or other target enrichment-type libraries. This step is recommended for only whole-genome bisulfite samples."
  • --cytosine_report

    • Default: false
    • Comment: After the conversion to bedGraph has completed, the option --cytosine_report produces a genome-wide methylation report for all cytosines in the genome
  • --comprehensive

    • Default: true
    • Comment: Specifying this option will merge all four possible strand-specific methylation info into context-dependent output files.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
rrbs_report.html Nextflow autogenerated report
trace.txt Nextflow trace of processes
multiqc_report.html The MultiQC report showing all summary statistics found in the MultiQC directory
trimmed_fastq Directory containing trimmed FASTQ files. Output from Trim-Galore.
bam Directory containing BAM alignment files. If skip_deduplication = false de-duplicated BAM file is also present.
stats Directory containing all summary statistic TXT and HTML files. These files are summarized in the multiqc_report.html.
bismark_results Directory containing results from Bismark. See the Bismark Documentation for additional information.

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
Clone this wiki locally