Skip to content

ATAC Pipeline ReadMe

MikeWLloyd edited this page Sep 26, 2024 · 18 revisions

Assay for Transposase-Accessible Chromatin Sequencing (ATAC) Documentation

ATAC Sequencing Pipeline (--workflow atac)

•	Step 1: Trim FASTQ reads   
•	Step 2: FASTQC quality statistics    
•	Step 3: Align to reference    
•	Step 4: Sort alignments    
•	Step 5: Filter duplicate reads from alignments    
•	Step 6: Calculate mitochondrial percentage and remove MT reads.    
•	Step 7: Filter non-unique and include only 'properly mapped reads' alignments    
•	Step 8: Run deeptools alignmentSieve to offset reads by +4 bp on + strand, and −5 bp on the – strand. Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp.    
•	Step 9: Re-sort shifted bam    

Mouse - NON-B6 Strains:

•	Step 10: Liftover mapped read coordinates using chain file. Sort bam by coordinates, and remove reads that failed to liftover. Sort and Re-index lifted BAM.   

Mouse - B6 Strains:

•	Step 10: Index BAM file  

Human:

•	Step 10: Index BAM file  

Both:

•	Step 11: Peak calling with MACS2   
•	Step 12: Calculate fraction of reads in peak, peak depths, and final QC metrics.
•	Step 13: MultiQC report generation    

ATAC - Mouse Flowchart

flowchart TD
	p1((Sample))

    p2[CUTADAPT]
    p3[FASTQC]
    p4[ALIGN_TRIMMED_FASTQ]

    p5[SORT_ALIGN_TRIM]
    p6[PICARD_MARKDUPLICATES]

    p7[REMOVE_DUPLICATE_READS]

    p8[CALC_MTDNA_FILTER_CHRM]

    p9[FILTER_REMOVE_MULTI_SHIFT]
    p10[FILTER_REMOVE_MULTI_SIEVE]

    p11[SORT_SHIFTED_BAM]
    p12[CHAIN_CONVERT]

    p13[SORT_LIFTOVER_BAM]
    p14[CHAIN_EXTRACT_BADREADS]

    p15[CHAIN_BAD2UNIQ_READS]

    p16[CHAIN_FILTER_READS]

    p17[CHAIN_SORT_FIXMATE_BAM]
    p18[NON_CHAIN_REINDEX]

    m1[MERGE_REPLICATE_BAMS]

    p19[PEAK_CALLING]

    p20[BAM_COVERAGE_BIGWIG]

    p21[FRIP_READS_IN_PEAKS]

    p22[FINAL_CALC_FRIP]
    p23[PEAK_COVERAGE]

    p24[FEATURE_COUNTS]
    p25[FEATURE_COUNT2BED]

    p26[PARSE_FRAGMENT_LENGTH]
    o27([Fragment Length Plot]):::output

    p28[SORT_MARK_DUP_BAM]
    p29[CALC_PBC_METRICS]

    p30[MULTIQC]
 
    o1([Genomic BAM]):::output
    o2([Peak Coverage in BigWig]):::output
    o3([Peak Files]):::output
    o4([Count Matrix]):::output
    o5([Count Matrix in Bed format]):::output
    o6([MultiQC Report]):::output

    p1 --> p2

    subgraph alignment [  ]
    p2 --> p4
    p4 --> p5
    p5 --> p6
    p6 --> p7 
    p7 --> p9
    p9 --> p10
    p10 --> p11

    end

    subgraph chain_convert [  ]
    p11 -..-> |Non-B6 sample| p12
    p12 --> p13
    p13 --> p14
    p14 --> p15
    p15 --> p16
    p16 --> p17
    end

    p11 --> |B6 Samples| p18

    p17 -..-> o1
    p18 --> o1
    o1 -."If --merge_replicate".-> m1
    o1 --> p19
    o1 --> p20
    o1 --> p21
    m1 -..-> p19
    m1 -..-> p20
    m1 -..-> p21

    subgraph bigwig [  ]
    p20 --> o2
    end

    subgraph calling [   ]
    p19 --> o3
    p19 --> p23
    p23 --> p24
    p24 --> o4
    p24 --> p25
    p25 --> o5
    end


    subgraph qc [  ]
    p2 --> p3
    p7 --> p8
    p21 --> p22
    p9 --> p26
    p26 --> o27
    p6 --> p28
    p28 --> p29
    p3 --> p30
    p8 --> p30
    p22 --> p30
    p26 --> p30
    p29 --> p30
    o27 --> p30
    p30 --> o6
    end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style chain_convert stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
style bigwig stroke:#333,stroke-width:2px
Loading

ATAC - Human Flowchart

flowchart TD
	p1((Sample))

    p2[CUTADAPT]
    p3[FASTQC]
    p4[ALIGN_TRIMMED_FASTQ]

    p5[SORT_ALIGN_TRIM]
    p6[PICARD_MARKDUPLICATES]

    p7[REMOVE_DUPLICATE_READS]

    p8[CALC_MTDNA_FILTER_CHRM]

    p9[FILTER_REMOVE_MULTI_SHIFT]
    p10[FILTER_REMOVE_MULTI_SIEVE]

    p11[SORT_SHIFTED_BAM]
    
    m1[MERGE_REPLICATE_BAMS]

    p19[PEAK_CALLING]

    p21[FRIP_READS_IN_PEAKS]

    p22[FINAL_CALC_FRIP]
    p23[PEAK_COVERAGE]

    p24[FEATURE_COUNTS]

    p26[PARSE_FRAGMENT_LENGTH]
    o27([Fragment Length Plot]):::output

    p28[SORT_MARK_DUP_BAM]
    p29[CALC_PBC_METRICS]

    p30[MULTIQC]
 
    o1([Genomic BAM]):::output
    o3([Peak Files]):::output
    o4([Count Matrix]):::output
    o6([MultiQC Report]):::output

    p1 --> p2

    subgraph alignment [  ]
    p2 --> p4
    p4 --> p5
    p5 --> p6
    p6 --> p7 
    p7 --> p9
    p9 --> p10
    p10 --> p11

    end



    p11 --> o1
    o1 -."If --merge_replicate".-> m1
    m1 -..-> p19
    o1 --> p19
    m1 -..-> p21
    o1 --> p21



    subgraph calling [   ]
    p19 --> o3
    p19 --> p23
    p23 --> p24
    p24 --> o4
    end


    subgraph qc [  ]
    p2 --> p3
    p7 --> p8
    p21 --> p22
    p9 --> p26
    p26 --> o27
    p6 --> p28
    p28 --> p29
    p3 --> p30
    p8 --> p30
    p22 --> p30
    p26 --> p30
    p29 --> p30
    o27 --> p30
    p30 --> o6
    end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
Loading

Parameters for ATAC Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, * --download_data can be specified.
  • --download_data

    • Default: null
    • Comment: Requires * --csv_input. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
  • --merge_replicates

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, BAM files will be merged prior to peak calling based on the replicate field in the csv_input file. See notes on CSV format below. This option must be called with --csv_input.
  • --gen_org

    • Default: mouse
    • Comment: Options: mouse and human.
  • --genome_build

    • Default: GRCm38
    • Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
  • --effective_genome_size

  • --chain

    • Default: null
    • Comment: Mouse Reference Strain chain file. Needed for final liftover to GRCm38 when running non-B6 samples. See RShiny App (https://rshiny.jax.org/omics_share/) for available values.
  • --bowtie2Index

    • Default: '/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bowtie2/Mus_musculus.GRCm38.dna.primary_assembly.fa'
    • Comment: Pre-compiled Bowtie2 index file. Default is GRCm38 for mouse and GRCh38 for human. If using pseudo-reference, set appropriate chain file for final result liftover. See RShiny App (https://rshiny.jax.org/omics_share/) for available indicies.
  • --bowtieVSensitive

    • Default: true
    • Comment: Options: true and false. Default: true. When true, use very-sensitive Bowtie2 alignment setting.
  • --bowtieMaxInsert

    • Default: 1000
    • Comment: The maximum fragment length for valid paired-end alignments.
  • --cutadaptMinLength

    • Default: 20
    • Comment: The minimum length to discard processed reads.
  • --cutadaptQualCutoff

    • Default: 20
    • Comment: The quality cutoff used to trim low-quality ends from reads.
  • --cutadaptAdapterR1

    • Default: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
    • Comment: Specification of a 3’ adapter or a linked adapter. Default value is Illumina based adapter.
  • --cutadaptAdapterR2

    • Default: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
    • Comment: Specification of a 5’ adapter or a linked adapter.
  • --tmpdir

    • Default: '/fastscratch/${USER}'
    • Comment: Temporary directory to store intermediate files generated outside of the standard Nextflow cache location.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
atac_report.html Nextflow autogenerated report.
trace.txt Nextflow trace of processes.
multiqc MultiQC report summarizing quality metrics across samples in the analysis run.
*.bigwig ATAC Peaks in BigWig format.
*countMatrix.mm10.bed Mouse only, the ATAC peak count matrix in true BED format, in GRCm38 (mm10) coordinates.
*countMatrix.txt Mouse: Count matrix in featureCounts output format, in GRCm38 (mm10) coordinates. Human: Count matrix in featureCounts output format, in GRCh38 (hg38) coordinates.
*narrowPeak Narrow Peak file from MACS2. See http://genome.ucsc.edu/FAQ/FAQformat.html#format12 for formatting.
*summits.bed Summits file from MACS2. A BED file which contains the peak summits locations for every peak. The 5th column in this file is the same as what is in the narrowPeak file.
summary_QC_metrics.txt A summary of QC metrics.
bam Directory containing alignments pre- and post-read shift.
stats Directory containing all individual stats files output by the pipeline.

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz

An example PE csv file with replicate information used when --merge_replicates is specified:

Note: Samples are first concatenated across lanes based on SampleID and replicate. If replicate is not assigned, concatenation will be done on sampleID alone. Replicate merging is then done post alignment across matching sampleIDs

sampleID,lane,fastq_1,fastq_2,replicate
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz,A
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz,A
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz,B
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz,B
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz,A
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz,

In the above example, Sample_42_A would be concatenated across 2 lanes, Sample_42_B would be concatenated across 2 lanes.
Following alignemnt, and post-alignment filtering Sample_42_A and Sample_42_B would be merged into Sample_42.
Sample_101_A and Sample_10191 would be treated independently.

Clone this wiki locally