-
Notifications
You must be signed in to change notification settings - Fork 10
ATAC Pipeline ReadMe
• Step 1: Trim FASTQ reads
• Step 2: FASTQC quality statistics
• Step 3: Align to reference
• Step 4: Sort alignments
• Step 5: Filter duplicate reads from alignments
• Step 6: Calculate mitochondrial percentage and remove MT reads.
• Step 7: Filter non-unique and include only 'properly mapped reads' alignments
• Step 8: Run deeptools alignmentSieve to offset reads by +4 bp on + strand, and −5 bp on the – strand. Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp.
• Step 9: Re-sort shifted bam
Mouse - NON-B6 Strains:
• Step 10: Liftover mapped read coordinates using chain file. Sort bam by coordinates, and remove reads that failed to liftover. Sort and Re-index lifted BAM.
Mouse - B6 Strains:
• Step 10: Index BAM file
Human:
• Step 10: Index BAM file
Both:
• Step 11: Peak calling with MACS2
• Step 12: Calculate fraction of reads in peak, peak depths, and final QC metrics.
• Step 13: MultiQC report generation
flowchart TD
p1((Sample))
p2[CUTADAPT]
p3[FASTQC]
p4[ALIGN_TRIMMED_FASTQ]
p5[SORT_ALIGN_TRIM]
p6[PICARD_MARKDUPLICATES]
p7[REMOVE_DUPLICATE_READS]
p8[CALC_MTDNA_FILTER_CHRM]
p9[FILTER_REMOVE_MULTI_SHIFT]
p10[FILTER_REMOVE_MULTI_SIEVE]
p11[SORT_SHIFTED_BAM]
p12[CHAIN_CONVERT]
p13[SORT_LIFTOVER_BAM]
p14[CHAIN_EXTRACT_BADREADS]
p15[CHAIN_BAD2UNIQ_READS]
p16[CHAIN_FILTER_READS]
p17[CHAIN_SORT_FIXMATE_BAM]
p18[NON_CHAIN_REINDEX]
m1[MERGE_REPLICATE_BAMS]
p19[PEAK_CALLING]
p20[BAM_COVERAGE_BIGWIG]
p21[FRIP_READS_IN_PEAKS]
p22[FINAL_CALC_FRIP]
p23[PEAK_COVERAGE]
p24[FEATURE_COUNTS]
p25[FEATURE_COUNT2BED]
p26[PARSE_FRAGMENT_LENGTH]
o27([Fragment Length Plot]):::output
p28[SORT_MARK_DUP_BAM]
p29[CALC_PBC_METRICS]
p30[MULTIQC]
o1([Genomic BAM]):::output
o2([Peak Coverage in BigWig]):::output
o3([Peak Files]):::output
o4([Count Matrix]):::output
o5([Count Matrix in Bed format]):::output
o6([MultiQC Report]):::output
p1 --> p2
subgraph alignment [ ]
p2 --> p4
p4 --> p5
p5 --> p6
p6 --> p7
p7 --> p9
p9 --> p10
p10 --> p11
end
subgraph chain_convert [ ]
p11 -..-> |Non-B6 sample| p12
p12 --> p13
p13 --> p14
p14 --> p15
p15 --> p16
p16 --> p17
end
p11 --> |B6 Samples| p18
p17 -..-> o1
p18 --> o1
o1 -."If --merge_replicate".-> m1
o1 --> p19
o1 --> p20
o1 --> p21
m1 -..-> p19
m1 -..-> p20
m1 -..-> p21
subgraph bigwig [ ]
p20 --> o2
end
subgraph calling [ ]
p19 --> o3
p19 --> p23
p23 --> p24
p24 --> o4
p24 --> p25
p25 --> o5
end
subgraph qc [ ]
p2 --> p3
p7 --> p8
p21 --> p22
p9 --> p26
p26 --> o27
p6 --> p28
p28 --> p29
p3 --> p30
p8 --> p30
p22 --> p30
p26 --> p30
p29 --> p30
o27 --> p30
p30 --> o6
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style chain_convert stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
style bigwig stroke:#333,stroke-width:2px
flowchart TD
p1((Sample))
p2[CUTADAPT]
p3[FASTQC]
p4[ALIGN_TRIMMED_FASTQ]
p5[SORT_ALIGN_TRIM]
p6[PICARD_MARKDUPLICATES]
p7[REMOVE_DUPLICATE_READS]
p8[CALC_MTDNA_FILTER_CHRM]
p9[FILTER_REMOVE_MULTI_SHIFT]
p10[FILTER_REMOVE_MULTI_SIEVE]
p11[SORT_SHIFTED_BAM]
m1[MERGE_REPLICATE_BAMS]
p19[PEAK_CALLING]
p21[FRIP_READS_IN_PEAKS]
p22[FINAL_CALC_FRIP]
p23[PEAK_COVERAGE]
p24[FEATURE_COUNTS]
p26[PARSE_FRAGMENT_LENGTH]
o27([Fragment Length Plot]):::output
p28[SORT_MARK_DUP_BAM]
p29[CALC_PBC_METRICS]
p30[MULTIQC]
o1([Genomic BAM]):::output
o3([Peak Files]):::output
o4([Count Matrix]):::output
o6([MultiQC Report]):::output
p1 --> p2
subgraph alignment [ ]
p2 --> p4
p4 --> p5
p5 --> p6
p6 --> p7
p7 --> p9
p9 --> p10
p10 --> p11
end
p11 --> o1
o1 -."If --merge_replicate".-> m1
m1 -..-> p19
o1 --> p19
m1 -..-> p21
o1 --> p21
subgraph calling [ ]
p19 --> o3
p19 --> p23
p23 --> p24
p24 --> o4
end
subgraph qc [ ]
p2 --> p3
p7 --> p8
p21 --> p22
p9 --> p26
p26 --> o27
p6 --> p28
p28 --> p29
p3 --> p30
p8 --> p30
p22 --> p30
p26 --> p30
p29 --> p30
o27 --> p30
p30 --> o6
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
-
--pubdir
- Default:
/<PATH>
- Comment: The directory that the saved outputs will be stored.
- Default:
-
--organize_by
- Default:
sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
- Default:
-
--cacheDir
- Default:
/projects/omics_share/meta/containers
- Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
- Default:
-
--sample_folder
- Default:
/<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- Default:
-
--extension
- Default:
.fastq.gz
- Comment: The expected extension for the input read files.
- Default:
-
--pattern
- Default:
"*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this
READ_NAME_R1_MoreText.fastq.gz
orREAD_NAME_R1.fastq.gz
- Default:
-
--read_type
- Default:
PE
- Comment: Options:
PE
andSE
. Default:PE
. Type of reads: paired end (PE) or single end (SE).
- Default:
-
--concat_lanes
- Default:
false
- Comment: Options:
false
andtrue
. Default:false
. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- Default:
-
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, *
--download_data
can be specified.
-
--download_data
- Default: null
- Comment: Requires *
--csv_input
. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
-
--merge_replicates
- Default:
false
- Comment: Options:
false
andtrue
. Default:false
. If this boolean is specified, BAM files will be merged prior to peak calling based on thereplicate
field in thecsv_input
file. See notes on CSV format below. This option must be called with--csv_input
.
- Default:
-
--gen_org
- Default:
mouse
- Comment: Options:
mouse
andhuman
.
- Default:
-
--genome_build
- Default:
GRCm38
- Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
- Default:
-
--effective_genome_size
- Default:
2652783500
- Comment: The length of the “mappable” genome. Mouse only - Please see : 'https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html'.
- Default:
-
--chain
- Default:
null
- Comment: Mouse Reference Strain chain file. Needed for final liftover to GRCm38 when running non-B6 samples. See RShiny App (https://rshiny.jax.org/omics_share/) for available values.
- Default:
-
--bowtie2Index
- Default:
'/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bowtie2/Mus_musculus.GRCm38.dna.primary_assembly.fa'
- Comment: Pre-compiled Bowtie2 index file. Default is GRCm38 for mouse and GRCh38 for human. If using pseudo-reference, set appropriate chain file for final result liftover. See RShiny App (https://rshiny.jax.org/omics_share/) for available indicies.
- Default:
-
--bowtieVSensitive
- Default:
true
- Comment: Options:
true
andfalse
. Default:true
. Whentrue
, usevery-sensitive
Bowtie2 alignment setting.
- Default:
-
--bowtieMaxInsert
- Default:
1000
- Comment: The maximum fragment length for valid paired-end alignments.
- Default:
-
--cutadaptMinLength
- Default:
20
- Comment: The minimum length to discard processed reads.
- Default:
-
--cutadaptQualCutoff
- Default:
20
- Comment: The quality cutoff used to trim low-quality ends from reads.
- Default:
-
--cutadaptAdapterR1
- Default:
'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
- Comment: Specification of a 3’ adapter or a linked adapter. Default value is Illumina based adapter.
- Default:
-
--cutadaptAdapterR2
- Default:
'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
- Comment: Specification of a 5’ adapter or a linked adapter.
- Default:
-
--tmpdir
- Default:
'/fastscratch/${USER}'
- Comment: Temporary directory to store intermediate files generated outside of the standard Nextflow cache location.
- Default:
NOTE: *
Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
atac_report.html |
Nextflow autogenerated report. |
trace.txt |
Nextflow trace of processes. |
multiqc |
MultiQC report summarizing quality metrics across samples in the analysis run. |
*.bigwig |
ATAC Peaks in BigWig format. |
*countMatrix.mm10.bed |
Mouse only, the ATAC peak count matrix in true BED format, in GRCm38 (mm10) coordinates. |
*countMatrix.txt |
Mouse: Count matrix in featureCounts output format, in GRCm38 (mm10) coordinates. Human: Count matrix in featureCounts output format, in GRCh38 (hg38) coordinates. |
*narrowPeak |
Narrow Peak file from MACS2. See http://genome.ucsc.edu/FAQ/FAQformat.html#format12 for formatting. |
*summits.bed |
Summits file from MACS2. A BED file which contains the peak summits locations for every peak. The 5th column in this file is the same as what is in the narrowPeak file. |
summary_QC_metrics.txt |
A summary of QC metrics. |
bam |
Directory containing alignments pre- and post-read shift. |
stats |
Directory containing all individual stats files output by the pipeline. |
The required input header is: sampleID,lane,fastq_1,fastq_2
. Samples can be provided either paired or un-paired.
- The
sampleID
column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID. - The
lane
column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis. - The
fastq_1
andfastq_2
columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
Note: Samples are first concatenated across lanes based on SampleID
and replicate
. If replicate
is not assigned, concatenation will be done on sampleID
alone. Replicate merging is then done post alignment across matching sampleID
s
sampleID,lane,fastq_1,fastq_2,replicate
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz,A
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz,A
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz,B
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz,B
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz,A
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz,
In the above example, Sample_42_A
would be concatenated across 2 lanes, Sample_42_B
would be concatenated across 2 lanes.
Following alignemnt, and post-alignment filtering Sample_42_A
and Sample_42_B
would be merged into Sample_42
.
Sample_101_A
and Sample_10191
would be treated independently.