RNA Pipeline ReadMe
• Step 1: Fastp read quality and adapter trimming
• Step 2: RSEM
• Step 3: Get Read Group Information
• Step 4: Picard Alignment Metrics
• Step 5: Summary Stats
• Step 6: MultiQC
If PDX (`--pdx`):
• Step 1: Fastp read quality and adapter trimming
• Step 2: Xengsort human / mouse read disambiguation
• Step 2: RSEM on human reads
• Step 3: EBV-associated lymphoma classifier
• Step 4: Get Read Group Information for human reads
• Step 5: Picard Alignment Metrics for human reads
• Step 6: Summary Stats for human reads
• Step 2: RSEM on mouse reads
• Step 3: Get Read Group Information for mouse reads
• Step 4: Picard Alignment Metrics for mouse reads
• Step 5: Summary Stats for mouse reads
• Step 7: MultiQC report generation
```mermaid
flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p6[RSEM_ALIGNMENT_EXPRESSION]
    o1([Gene Counts]):::output
    o2([Isoform Counts]):::output
    o3([Genome BAM]):::output
    o4([Isoform BAM]):::output
    p8[READ_GROUPS]
    p9[PICARD_ADDORREPLACEREADGROUPS]
    p10[PICARD_REORDERSAM]
    p11[PICARD_SORTSAM]
    p12[PICARD_COLLECTRNASEQMETRICS]
    p19[MULTIQC]
    o10([MultiQC Report]):::output
    p0 -->|Raw Reads| p1
    subgraph top [ ]
        p1 --> p2
        p1 --> p4
        p2 --> j1
        p4 --> j1
    end
    j1 --> p6
    subgraph human [ ]
        p6 --> o1
        p6 --> o2
        p6 --> o3
        p6 --> o4
        o3 --> p8
        p8 --> p9
        p9 --> p10
        p10 --> p11
    end
    subgraph qc [ ]
        p1 --> p3
        p11 --> p12
        p3 --> p19
        p6 --> p19
        p12 --> p19
        p19 --> o10
    end
    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
    style top stroke:#333,stroke-width:2px
    style human stroke:#333,stroke-width:2px
    style qc stroke:#333,stroke-width:2px
```
```mermaid
flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p5[XENGSORT_CLASSIFY]
    p6[RSEM_ALIGNMENT_EXPRESSION_HUMAN]
    o1([Human Gene Counts]):::output
    o2([Human Isoform Counts]):::output
    o3([Human Genome BAM]):::output
    o4([Human Isoform BAM]):::output
    p7[LYMPHOMA_CLASSIFIER]
    o5([EBV Classifier Score]):::output
    p8[READ_GROUPS_HUMAN]
    p9[PICARD_ADDORREPLACEREADGROUPS_HUMAN]
    p10[PICARD_REORDERSAM_HUMAN]
    p11[PICARD_SORTSAM_HUMAN]
    p12[PICARD_COLLECTRNASEQMETRICS_HUMAN]
    p13[RSEM_ALIGNMENT_EXPRESSION_MOUSE]
    o6([Mouse Gene Counts]):::output
    o7([Mouse Isoform Counts]):::output
    o8([Mouse Genome BAM]):::output
    o9([Mouse Isoform BAM]):::output
    p14[READ_GROUPS_MOUSE]
    p15[PICARD_ADDORREPLACEREADGROUPS_MOUSE]
    p16[PICARD_REORDERSAM_MOUSE]
    p17[PICARD_SORTSAM_MOUSE]
    p18[PICARD_COLLECTRNASEQMETRICS_MOUSE]
    p19[MULTIQC]
    o10([MultiQC Report]):::output
    p0 -->|Raw Reads| p1
    subgraph top [ ]
        p1 --> p2
        p1 --> p4
        p1 --> p5
        p2 --> j1
        p4 --> j1
    end
    j1 --> p6
    subgraph human [ ]
        p5 -- "Xengsort:\nHuman Reads" --> p6
        p6 --> o1
        p6 --> o2
        p6 --> o3
        p6 --> o4
        o1 --> p7
        p7 --> o5
        o3 --> p8
        p8 --> p9
        p9 --> p10
        p10 --> p11
    end
    p5 -- "Xengsort:\nMouse Reads" --> p13
    subgraph mouse [ ]
        j1 --> p13
        p13 --> o6
        p13 --> o7
        p13 --> o8
        p13 --> o9
        o8 --> p14
        p14 --> p15
        p15 --> p16
        p16 --> p17
    end
    subgraph qc [ ]
        p17 --> p18
        p1 --> p3
        p5 --> p19
        p3 --> p19
        p6 --> p19
        p11 --> p12
        p12 --> p19
        p13 --> p19
        p18 --> p19
        p19 --> o10
    end
    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
    style top stroke:#333,stroke-width:2px
    style human stroke:#333,stroke-width:2px
    style mouse stroke:#333,stroke-width:2px
    style qc stroke:#333,stroke-width:2px
```
- `--pubdir`
  - Default: `/<PATH>`
  - Comment: The directory where saved outputs will be stored.
- `--organize_by`
  - Default: `sample`
  - Comment: How to organize the output folder structure. Options: `sample` or `analysis`.
- `--cacheDir`
  - Default: `/projects/omics_share/meta/containers`
  - Comment: The directory that contains cached Singularity containers. JAX users should not change this parameter.
- `-w`
  - Default: `/<PATH>`
  - Comment: The working directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on `/fastscratch` or another directory with ample storage.
- `--sample_folder`
  - Default: `/<PATH>`
  - Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- `--extension`
  - Default: `.fastq.gz`
  - Comment: The expected extension for the input read files.
- `--pattern`
  - Default: `"*_R{1,2}*"`
  - Comment: The expected R1 / R2 matching pattern. The default value matches reads with names like `READ_NAME_R1_MoreText.fastq.gz` or `READ_NAME_R1.fastq.gz`.
- `--read_type`
  - Default: `PE`
  - Comment: Type of reads: paired end (`PE`) or single end (`SE`). Options: `PE` and `SE`.
- `--concat_lanes`
  - Default: `false`
  - Comment: Options: `false` and `true`. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- `--csv_input`
  - Default: `null`
  - Comment: Provide a CSV manifest file with the header `sampleID,lane,fastq_1,fastq_2`. See below for an example file. `fastq_2` is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, `--download_data` can be specified.
- `--download_data`
  - Default: `null`
  - Comment: Requires `--csv_input`. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
- `--gen_org`
  - Default: `mouse`
  - Comment: Options: `mouse` and `human`.
- `--genome_build`
  - Default: `GRCm38`
  - Comment: Mouse specific. Options: `GRCm38` or `GRCm39`. If `gen_org == human`, the build defaults to `GRCh38`.
- `--pdx`
  - Default: `false`
  - Comment: Options: `false`, `true`. If specified, Xengsort is run on reads to deconvolute human and mouse reads. Human-only reads are used in analysis.
- `--classifier_table`
  - Default: `'/projects/compsci/omics_share/human/GRCh38/supporting_files/rna_ebv_classifier/EBVlym_classifier_table_48.txt'`
  - Comment: EBV expected gene signatures used in the EBV classifier. Only used when `--pdx` is run.
- `--ref_fa`
  - Default: `'/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'`
  - Comment: Xengsort graft FASTA file. Used by Xengsort Index when `--pdx` is run and `xengsort_idx_path` is `null` or `false`.
- `--xengsort_host_fasta`
  - Default: `'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'`
  - Comment: Xengsort host FASTA file. Used by Xengsort Index when `--pdx` is run and `xengsort_idx_path` is `null` or `false`.
- `--xengsort_idx_path`
  - Default: `'/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'`
  - Comment: Xengsort index for deconvolution of human and mouse reads. Used when `--pdx` is run. If `null`, Xengsort Index is run using `ref_fa` and `xengsort_host_fasta`.
- `--xengsort_idx_name`
  - Default: `'hg38_GRCm39-NOD_ShiLtJ'`
  - Comment: Xengsort index name associated with the files located in `xengsort_idx_path`, or the name given to outputs produced by Xengsort Index.
- `--strandedness_ref`
  - Default: `'/projects/compsci/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/kallisto/kallisto_index'`
  - Comment: Modified kallisto index file used only in strandedness determination.
- `--strandedness_gtf`
  - Default: `'/projects/compsci/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.gtf'`
  - Comment: GTF file paired with the kallisto index file. Used only in strandedness determination.
- `--strandedness`
  - Default: `null`
  - Comment: Library strandedness override. Supported options are `'reverse_stranded'`, `'forward_stranded'`, or `'non_stranded'`. This override is only used when the tool `check_strandedness` fails to classify the strandedness of a sample. If `check_strandedness` provides a strand determination, that setting is used.
- `--quality_phred`
  - Default: `15`
  - Comment: The minimum quality value a base needs in order to pass. The default of 15 corresponds to a Phred quality score of >=Q15.
- `--unqualified_perc`
  - Default: `40`
  - Comment: Percentage of bases that are allowed to be unqualified (0~100). The default of 40 means 40%.
- `--detect_adapter_for_pe`
  - Default: `false`
  - Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; `--detect_adapter_for_pe` enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
- `--rsem_ref_files`
  - Default: `/projects/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/bowtie2`
  - Comment: Pre-compiled index files. Refers to human indices when `--gen_org human`. JAX users should not change this, unless using `STAR` indices.
- `--rsem_ref_prefix`
  - Default: `'Mus_musculus.GRCm38.dna.primary_assembly'`
  - Comment: Prefix for index files. JAX users should not change this, unless using `STAR` indices. Refers to human indices when `--gen_org human`.
- `--seed_length`
  - Default: `25`
  - Comment: "Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter."
- `--rsem_aligner`
  - Default: `'bowtie2'`
  - Comment: Options: `bowtie2` or `star`. The aligner algorithm used by RSEM. Note: if using STAR, point `rsem_ref_files` to STAR-based indices.
- `--merge_rna_counts`
  - Default: `false`
  - Comment: Options: `false`, `true`. If specified, gene and transcript counts are merged across all samples. Typically used in multi-sample cases.
- `--picard_dict`
  - Default: `'/projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.dict'`
  - Comment: The coverage metric calculation step requires this file. Refers to the human assembly when `--gen_org human`. JAX users should not change this parameter.
- `--ref_flat`
  - Default: `'/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.refFlat.txt'`
  - Comment: The coverage metric calculation step requires this file. Refers to the human assembly when `--gen_org human`. JAX users should not change this parameter.
- `--ribo_intervals`
  - Default: `'/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.rRNA.interval_list'`
  - Comment: The coverage metric calculation step requires this file. Refers to the human assembly when `--gen_org human`. JAX users should not change this parameter.
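
Two of the rules described in the parameter list can be illustrated with a short Python sketch: the read-level quality filter implied by `--quality_phred` and `--unqualified_perc`, and the `--strandedness` fallback. This is not the pipeline's code. The function names are hypothetical, Phred+33 quality encoding is assumed, the "fail only when the limit is exceeded" comparison is an assumption about the trimmer's behavior, and raising an error when neither a detected value nor an override is available is also an assumption.

```python
from typing import Optional

# Override values listed for --strandedness in the parameter list.
VALID_STRANDEDNESS = {"reverse_stranded", "forward_stranded", "non_stranded"}


def read_passes_quality_filter(quality_string: str,
                               quality_phred: int = 15,
                               unqualified_perc: float = 40.0) -> bool:
    """Return True if a read would be kept under the two thresholds.

    Hypothetical helper; assumes Phred+33 encoded quality characters.
    """
    if not quality_string:
        return False  # assumption: an empty read never passes
    # A base counts as "unqualified" when its Phred score is below quality_phred.
    unqualified = sum(1 for ch in quality_string if ord(ch) - 33 < quality_phred)
    percent_unqualified = 100.0 * unqualified / len(quality_string)
    # Assumption: the read fails only when the allowed percentage is exceeded.
    return percent_unqualified <= unqualified_perc


def resolve_strandedness(detected: Optional[str], override: Optional[str]) -> str:
    """Prefer the detected strandedness; use the override only as a fallback."""
    if detected is not None:
        return detected
    if override is None:
        # Assumption: with no detection and no override, there is nothing to use.
        raise ValueError("strandedness could not be determined and no override was given")
    if override not in VALID_STRANDEDNESS:
        raise ValueError(f"unsupported strandedness override: {override!r}")
    return override
```

For example, `resolve_strandedness(None, "reverse_stranded")` falls back to the override, while `resolve_strandedness("non_stranded", "reverse_stranded")` keeps the detected value.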
NOTE: `*` represents a wildcard that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
`rnaseq_report.html` | Nextflow autogenerated report |
`trace.txt` | Nextflow trace of processes |
`multiqc` | MultiQC report summarizing quality metrics across samples in the analysis run. |
`*/bam/*.genome.bam` | Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION |
`*/bam/*.transcript.bam` | Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION |
`*/stats/*.genes.results` | Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION |
`*/stats/*.isoforms.results` | Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION |
`*/stats/*.fastq.gz_stat` | Statistics output from quality trimming using the JAX Trimmer process |
`*/stats/rsem_aln_*.stats` | Statistics output from RSEM_ALIGNMENT_EXPRESSION |
`*/stats/*_read_group.txt` | Read group information from sample processed. |
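
The wildcard naming conventions in the table above map directly onto glob patterns, so a small Python sketch can gather the main outputs of a run. The function name `collect_outputs` and the dictionary labels are hypothetical, and the sketch assumes the leading `*` in each pattern corresponds to exactly one directory level beneath the `--pubdir` location.

```python
from pathlib import Path
from typing import Dict, List

# Patterns copied from the naming-convention table; the labels are illustrative.
OUTPUT_PATTERNS = {
    "genome_bam": "*/bam/*.genome.bam",
    "transcript_bam": "*/bam/*.transcript.bam",
    "gene_results": "*/stats/*.genes.results",
    "isoform_results": "*/stats/*.isoforms.results",
}


def collect_outputs(pubdir: str) -> Dict[str, List[Path]]:
    """Return the files under pubdir that match each documented pattern."""
    root = Path(pubdir)
    return {
        label: sorted(root.glob(pattern))
        for label, pattern in OUTPUT_PATTERNS.items()
    }
```

Calling `collect_outputs("/<PATH>")` with the directory passed to `--pubdir` returns one sorted list of paths per output type; an empty list means no file matched that pattern.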
NOTE: If `--pdx` is run, sample output directories vary slightly from the above. Three output directories are generated per sample:

- `<SAMPLE_ID>`: all `stats` generated for the sample as above, with the addition of the following file:

  Naming Convention | Description |
  ---|---|
  `*xengsort_log.txt` | Xengsort statistics file |

- `<SAMPLE_ID>_human`: all human-specific quantification (e.g., `genes.results`) and alignments (e.g., `genome.bam`) as above.
- `<SAMPLE_ID>_mouse`: all mouse-specific quantification and outputs.
These outputs will only be saved when `--keep_intermediate true` is specified.
Naming Convention | Description (`--keep_intermediate true`) |
---|---|
`*/*_read_group.txt` | Read groups for fastq files from READ_GROUPS |
`*/bam/*_genome_bam_with_read_group_reorder.bam` | From PICARD_REORDERSAM |
`*/bam/*_genome_bam_with_read_groups.bam` | From PICARD_ADDORREPLACEREADGROUPS |
`*/bam/*_sortsam.bam` | From PICARD_SORTSAM |
The required input header is: `sampleID,lane,fastq_1,fastq_2`. Samples can be provided either paired or un-paired.

- The `sampleID` column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
- The `lane` column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
- The `fastq_1` and `fastq_2` columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run. `fastq_2` is optional and used only for PE data.
Paired-end example:

```
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
```

Single-end example:

```
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
```
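
To check a manifest before launching a run, something along the lines of the following Python sketch may help. It is only an illustration of the rules described above (required header, optional `fastq_2`, several lanes per sample ID), not the pipeline's own parser; the function name `read_manifest` and the returned structure are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Optional

REQUIRED_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def read_manifest(text: str) -> Dict[str, List[Dict[str, Optional[str]]]]:
    """Parse manifest text and group lane entries by sample ID."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != REQUIRED_HEADER:
        raise ValueError(f"expected header {','.join(REQUIRED_HEADER)}, got {header}")
    samples: Dict[str, List[Dict[str, Optional[str]]]] = defaultdict(list)
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 3:
            raise ValueError(f"row needs at least sampleID, lane, fastq_1: {row}")
        sample_id, lane, fastq_1 = (cell.strip() for cell in row[:3])
        # fastq_2 is optional: absent or empty for single-end data.
        fastq_2 = row[3].strip() if len(row) > 3 and row[3].strip() else None
        samples[sample_id].append(
            {"lane": lane, "fastq_1": fastq_1, "fastq_2": fastq_2}
        )
    return dict(samples)
```

A sample ID that maps to more than one entry in the result is one whose lanes would be concatenated prior to analysis.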