RNA Pipeline ReadMe

RNA sequencing Documentation

RNA-Sequencing Pipeline (`--workflow rnaseq`)

•	Step 1: Fastp read quality and adapter trimming   
•	Step 2: RSEM  
•	Step 3: Get Read Group Information  
•	Step 4: Picard Alignment Metrics  
•	Step 5: Summary Stats  
•	Step 6: MultiQC

If PDX (--pdx):

•	Step 1: Fastp read quality and adapter trimming  
•	Step 2: Xengsort human / mouse read disambiguation  
•	Step 3: RSEM on human reads  
•	Step 4: EBV-associated lymphoma classifier  
•	Step 5: Get Read Group Information for human reads  
•	Step 6: Picard Alignment Metrics for human reads  
•	Step 7: Summary Stats for human reads  
•	Step 8: RSEM on mouse reads  
•	Step 9: Get Read Group Information for mouse reads  
•	Step 10: Picard Alignment Metrics for mouse reads  
•	Step 11: Summary Stats for mouse reads  
•	Step 12: MultiQC report generation

RNA - Flowchart

flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p6[RSEM_ALIGNMENT_EXPRESSION]
    o1([Gene Counts]):::output
    o2([Isoform Counts]):::output
    o3([Genome BAM]):::output
    o4([Isoform BAM]):::output
    p8[READ_GROUPS]
    p9[PICARD_ADDORREPLACEREADGROUPS]
    p10[PICARD_REORDERSAM]
    p11[PICARD_SORTSAM]
    p12[PICARD_COLLECTRNASEQMETRICS]

    p19[MULTIQC]
    o10([MultiQC Report]):::output
    p0 -->|Raw Reads| p1
    subgraph top [  ]
    p1 --> p2
    p1 --> p4
    p2 --> j1
    p4 --> j1
    end
    j1 --> p6

    subgraph human [  ]

    p6 --> o1
    p6 --> o2
    p6 --> o3
    p6 --> o4
    o3 --> p8
    p8 --> p9
    p9 --> p10
    p10 --> p11
    end
    
    
    subgraph qc [   ]
    p1 --> p3
    p11 --> p12
    p3 --> p19
    p6 --> p19
    p12 --> p19
    p19 --> o10
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style top stroke:#333,stroke-width:2px
style human stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

RNA - PDX Flowchart

flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p5[XENGSORT_CLASSIFY]
    p6[RSEM_ALIGNMENT_EXPRESSION_HUMAN]
    o1([Human Gene Counts]):::output
    o2([Human Isoform Counts]):::output
    o3([Human Genome BAM]):::output
    o4([Human Isoform BAM]):::output
    p7[LYMPHOMA_CLASSIFIER]
    o5([EBV Classifier Score]):::output
    p8[READ_GROUPS_HUMAN]
    p9[PICARD_ADDORREPLACEREADGROUPS_HUMAN]
    p10[PICARD_REORDERSAM_HUMAN]
    p11[PICARD_SORTSAM_HUMAN]
    p12[PICARD_COLLECTRNASEQMETRICS_HUMAN]
    p13[RSEM_ALIGNMENT_EXPRESSION_MOUSE]
    o6([Mouse Gene Counts]):::output
    o7([Mouse Isoform Counts]):::output
    o8([Mouse Genome BAM]):::output
    o9([Mouse Isoform BAM]):::output
    p14[READ_GROUPS_MOUSE]
    p15[PICARD_ADDORREPLACEREADGROUPS_MOUSE]
    p16[PICARD_REORDERSAM_MOUSE]
    p17[PICARD_SORTSAM_MOUSE]
    p18[PICARD_COLLECTRNASEQMETRICS_MOUSE]
    p19[MULTIQC]
    o10([MultiQC Report]):::output
    
    p0 -->|Raw Reads| p1
    subgraph top [  ]
    p1 --> p2
    p1 --> p4
    p1 --> p5
    p2 --> j1
    p4 --> j1
    end
    j1 --> p6

    subgraph human [  ]
    p5 -- "Xengsort:\nHuman Reads" --> p6
    p6 --> o1
    p6 --> o2
    p6 --> o3
    p6 --> o4
    o1 --> p7
    p7 --> o5
    o3 --> p8
    p8 --> p9
    p9 --> p10
    p10 --> p11

    end
    p5 -- "Xengsort:\nMouse Reads" --> p13
    subgraph mouse [  ]
    j1 --> p13
    p13 --> o6
    p13 --> o7
    p13 --> o8
    p13 --> o9
    o8 --> p14
    p14 --> p15
    p15 --> p16
    p16 --> p17

    end
    subgraph qc [   ]
    p17 --> p18
    p1 --> p3
    p5 --> p19
    p3 --> p19
    p6 --> p19
    p11 --> p12
    p12 --> p19
    p13 --> p19
    p18 --> p19
    p19 --> o10
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style top stroke:#333,stroke-width:2px
style human stroke:#333,stroke-width:2px
style mouse stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

Parameters for RNA-seq Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory where the saved outputs will be stored.
--organize_by
- Default: sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
--cacheDir
- Default: /projects/omics_share/meta/containers
- Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
-w
- Default: /<PATH>
- Comment: The working directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
--sample_folder
- Default: /<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
--extension
- Default: .fastq.gz
- Comment: The expected extension for the input read files.
--pattern
- Default: "*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz.
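To make the default `--pattern` and `--extension` behavior concrete, here is a minimal sketch of how file names could be grouped into R1 / R2 pairs. It is an illustration only: the regular expression is a rough stand-in for the glob `*_R{1,2}*`, and `pair_reads` is a hypothetical helper, not the matcher the pipeline itself uses.

```python
import re
from collections import defaultdict


def pair_reads(filenames, extension=".fastq.gz"):
    """Group file names into {sample_prefix: {"R1": name, "R2": name}}.

    Assumption: the sample prefix is the text before the first "_R1" or "_R2".
    """
    matcher = re.compile(
        r"^(?P<prefix>.+?)_R(?P<read>[12])(?P<rest>.*)" + re.escape(extension) + r"$"
    )
    pairs = defaultdict(dict)
    for name in filenames:
        match = matcher.match(name)
        if match is None:
            continue  # name does not follow the expected pattern
        pairs[match.group("prefix")]["R" + match.group("read")] = name
    return dict(pairs)


example = pair_reads([
    "READ_NAME_R1_MoreText.fastq.gz",
    "READ_NAME_R2_MoreText.fastq.gz",
    "notes.txt",  # ignored: no _R1/_R2 token and wrong extension
])
```

Files that do not carry the `_R1` / `_R2` token, or that use a different extension, are simply skipped in this sketch.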
--read_type
- Default: PE
- Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
--concat_lanes
- Default: false
- Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. fastq_2 is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
--download_data
- Default: null
- Comment: Requires --csv_input. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
--gen_org
- Default: mouse
- Comment: Options: mouse and human.
--genome_build
- Default: GRCm38
- Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
--pdx
- Default: false
- Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
--classifier_table
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/rna_ebv_classifier/EBVlym_classifier_table_48.txt'
- Comment: EBV expected gene signatures used in EBV classifier. Only used when '--pdx' is run.
--ref_fa
- Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
- Comment: Xengsort graft fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
--xengsort_host_fasta
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
- Comment: Xengsort host fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
--xengsort_idx_path
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
- Comment: Xengsort index for deconvolution of human and mouse reads. Used when --pdx is run. If null, Xengsort Index is run using ref_fa and xengsort_host_fasta.
--xengsort_idx_name
- Default: 'hg38_GRCm39-NOD_ShiLtJ'
- Comment: Xengsort index name associated with files located in xengsort_idx_path or name given to outputs produced by Xengsort Index.
--strandedness_ref
- Default: '/projects/compsci/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/kallisto/kallisto_index'
- Comment: Modified kallisto index file used only in strandedness determination.
--strandedness_gtf
- Default: '/projects/compsci/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.gtf'
- Comment: GTF file used with kallisto index file used only in strandedness determination.
--strandedness
- Default: null
- Comment: Library strandedness override. Supported options are 'reverse_stranded' or 'forward_stranded' or 'non_stranded'. This override parameter is only used when the tool check_strandedness fails to classify the strandedness of a sample. If check_strandedness provides a strand determination, that setting is used.
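The precedence described for `--strandedness` can be sketched as follows. This is one reading of the comment above, not the pipeline's code: `resolve_strandedness` is a hypothetical helper, and raising an error when neither a detection nor an override is available is an assumption.

```python
VALID_STRANDEDNESS = {"reverse_stranded", "forward_stranded", "non_stranded"}


def resolve_strandedness(detected, override=None):
    """Return the strandedness to use for a sample.

    detected: value reported by the strandedness check, or None if it failed.
    override: value passed to --strandedness, or None if not supplied.
    """
    if detected is not None:
        return detected  # a successful detection always takes priority
    if override is None:
        # Assumption: with no detection and no override, the run cannot proceed.
        raise ValueError("strandedness undetermined and no --strandedness override given")
    if override not in VALID_STRANDEDNESS:
        raise ValueError(f"unsupported --strandedness value: {override!r}")
    return override
```

For example, `resolve_strandedness("reverse_stranded", "non_stranded")` returns the detected value, while `resolve_strandedness(None, "non_stranded")` falls back to the override.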
--quality_phred
- Default: 15
- Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
--unqualified_perc
- Default: 40
- Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
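As an illustration of how `--quality_phred` and `--unqualified_perc` interact, the sketch below applies both thresholds to the per-base quality scores of a single read. It follows the two parameter descriptions above and is not fastp's implementation; `passes_quality_filter` is a hypothetical name, fastp's other filters (such as read length) are not modeled, and treating an empty read as failing is an assumption.

```python
def passes_quality_filter(base_qualities, quality_phred=15, unqualified_perc=40):
    """Return True if a read survives the quality filter.

    A base is "unqualified" when its phred score is below quality_phred.
    The read passes when the share of unqualified bases does not exceed
    unqualified_perc percent.
    """
    if not base_qualities:
        return False  # assumption: an empty read is treated as failing
    unqualified = sum(1 for q in base_qualities if q < quality_phred)
    # Integer comparison avoids floating-point rounding at the threshold.
    return unqualified * 100 <= unqualified_perc * len(base_qualities)


good_read = [30] * 6 + [10] * 4   # 40% of bases below Q15: allowed
bad_read = [30] * 5 + [10] * 5    # 50% of bases below Q15: rejected
```

With the defaults, `good_read` passes because exactly 40% of its bases are unqualified, while `bad_read` is rejected.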
--detect_adapter_for_pe
- Default: false
- Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but the output is usually slightly cleaner, since overlap analysis may fail due to sequencing errors or adapter dimers.
--rsem_ref_files
- Default: /projects/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/bowtie2
- Comment: Pre-compiled index files. Refers to human indices when --gen_org human. JAX users should not change this, unless using STAR indices.
--rsem_ref_prefix
- Default: 'Mus_musculus.GRCm38.dna.primary_assembly'
- Comment: Prefix for index files. JAX users should not change this, unless using STAR indices. Refers to human indices when --gen_org human.
--seed_length
- Default: 25
- Comment: "Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter."
--rsem_aligner
- Default: 'bowtie2'
- Comment: Options: bowtie2 or star. The aligner algorithm used by RSEM. Note, if using STAR, point rsem_ref_files to STAR based indices.
--merge_rna_counts
- Default: false
- Comment: Options: false, true. If specified, gene and transcript counts are merged across all samples. Typically used in multi-sample cases.
--picard_dict
- Default: '/projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.dict'
- Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
--ref_flat
- Default: '/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.refFlat.txt'
- Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
--ribo_intervals
- Default: '/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.rRNA.interval_list'
- Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
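Several of the parameters above accept only a fixed set of values, and `--download_data` depends on `--csv_input`. The sketch below checks a parameter dictionary against those documented constraints before a run is launched. It is a convenience illustration: `ALLOWED_VALUES` and `check_params` are hypothetical names, and the pipeline's own validation may differ.

```python
# Option sets taken from the parameter descriptions above.
ALLOWED_VALUES = {
    "organize_by": {"sample", "analysis"},
    "read_type": {"PE", "SE"},
    "gen_org": {"mouse", "human"},
    "genome_build": {"GRCm38", "GRCm39", "GRCh38"},
    "rsem_aligner": {"bowtie2", "star"},
}


def check_params(params):
    """Return a list of problems found in params; an empty list means none."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if name in params and params[name] not in allowed:
            problems.append(
                f"--{name} must be one of {sorted(allowed)}, got {params[name]!r}"
            )
    # --download_data is documented as requiring --csv_input.
    if params.get("download_data") and not params.get("csv_input"):
        problems.append("--download_data requires --csv_input")
    return problems


issues = check_params({"read_type": "PE", "gen_org": "mouse", "rsem_aligner": "hisat2"})
```

Here `issues` holds a single message about `--rsem_aligner`, since only `bowtie2` and `star` are listed as options.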

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention	Description
`rnaseq_report.html`	Nextflow autogenerated report
`trace.txt`	Nextflow trace of processes
`multiqc`	MultiQC report summarizing quality metrics across samples in the analysis run.
`/bam/.genome.bam`	Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION
`/bam/.transcript.bam`	Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION
`/stats/.genes.results`	Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION
`/stats/.isoforms.results`	Output of rsem calculate expression module process RSEM_ALIGNMENT_EXPRESSION
`/stats/.fastq.gz_stat`	Statistics output from quality trimming using Jax Trimmer process
`/stats/rsem_aln_.stats`	Statistics output from RSEM_ALIGNMENT_EXPRESSION
`/stats/_read_group.txt`	Read group information from sample processed.

PDX Outputs

NOTE: If --pdx is run, sample output directories vary slightly from above. Three output directories per sample are generated:

<SAMPLE_ID>: with all stats generated for the sample as above with the addition of the following file:

Naming Convention	Description
`*xengsort_log.txt`	Xengsort statistics file

<SAMPLE_ID>_human: with all human-specific quantification (e.g., genes.results) and alignments (e.g., genome.bam) as above.
<SAMPLE_ID>_mouse: with all mouse-specific quantification and outputs.
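For downstream scripts, the three per-sample directories can be derived from the sample ID. The sketch below assumes they sit directly under `--pubdir`, which is not confirmed by this page; `pdx_sample_dirs` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def pdx_sample_dirs(pubdir, sample_id):
    """Return the three output directories expected for one PDX sample.

    Assumption: the directories are direct children of pubdir.
    """
    base = PurePosixPath(pubdir)
    return {
        "stats": base / sample_id,             # shared stats, Xengsort log
        "human": base / f"{sample_id}_human",  # human quantification and BAMs
        "mouse": base / f"{sample_id}_mouse",  # mouse quantification and outputs
    }


dirs = pdx_sample_dirs("/path/to/pubdir", "Sample_42")
```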

Pipeline Optional Outputs

These outputs will only be saved when --keep_intermediate true is specified.

Naming Convention	Description (`--keep_intermediate true`)
`/_read_group.txt`	Read groups for fastq files from READ_GROUPS
`/bam/_genome_bam_with_read_group_reorder.bam`	From PICARD_REORDERSAM
`/bam/_genome_bam_with_read_groups.bam`	From PICARD_ADDORREPLACEREADGROUPS
`/bam/_sortsam.bam`	From PICARD_SORTSAM

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
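The lane handling described above can be sketched as a grouping step: rows that share a sampleID are collected so that their FASTQ files can be concatenated in manifest order. This is a sketch of the idea only, with a hypothetical `group_lanes` helper; the concatenation itself is done by the pipeline and is not shown.

```python
import csv
import io
from collections import defaultdict


def group_lanes(manifest_text):
    """Map each sampleID to its per-lane FASTQ files, in manifest order."""
    samples = defaultdict(lambda: {"fastq_1": [], "fastq_2": []})
    for row in csv.DictReader(io.StringIO(manifest_text)):
        entry = samples[row["sampleID"]]
        entry["fastq_1"].append(row["fastq_1"])
        if row.get("fastq_2"):  # absent or empty for single-end data
            entry["fastq_2"].append(row["fastq_2"])
    return dict(samples)


manifest = """sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
"""
grouped = group_lanes(manifest)
```

In this example `Sample_42` ends up with two read 1 files and two read 2 files, one per lane, while `Sample_101` has one of each.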

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
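A quick pre-flight check of a manifest can catch header typos and tell PE from SE files like the two examples above. The sketch below is one possible check, not part of the pipeline: `infer_read_type` is a hypothetical helper, and rejecting manifests that mix paired and unpaired rows is an assumption.

```python
import csv
import io

EXPECTED_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def infer_read_type(manifest_text):
    """Return "PE" or "SE" for a manifest, or raise ValueError if it looks wrong."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    # Rows with only three fields get fastq_2 = None from DictReader.
    has_read2 = [bool(row.get("fastq_2")) for row in reader]
    if not has_read2:
        raise ValueError("manifest has no sample rows")
    if all(has_read2):
        return "PE"
    if not any(has_read2):
        return "SE"
    # Assumption: a single run should not mix paired and unpaired samples.
    raise ValueError("manifest mixes paired and unpaired rows")


se_manifest = """sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
"""
read_type = infer_read_type(se_manifest)
```

The returned value could then be compared against the `--read_type` setting before the run is submitted.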
