RNA Pipeline ReadMe

MikeWLloyd edited this page Jul 3, 2024 · 31 revisions

RNA sequencing Documentation

RNA-Sequencing Pipeline (--workflow rnaseq)

•	Step 1: Fastp read quality and adapter trimming   
•	Step 2: RSEM  
•	Step 3: Get Read Group Information  
•	Step 4: Picard Alignment Metrics  
•	Step 5: Summary Stats  
•	Step 6: MultiQC 

If PDX (--pdx):

•	Step 1: Fastp read quality and adapter trimming
•	Step 2: Xengsort human / mouse read disambiguation
•	Step 3: RSEM on human reads
•	Step 4: EBV-associated lymphoma classifier
•	Step 5: Get Read Group Information for human reads
•	Step 6: Picard Alignment Metrics for human reads
•	Step 7: Summary Stats for human reads
•	Step 8: RSEM on mouse reads
•	Step 9: Get Read Group Information for mouse reads
•	Step 10: Picard Alignment Metrics for mouse reads
•	Step 11: Summary Stats for mouse reads
•	Step 12: MultiQC report generation

RNA - Flowchart

flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p6[RSEM_ALIGNMENT_EXPRESSION]
    o1([Gene Counts]):::output
    o2([Isoform Counts]):::output
    o3([Genome BAM]):::output
    o4([Isoform BAM]):::output
    p8[READ_GROUPS]
    p9[PICARD_ADDORREPLACEREADGROUPS]
    p10[PICARD_REORDERSAM]
    p11[PICARD_SORTSAM]
    p12[PICARD_COLLECTRNASEQMETRICS]

    p19[MULTIQC]
    o10([MultiQC Report]):::output
    p0 -->|Raw Reads| p1
    subgraph top [  ]
    p1 --> p2
    p1 --> p4
    p2 --> j1
    p4 --> j1
    end
    j1 --> p6

    subgraph human [  ]

    p6 --> o1
    p6 --> o2
    p6 --> o3
    p6 --> o4
    o3 --> p8
    p8 --> p9
    p9 --> p10
    p10 --> p11
    end
    
    
    subgraph qc [   ]
    p1 --> p3
    p11 --> p12
    p3 --> p19
    p6 --> p19
    p12 --> p19
    p19 --> o10
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style top stroke:#333,stroke-width:2px
style human stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

RNA - PDX Flowchart

flowchart TD
    p0((Sample))
    p1[FASTP]
    p2[GET_READ_LENGTH]
    p3[FASTQC]
    p4[CHECK_STRANDEDNESS]
    j1((Library\nInformation))
    p5[XENGSORT_CLASSIFY]
    p6[RSEM_ALIGNMENT_EXPRESSION_HUMAN]
    o1([Human Gene Counts]):::output
    o2([Human Isoform Counts]):::output
    o3([Human Genome BAM]):::output
    o4([Human Isoform BAM]):::output
    p7[LYMPHOMA_CLASSIFIER]
    o5([EBV Classifier Score]):::output
    p8[READ_GROUPS_HUMAN]
    p9[PICARD_ADDORREPLACEREADGROUPS_HUMAN]
    p10[PICARD_REORDERSAM_HUMAN]
    p11[PICARD_SORTSAM_HUMAN]
    p12[PICARD_COLLECTRNASEQMETRICS_HUMAN]
    p13[RSEM_ALIGNMENT_EXPRESSION_MOUSE]
    o6([Mouse Gene Counts]):::output
    o7([Mouse Isoform Counts]):::output
    o8([Mouse Genome BAM]):::output
    o9([Mouse Isoform BAM]):::output
    p14[READ_GROUPS_MOUSE]
    p15[PICARD_ADDORREPLACEREADGROUPS_MOUSE]
    p16[PICARD_REORDERSAM_MOUSE]
    p17[PICARD_SORTSAM_MOUSE]
    p18[PICARD_COLLECTRNASEQMETRICS_MOUSE]
    p19[MULTIQC]
    o10([MultiQC Report]):::output
    
    p0 -->|Raw Reads| p1
    subgraph top [  ]
    p1 --> p2
    p1 --> p4
    p1 --> p5
    p2 --> j1
    p4 --> j1
    end
    j1 --> p6

    subgraph human [  ]
    p5 -- "Xengsort:\nHuman Reads" --> p6
    p6 --> o1
    p6 --> o2
    p6 --> o3
    p6 --> o4
    o1 --> p7
    p7 --> o5
    o3 --> p8
    p8 --> p9
    p9 --> p10
    p10 --> p11

    end
    p5 -- "Xengsort:\nMouse Reads" --> p13
    subgraph mouse [  ]
    j1 --> p13
    p13 --> o6
    p13 --> o7
    p13 --> o8
    p13 --> o9
    o8 --> p14
    p14 --> p15
    p15 --> p16
    p16 --> p17

    end
    subgraph qc [   ]
    p17 --> p18
    p1 --> p3
    p5 --> p19
    p3 --> p19
    p6 --> p19
    p11 --> p12
    p12 --> p19
    p13 --> p19
    p18 --> p19
    p19 --> o10
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style top stroke:#333,stroke-width:2px
style human stroke:#333,stroke-width:2px
style mouse stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

Parameters for RNA-seq Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory where the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large, so it should be located on /fastscratch or another directory with ample storage.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names such as READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz.
  • --read_type

    • Default: PE
    • Comment: Type of reads. Options: PE (paired-end) or SE (single-end).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. The fastq_2 column is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
  • --download_data

    • Default: null
    • Comment: Requires --csv_input. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
  • --gen_org

    • Default: mouse
    • Comment: Options: mouse and human.
  • --genome_build

    • Default: GRCm38
    • Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
  • --pdx

    • Default: false
    • Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
  • --classifier_table

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/rna_ebv_classifier/EBVlym_classifier_table_48.txt'
    • Comment: EBV expected gene signatures used in EBV classifier. Only used when '--pdx' is run.
  • --ref_fa

    • Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
    • Comment: Xengsort graft fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
  • --xengsort_host_fasta

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
    • Comment: Xengsort host fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
  • --xengsort_idx_path

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
    • Comment: Xengsort index for deconvolution of human and mouse reads. Used when --pdx is run. If null, Xengsort Index is run using ref_fa and xengsort_host_fasta.
  • --xengsort_idx_name

    • Default: 'hg38_GRCm39-NOD_ShiLtJ'
    • Comment: Xengsort index name associated with files located in xengsort_idx_path or name given to outputs produced by Xengsort Index.
  • --strandedness_ref

    • Default: '/projects/compsci/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/kallisto/kallisto_index'
    • Comment: Modified kallisto index file used only in strandedness determination.
  • --strandedness_gtf

    • Default: '/projects/compsci/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.gtf'
    • Comment: GTF file used with kallisto index file used only in strandedness determination.
  • --strandedness

    • Default: null
    • Comment: Library strandedness override. Supported options are 'reverse_stranded' or 'forward_stranded' or 'non_stranded'. This override parameter is only used when the tool check_strandedness fails to classify the strandedness of a sample. If check_strandedness provides a strand determination, that setting is used.
  • --quality_phred

    • Default: 15
    • Comment: The minimum quality value required for a base to be considered qualified. The default of 15 corresponds to a Phred quality score of >=Q15.
  • --unqualified_perc

    • Default: 40
    • Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
  • --detect_adapter_for_pe

    • Default: false
    • Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
  • --rsem_ref_files

    • Default: /projects/omics_share/mouse/GRCm38/transcriptome/indices/ensembl/v102/bowtie2
    • Comment: Pre-compiled index files. Refers to human indices when --gen_org human. JAX users should not change this, unless using STAR indices.
  • --rsem_ref_prefix

    • Default: 'Mus_musculus.GRCm38.dna.primary_assembly'
    • Comment: Prefix for index files. Refers to human indices when --gen_org human. JAX users should not change this, unless using STAR indices.
  • --seed_length

    • Default: 25
    • Comment: "Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter."
  • --rsem_aligner

    • Default: 'bowtie2'
    • Comment: Options: bowtie2 or star. The aligner algorithm used by RSEM. Note, if using STAR, point rsem_ref_files to STAR based indices.
  • --merge_rna_counts

    • Default: false
    • Comment: Options: false, true. If specified, gene and transcript counts are merged across all samples. Typically used in multi-sample cases.
  • --picard_dict

    • Default: '/projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.dict'
    • Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
  • --ref_flat

    • Default: '/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.refFlat.txt'
    • Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
  • --ribo_intervals

    • Default: '/projects/omics_share/mouse/GRCm38/transcriptome/annotation/ensembl/v102/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.rRNA.interval_list'
    • Comment: The coverage metric calculation step requires this file. Refers to human assembly when --gen_org human. JAX users should not change this parameter.
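
To illustrate how --pattern and --extension combine to select input files, the following Python sketch mimics the matching outside of Nextflow. It is not the pipeline's own code: the helper names expand_braces and pair_reads are hypothetical, and grouping files by the text that precedes _R1 / _R2 is an assumption about how mates are paired.

```python
import fnmatch
import re
from collections import defaultdict


def expand_braces(pattern):
    """Expand {a,b} groups, e.g. '*_R{1,2}*' -> ['*_R1*', '*_R2*']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def pair_reads(file_names, pattern="*_R{1,2}*", extension=".fastq.gz"):
    """Group file names matching pattern + extension by a guessed sample ID."""
    globs = [p + extension for p in expand_braces(pattern)]
    pairs = defaultdict(list)
    for name in sorted(file_names):
        if any(fnmatch.fnmatchcase(name, g) for g in globs):
            # Assumption: the sample ID is everything before the first _R1 / _R2.
            sample_id = re.split(r"_R[12]", name, maxsplit=1)[0]
            pairs[sample_id].append(name)
    return dict(pairs)


if __name__ == "__main__":
    files = [
        "READ_NAME_R1_MoreText.fastq.gz",
        "READ_NAME_R2_MoreText.fastq.gz",
        "notes.txt",
    ]
    print(pair_reads(files))
```

Files that do not match the pattern and extension, such as notes.txt here, are simply ignored by the sketch.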

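The --quality_phred and --unqualified_perc settings listed above can be read together as a per-read filter. The sketch below is a simplified illustration, assuming a read is discarded once the share of its bases below the Phred threshold exceeds the allowed percentage. The function names are hypothetical, Phred+33 quality encoding is assumed, and how a read sitting exactly on the limit is treated is also an assumption; fastp's own implementation is authoritative.

```python
def phred33_to_scores(quality_string):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def passes_quality_filter(base_qualities, quality_phred=15, unqualified_perc=40):
    """Return True if the read keeps enough qualified bases.

    A base is 'unqualified' when its score is below quality_phred. The read
    passes when the percentage of unqualified bases does not exceed
    unqualified_perc (the inclusive boundary is an assumption).
    """
    if not base_qualities:
        return False
    unqualified = sum(1 for score in base_qualities if score < quality_phred)
    return 100.0 * unqualified / len(base_qualities) <= unqualified_perc


if __name__ == "__main__":
    # 'I' encodes Q40 and '+' encodes Q10 under Phred+33.
    read_ok = phred33_to_scores("IIIIII++++")   # 4 of 10 bases below Q15
    read_bad = phred33_to_scores("IIIII+++++")  # 5 of 10 bases below Q15
    print(passes_quality_filter(read_ok), passes_quality_filter(read_bad))
```

With the defaults, a 10-base read with four bases below Q15 is kept in this sketch, while one with five such bases is dropped.
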
Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| rnaseq_report.html | Nextflow autogenerated report |
| trace.txt | Nextflow trace of processes |
| multiqc | MultiQC report summarizing quality metrics across samples in the analysis run |
| */bam/*.genome.bam | Output of the RSEM calculate expression module, process RSEM_ALIGNMENT_EXPRESSION |
| */bam/*.transcript.bam | Output of the RSEM calculate expression module, process RSEM_ALIGNMENT_EXPRESSION |
| */stats/*.genes.results | Output of the RSEM calculate expression module, process RSEM_ALIGNMENT_EXPRESSION |
| */stats/*.isoforms.results | Output of the RSEM calculate expression module, process RSEM_ALIGNMENT_EXPRESSION |
| */stats/*.fastq.gz_stat | Statistics output from quality trimming using the Jax Trimmer process |
| */stats/rsem_aln_*.stats | Statistics output from RSEM_ALIGNMENT_EXPRESSION |
| */stats/*_read_group.txt | Read group information from the sample processed |
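
The *.genes.results files listed above are tab-separated RSEM tables with columns that include gene_id and expected_count. As a rough illustration of combining them downstream, the sketch below collects expected counts per gene across samples. It is not the implementation behind --merge_rna_counts; the function names and the sample-to-path mapping are hypothetical.

```python
import csv


def read_expected_counts(results_path):
    """Read gene_id -> expected_count from one RSEM genes.results file."""
    with open(results_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["gene_id"]: float(row["expected_count"]) for row in reader}


def merge_gene_counts(results_by_sample):
    """Build {gene_id: {sample_id: expected_count}} from several result files.

    results_by_sample maps a sample ID to the path of its genes.results file.
    Genes missing from a sample are simply absent from that gene's inner dict.
    """
    merged = {}
    for sample_id, path in results_by_sample.items():
        for gene_id, count in read_expected_counts(path).items():
            merged.setdefault(gene_id, {})[sample_id] = count
    return merged
```

A call such as merge_gene_counts({"Sample_42": "Sample_42/stats/Sample_42.genes.results"}) would return the nested dictionary for that one sample; the path shown is only an example of the naming convention above.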

PDX Outputs

NOTE: If --pdx is run, sample output directories vary slightly from above. Three output directories per sample are generated:

  1. <SAMPLE_ID>: all stats generated for the sample, as above, with the addition of the following file:

| Naming Convention | Description |
| --- | --- |
| *xengsort_log.txt | Xengsort statistics file |

  2. <SAMPLE_ID>_human: all human-specific quantification (e.g., genes.results) and alignments (e.g., genome.bam), as above.

  3. <SAMPLE_ID>_mouse: all mouse-specific quantification and outputs.

Pipeline Optional Outputs

These outputs will only be saved when --keep_intermediate true is specified.

| Naming Convention | Description (--keep_intermediate true) |
| --- | --- |
| */*_read_group.txt | Read groups for fastq files from READ_GROUPS |
| */bam/*_genome_bam_with_read_group_reorder.bam | From PICARD_REORDERSAM |
| */bam/*_genome_bam_with_read_groups.bam | From PICARD_ADDORREPLACEREADGROUPS |
| */bam/*_sortsam.bam | From PICARD_SORTSAM |

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided as either paired or unpaired reads.

  • The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run. For single-end data, only fastq_1 is filled in.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
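
To make the lane handling concrete, the following Python sketch parses a manifest in the format above and groups FASTQ paths by sampleID, which is the grouping that lane concatenation would operate on. It is a simplified illustration rather than the pipeline's own parser; the function name group_lanes and the returned structure are hypothetical.

```python
import csv
import io
from collections import defaultdict

REQUIRED_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def group_lanes(manifest_text):
    """Group fastq_1 / fastq_2 paths by sampleID, keeping lane order."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    if reader.fieldnames != REQUIRED_HEADER:
        raise ValueError("header must be: " + ",".join(REQUIRED_HEADER))
    samples = defaultdict(lambda: {"fastq_1": [], "fastq_2": []})
    for row in reader:
        entry = samples[row["sampleID"]]
        entry["fastq_1"].append(row["fastq_1"])
        # fastq_2 is missing (None) or empty for single-end rows.
        if row.get("fastq_2"):
            entry["fastq_2"].append(row["fastq_2"])
    return dict(samples)


if __name__ == "__main__":
    manifest = (
        "sampleID,lane,fastq_1,fastq_2\n"
        "Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz\n"
        "Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz\n"
    )
    for sample_id, reads in group_lanes(manifest).items():
        print(sample_id, len(reads["fastq_1"]), "lane(s)")
```

In this sketch, a sample that appears on more than one row ends up with several paths per read, which corresponds to the case where sequences from each lane are concatenated prior to analysis.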