Skip to content

EMASE Pipeline ReadMe

MikeWLloyd edited this page Aug 1, 2024 · 11 revisions

EMASE Documentation

EMASE Pipeline (--workflow emase)

•   Step 1: Map reads with BOWTIE, and convert SAM to BAM. Note, R1 and R2 are mapped separately for paired-end data.  
•   Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.  
•   Step 3: Paired-end data: Find common alignments and compress EMASE file.  
•   Step 3: Single-end data: Compress EMASE file.  
•   Step 4: Quantify multiway expression. Note that this step while called 'GBRS Quantify' uses the EMASE algorithm. EMASE and GBRS share a common code base.   

EMASE Flowchart

flowchart TD
    p0((Sample))
    p3[BOWTIE_R1]
    p3.5[BOWTIE_R1]
    p4[BOWTIE_R2]
    p5[GBRS_BAM2EMASE_R1]
    p5.5[GBRS_BAM2EMASE_R1]
    p6[GBRS_BAM2EMASE_R2]
    p7[EMASE_GET_COMMON_ALIGNMENT]
    p8[GBRS_COMPRESS_PE]
    p8.5[GBRS_COMPRESS_SE]
        

    p9[GBRS_QUANTIFY]

    note1{Paired\nEnd\nData}
    note2{Single\nEnd\nData}

    o1([Compressed EMASE h5 File]):::output

    o2([EMASE Isoform\nTPM Counts]):::output
    o3([EMASE Isoform\nExpected Counts]):::output
    o4([EMASE Isoform\nAlignment Counts]):::output
    o5([EMASE Gene\nTPM Counts]):::output
    o6([EMASE Gene\nExpected Counts]):::output
    o7([EMASE Gene\nAlignment Counts]):::output

    p0 -..-> |If Paired End Data| note1
    p0 -..-> |If Single End Data| note2

    subgraph paired_end [  ]
        note1 --> p3
        note1 --> p4
        subgraph read_1 [  ]

        p3 --> p5
        end

        subgraph read_2 [  ]

        p4 --> p6
        end
        p5 --> p7
        p6 --> p7
        p7 --> p8

    end

    subgraph single_end [  ]
        note2 --> p3.5
        p3.5 --> p5.5
        p5.5 --> p8.5

    end
    p8 --> o1
    p8.5 --> o1
    o1 --> p9

    p9 --> o2

    p9 --> o3
    p9 --> o4
    p9 --> o5
    p9 --> o6
    p9 --> o7

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
Loading

Parameters for EMASE Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • -w

    • Default: /<PATH>
    • Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, * --download_data can be specified.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --concat_sampleID_delim

    • Default: "_"
    • Comment: The delimited to split FASTQ file names.
  • --concat_sampleID_positions

    • Default: "1"
    • Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
  • --genome_build

    • Default: GRCm39
    • Comment: Options: GRCm39 or GRCm38.
  • --bowtie_index

    • Default: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
    • Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g., /path/to/bowtie.transcripts where bowtie.transcripts.* are the full set of index files in the directory.
  • --transcripts_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
    • Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column. Note that 'length' is not used in this context. ONLY IDs are used from this file. Can be obtained from prepare_emase workflow (emase.fullTranscripts.info)
  • --gbrs_strain_list

    • Default: "A,B,C,D,E,F,G,H"
    • Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
  • --gene2transcript_csv

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
    • Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.gene2transcripts.tsv)
  • --full_transcript_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
    • Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.pooled.fullTranscripts.info)
  • --emase_model

    • Default: '4'
    • Comment: Options:(1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
emase_report.html Nextflow autogenerated report
trace.txt Nextflow trace of processes
*/gbrs/*.emase.h5 EMASE/GBRS compressed BAM h5 file
*/emase/*.isoforms.alignment_counts Multiway raw alignment counts for all transcript isoforms
*/emase/*.isoforms.expected_read_counts Multiway expected read counts for all transcript isoforms
*/emase/*.isoforms.tpm Multiway transcript per million (tpm) for all transcript isoforms
*/emase/*.genes.alignment_counts Multiway raw alignment counts for all genes
*/emase/*.genes.expected_read_counts Multiway expected read counts for all genes
*/emase/*.genes.tpm Multiway transcript per million (tpm) for all genes
*/stats/*.bowtie_R1.log Statistics output from read1 mapping from Bowtie
*/stats/*.bowtie_R2.log Statistics output from read2 mapping from Bowtie, if data are paired end

Pipeline Options Outputs

There are no optional outputs for this workflow.

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
Clone this wiki locally