Skip to content

GBRS Pipeline ReadMe

MikeWLloyd edited this page Aug 1, 2024 · 11 revisions

GBRS Documentation

NOTE: EMASE and GBRS are part of the same software package, and share functions between them. This workflow outputs from EMASE and GBRS, but they are derived from the same software package.

GBRS Pipeline (--workflow gbrs)

•   Step 1: Map reads with BOWTIE, and convert SAM to BAM. Note, R1 and R2 are mapped separately for paired-end data.  
•   Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.  
•   Step 3: Paired-end data: Find common alignments and compress EMASE file.  
•   Step 3: Single-end data: Compress EMASE file.  
•   Step 4: Quantify multiway expression  
•   Step 5: Genotype reconstruction   
•   Step 6: Quantify diploid expression with GBRS   
•   Step 7: Interpolate genotypes and genotype probabilities    
•   Step 8: Plot inferred genotypes  
•   Step 9: Export genotype probabilities      

GBRS Flowchart

flowchart TD
    p0((Sample))
    p3[BOWTIE_R1]
    p3.5[BOWTIE_R1]
    p4[BOWTIE_R2]
    p5[GBRS_BAM2EMASE_R1]
    p5.5[GBRS_BAM2EMASE_R1]
    p6[GBRS_BAM2EMASE_R2]
    p7[EMASE_GET_COMMON_ALIGNMENT]
    p8[GBRS_COMPRESS_PE]
    p8.5[GBRS_COMPRESS_SE]

    p9[GBRS_QUANTIFY]

    note1{Paired\nEnd\nData}
    note2{Single\nEnd\nData}

    o1([Compressed EMASE h5 File]):::output

    o2([EMASE Isoform\nTPM Counts]):::output
    o3([EMASE Isoform\nExpected Counts]):::output
    o4([EMASE Isoform\nAlignment Counts]):::output
    o5([EMASE Gene\nTPM Counts]):::output
    o6([EMASE Gene\nExpected Counts]):::output
    o7([EMASE Gene\nAlignment Counts]):::output

    p10[GBRS_RECONSTRUCT]

    p11[GBRS_QUANTIFY_GENOTYPES]

    p12[GBRS_INTERPOLATE]
    p13[GBRS_PLOT]
    p14[GBRS_EXPORT]
    
    o8([GBRS Isoform\nTPM Counts]):::output
    o9([GBRS Isoform\nExpected Counts]):::output
    o10([GBRS Isoform\nAlignment Counts]):::output
    o11([GBRS Gene\nTPM Counts]):::output
    o12([GBRS Gene\nExpected Counts]):::output
    o13([GBRS Gene\nAlignment Counts]):::output

    o14([Infered Genotypes\nby Gene]):::output
    o15([Plotted Genome]):::output
    o16([Interpolated Genoprobs]):::output

    p0 -..-> |If Paired End Data| note1
    p0 -..-> |If Single End Data| note2

    subgraph paired_end [  ]
        note1 --> p3
        note1 --> p4
        subgraph read_1 [  ]

        p3 --> p5
        end

        subgraph read_2 [  ]

        p4 --> p6
        end

        p5 --> p7
        p6 --> p7
        p7 --> p8

    end

    subgraph single_end [  ]
        note2 --> p3.5
        p3.5 --> p5.5
        p5.5 --> p8.5

    end
    p8 --> o1
    p8.5 --> o1
    o1 --> p9

    p9 --> o2
    p9 --> o3
    p9 --> o4
    p9 --> o5
    p9 --> o6
    p9 --> o7
    

    o5 --> p10
    p10 --> o14
    p10 --> p11
    p10 --> p12
    o1 --> p11
    p12 --> p13
    p12 --> p14

    p11 --> o8
    p11 --> o9
    p11 --> o10
    p11 --> o11
    p11 --> o12
    p11 --> o13

    p13 --> o15
    p14 --> o16

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Loading

Parameters for GBRS Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • -w

    • Default: /<PATH>
    • Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,sex,generation,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, --download_data can be specified.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --concat_sampleID_delim

    • Default: "_"
    • Comment: The delimited to split FASTQ file names.
  • --concat_sampleID_positions

    • Default: "1"
    • Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
  • --genome_build

    • Default: GRCm39
    • Comment: Options: GRCm39 or GRCm38.
  • --bowtie_index

    • Default: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
    • Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g., /path/to/bowtie.transcripts where bowtie.transcripts.* are the full set of index files in the directory.
  • --transcripts_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
    • Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column. Note that 'length' is not used in this context. ONLY IDs are used from this file. Can be obtained from prepare_emase workflow (emase.fullTranscripts.info)
  • --gbrs_strain_list

    • Default: "A,B,C,D,E,F,G,H"
    • Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
  • --gene2transcript_csv

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
    • Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.gene2transcripts.tsv)
  • --full_transcript_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
    • Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.pooled.fullTranscripts.info)
  • --emase_model

    • Default: '4'
    • Comment: Options:(1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.
  • --emission_prob_avecs

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/gbrs_emissions_all_tissues.avecs.npz
    • Comment: The emission probability vector file.
  • --trans_prob_dir

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/transition_probabilities
    • Comment: A directory containing all transition probability files.
  • --gene_position_file

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.gene_pos.ordered_ensBuild_105.npz
    • Comment: A python compressed NPZ file containing gene position in primary reference coordinates.
  • --genotype_grid

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.genome_grid.GRCm39.tsv
    • Comment: Simulated marker grid used in genotype inference.
  • --founder_hex_colors

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/founder.hexcolor.info
    • Comment: A file containing hexcode colors for plotting genotypes.
  • --gbrs_expression_threshold

    • Default: 1.5
    • Comment: GBRS expression threshold limit required to quantify a gene.
  • --gbrs_sigma

    • Default: 0.12
    • Comment: GBRS scaling factor.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
emase_report.html Nextflow autogenerated report
trace.txt Nextflow trace of processes
*/gbrs/*.emase.h5 EMASE/GBRS compressed BAM h5 file
*/emase/*.isoforms.alignment_counts Multiway raw alignment counts for all transcript isoforms
*/emase/*.isoforms.expected_read_counts Multiway expected read counts for all transcript isoforms
*/emase/*.isoforms.tpm Multiway transcript per million (tpm) for all transcript isoforms
*/emase/*.genes.alignment_counts Multiway raw alignment counts for all genes
*/emase/*.genes.expected_read_counts Multiway expected read counts for all genes
*/emase/*.genes.tpm Multiway transcript per million (tpm) for all genes
*/gbrs/*.isoforms.alignment_counts Diploid raw alignment counts for all transcript isoforms
*/gbrs/*.isoforms.expected_read_counts Diploid expected read counts for all transcript isoforms
*/gbrs/*.isoforms.tpm Diploid transcript per million (tpm) for all transcript isoforms
*/gbrs/*.genes.alignment_counts Diploid raw alignment counts for all genes
*/gbrs/*.genes.expected_read_counts Diploid expected read counts for all genes
*/gbrs/*.genes.tpm Diploid transcript per million (tpm) for all genes
*/gbrs/*.interpolated.genoprobs.npz Interpolated genotype probabilities in python NPZ format
*/gbrs/*.interpolated.genoprobs.tsv Interpolated genotype probabilities in tab delimited format
*/gbrs/*.plotted.genome.pdf Plotted interpolated genotypes
*/gbrs/*.genotypes.tsv Inferred genotypes for all genes
*/stats/*.bowtie_R1.log Statistics output from read1 mapping from Bowtie
*/stats/*.bowtie_R2.log Statistics output from read2 mapping from Bowtie, if data are paired end

Pipeline Options Outputs

There are no optional outputs for this workflow.

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,sex,generation,lane,fastq_1,fastq_2
Sample_42,F,42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,F,42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,M,10,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,M,80,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,sex,generation,lane,fastq_1,fastq_2
Sample_42,F,42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,F,42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,M,10,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,M,80,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
Clone this wiki locally