-
Notifications
You must be signed in to change notification settings - Fork 10
EMASE Pipeline ReadMe
MikeWLloyd edited this page Aug 1, 2024
·
11 revisions
• Step 1: Map reads with BOWTIE, and convert SAM to BAM. Note, R1 and R2 are mapped separately for paired-end data.
• Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.
• Step 3: Paired-end data: Find common alignments and compress EMASE file.
• Step 3: Single-end data: Compress EMASE file.
• Step 4: Quantify multiway expression. Note that this step while called 'GBRS Quantify' uses the EMASE algorithm. EMASE and GBRS share a common code base.
flowchart TD
p0((Sample))
p3[BOWTIE_R1]
p3.5[BOWTIE_R1]
p4[BOWTIE_R2]
p5[GBRS_BAM2EMASE_R1]
p5.5[GBRS_BAM2EMASE_R1]
p6[GBRS_BAM2EMASE_R2]
p7[EMASE_GET_COMMON_ALIGNMENT]
p8[GBRS_COMPRESS_PE]
p8.5[GBRS_COMPRESS_SE]
p9[GBRS_QUANTIFY]
note1{Paired\nEnd\nData}
note2{Single\nEnd\nData}
o1([Compressed EMASE h5 File]):::output
o2([EMASE Isoform\nTPM Counts]):::output
o3([EMASE Isoform\nExpected Counts]):::output
o4([EMASE Isoform\nAlignment Counts]):::output
o5([EMASE Gene\nTPM Counts]):::output
o6([EMASE Gene\nExpected Counts]):::output
o7([EMASE Gene\nAlignment Counts]):::output
p0 -..-> |If Paired End Data| note1
p0 -..-> |If Single End Data| note2
subgraph paired_end [ ]
note1 --> p3
note1 --> p4
subgraph read_1 [ ]
p3 --> p5
end
subgraph read_2 [ ]
p4 --> p6
end
p5 --> p7
p6 --> p7
p7 --> p8
end
subgraph single_end [ ]
note2 --> p3.5
p3.5 --> p5.5
p5.5 --> p8.5
end
p8 --> o1
p8.5 --> o1
o1 --> p9
p9 --> o2
p9 --> o3
p9 --> o4
p9 --> o5
p9 --> o6
p9 --> o7
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
-
--pubdir
- Default:
/<PATH>
- Comment: The directory that the saved outputs will be stored.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
- Default:
-
--csv_input
- Default:
null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, *
--download_data
can be specified.
- Default:
-
--sample_folder
- Default:
/<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- Default:
-
--extension
- Default:
.fastq.gz
- Comment: The expected extension for the input read files.
- Default:
-
--pattern
- Default:
"*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this
READ_NAME_R1_MoreText.fastq.gz
orREAD_NAME_R1.fastq.gz
- Default:
-
--read_type
- Default:
PE
- Comment: Options:
PE
andSE
. Type of reads: paired end (PE) or single end (SE).
- Default:
-
--concat_lanes
- Default:
false
- Comment: Options:
false
andtrue
. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- Default:
-
--concat_sampleID_delim
- Default:
"_"
- Comment: The delimited to split FASTQ file names.
- Default:
-
--concat_sampleID_positions
- Default:
"1"
- Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
- Default:
-
--genome_build
- Default:
GRCm39
- Comment: Options: GRCm39 or GRCm38.
- Default:
-
--bowtie_index
- Default:
/projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
- Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g.,
/path/to/bowtie.transcripts
where bowtie.transcripts.* are the full set of index files in the directory.
- Default:
-
--transcripts_info
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
- Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column. Note that 'length' is not used in this context. ONLY IDs are used from this file. Can be obtained from
prepare_emase
workflow (emase.fullTranscripts.info)
- Default:
-
--gbrs_strain_list
- Default:
"A,B,C,D,E,F,G,H"
- Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
- Default:
-
--gene2transcript_csv
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
- Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from
prepare_emase
workflow (emase.gene2transcripts.tsv)
- Default:
-
--full_transcript_info
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
- Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from
prepare_emase
workflow (emase.pooled.fullTranscripts.info)
- Default:
-
--emase_model
- Default:
'4'
- Comment: Options:(1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.
- Default:
NOTE: *
Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
emase_report.html |
Nextflow autogenerated report |
trace.txt |
Nextflow trace of processes |
*/gbrs/*.emase.h5 |
EMASE/GBRS compressed BAM h5 file |
*/emase/*.isoforms.alignment_counts |
Multiway raw alignment counts for all transcript isoforms |
*/emase/*.isoforms.expected_read_counts |
Multiway expected read counts for all transcript isoforms |
*/emase/*.isoforms.tpm |
Multiway transcript per million (tpm) for all transcript isoforms |
*/emase/*.genes.alignment_counts |
Multiway raw alignment counts for all genes |
*/emase/*.genes.expected_read_counts |
Multiway expected read counts for all genes |
*/emase/*.genes.tpm |
Multiway transcript per million (tpm) for all genes |
*/stats/*.bowtie_R1.log |
Statistics output from read1 mapping from Bowtie |
*/stats/*.bowtie_R2.log |
Statistics output from read2 mapping from Bowtie, if data are paired end |
There are no optional outputs for this workflow.
The required input header is: sampleID,lane,fastq_1,fastq_2
. Samples can be provided either paired or un-paired.
- The
sampleID
column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID. - The
lane
column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis. - The
fastq_1
andfastq_2
columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz