Generate Pseudoreference Pipeline ReadMe
MikeWLloyd edited this page Apr 11, 2024
·
7 revisions
• Step 1: Convert VCF to VCI (chain file equivalent)
• Step 2: Patch SNPs into reference
• Step 3: Transform InDELs into patched reference
• Step 4: Convert the reference GTF to strain specific GTF
• Step 5: Convert strain specific GTF to database format
• Step 6a: Extract sequence from strain specific genes in fasta format
• Step 6b: Extract sequence from strain specific transcripts in fasta format
• Step 6c: Extract sequence from strain specific exons in fasta format
flowchart TD
p00((Base\nGenome\nGTF))
p000((Base\nGenome))
p0000((Strain Specific\nSNPs\nand InDELs))
p0[FILTER_GTF]
p1[G2GTOOLS_VCF2VCI]
p2[G2GTOOLS_PATCH]
p3[G2GTOOLS_TRANSFORM]
p4[SAMTOOLS_FAIDX_G2GTOOLS]
p5[G2GTOOLS_CONVERT]
p6[APPEND_DROPPED_CHROMS]
p7[G2GTOOLS_GTF2DB]
p8[G2GTOOLS_EXTRACT_GENES]
p9[G2GTOOLS_EXTRACT_TRANSCRIPTS]
p10[G2GTOOLS_EXTRACT_EXONS]
p11[SAMTOOLS_FAIDX]
o1([Filtered GTF]):::output
o2([SNP/InDEL Patched and Transformed FASTA]):::output
o3([FASTA index]):::output
o4([Strain Specific GTF]):::output
o5([Strain Specific Exons]):::output
o6([Strain Specific Transcripts]):::output
o7([Strain Specific Genes]):::output
o8([Base Genome Index]):::output
p00 --> p0
p0 --> o1
p0 --> p1
p000 --> p2
p0000 --> p1
p1 --> p2
p2 --> p3
p3 --> o2
p3 --> p4
p4 --> o3
p3 --> p5
p5 --> |Default:\nThough Optional| p6
p6 --> o4
p5 -..-> |If Append Not Used| o4
o4 --> p7
p7 --> p8
p7 --> p9
p7 --> p10
p8 --> o7
p9 --> o6
p10 --> o5
p11 --> o8
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
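
As a rough illustration of how Steps 2 and 3 differ, the toy sketch below patches SNPs (length-preserving substitutions) into a short string and then applies InDELs (length-changing edits). It is not how g2gtools is implemented. The function names `patch_snps` and `apply_indels`, the 0-based positions, and the `(pos, ref, alt)` tuple layout are all assumptions made for this example, and the InDELs are assumed not to overlap.

```python
def patch_snps(seq, snps):
    """Substitute single bases; the sequence length is unchanged.

    snps: iterable of (pos, ref, alt) with 0-based positions (an assumption
    for this sketch; VCF itself uses 1-based positions).
    """
    bases = list(seq)
    for pos, ref, alt in snps:
        if bases[pos] != ref:
            raise ValueError(f"reference mismatch at {pos}: {bases[pos]} != {ref}")
        bases[pos] = alt
    return "".join(bases)


def apply_indels(seq, indels):
    """Apply insertions/deletions given as (pos, ref, alt) in input coordinates.

    Edits are applied from the right-most position to the left so that the
    positions of the remaining edits are not shifted by earlier ones.
    """
    for pos, ref, alt in sorted(indels, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference mismatch at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


if __name__ == "__main__":
    reference = "ACGTACGT"
    patched = patch_snps(reference, [(1, "C", "T")])
    transformed = apply_indels(patched, [(3, "T", "TGG"), (5, "CG", "C")])
    print(patched, transformed)
```

Patching leaves every coordinate unchanged, while InDELs shift everything downstream of them. That shift is why the pipeline keeps a VCI file to convert coordinates between the strain and the primary reference.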
- `--pubdir`
  - Default: `/<PATH>`
  - Comment: The directory where saved outputs will be stored.
- `-w`
  - Default: `/<PATH>`
  - Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
- `--snp_vcf`
  - Default: `/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_snps.vcf.gz`
  - Comment: VCF containing only SNPs, which are patched into the primary reference FASTA.
- `--indel_vcf`
  - Default: `/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_indels.vcf.gz`
  - Comment: VCF containing only InDELs, which are transformed into the patched reference FASTA.
- `--primary_reference_fasta`
  - Default: `/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/v105/Mus_musculus.GRCm39.dna.primary_assembly.fa`
  - Comment: The primary reference FASTA file, to which patched SNPs and transformed InDELs are applied.
- `--primary_reference_gtf`
  - Default: `/projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/ensembl/v105/Mus_musculus.GRCm39.105.gtf`
  - Comment: The primary reference GTF file, whose gene, transcript, and exon annotations are converted to strain coordinates.
- `--strain`
  - Default: `129S1_SvImJ,A_J,CAST_EiJ,NOD_ShiLtJ,NZO_HlLtJ,PWK_PhJ,WSB_EiJ`
  - Comment: A comma-delimited string of strains/haplotypes (e.g., 'A_J,CAST_EiJ,...').
- `--genome_version`
  - Default: `39`
  - Comment: A genome ID string (e.g., 39).
- `--gtf_biotype_include`
  - Default: `protein_coding,lncRNA,IG_C_gene,IG_D_gene,IG_J_gene,IG_LV_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene`
  - Comment: A comma-delimited list of biotype terms to include from the full GTF. All other biotype terms will be excluded.
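
The biotype filter described for `--gtf_biotype_include` can be pictured with the minimal sketch below. It assumes the Ensembl-style `gene_biotype "..."` attribute in the ninth GTF column. The helper names `parse_biotype` and `filter_gtf_lines` are hypothetical, and the pipeline's own FILTER_GTF step may work differently.

```python
import re

# Ensembl GTF attribute, e.g.: gene_id "X"; gene_biotype "protein_coding";
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def parse_biotype(attributes):
    """Return the gene_biotype value from a GTF attribute column, or None."""
    match = BIOTYPE_RE.search(attributes)
    return match.group(1) if match else None


def filter_gtf_lines(lines, include):
    """Yield header lines plus records whose biotype is in `include`.

    include: comma-delimited string, in the same shape as the
    --gtf_biotype_include parameter.
    """
    keep = set(include.split(","))
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        if parse_biotype(fields[8]) in keep:
            yield line
```

Records without a recognizable biotype are dropped in this sketch, which matches the "all other terms are excluded" wording but is still a design choice of the example.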
- `--append_chromosomes`
  - Default: `true`
  - Comment: Add back any full chromosomes that are dropped due to a lack of variants in the SNP or InDEL file. Example: no variants are called on chrM, but chrM should be present in the GTF file for downstream EMASE/GBRS use. With no variants present, the g2gtools convert step would drop chrM into the 'unmapped' file. If `append_chromosomes == true`, all fully missing chromosomes are added back to the GTF produced by the convert step. The appended annotations are in source genome coordinates, as no SNPs/InDELs were present.
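
The append behavior can be sketched as a set difference over chromosome names, as below. This is an illustration only: the function names are hypothetical, GTF lines are assumed to be tab-delimited with the sequence name in the first column, and the actual APPEND_DROPPED_CHROMS process may differ.

```python
def chromosomes(gtf_lines):
    """Return the set of sequence names (column 1) in a list of GTF lines."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def append_dropped_chromosomes(source_lines, converted_lines):
    """Append source records for chromosomes absent from the converted GTF.

    Appended records keep their source coordinates, which is valid only
    because a fully dropped chromosome had no SNPs/InDELs to shift them.
    """
    missing = chromosomes(source_lines) - chromosomes(converted_lines)
    appended = [
        line
        for line in source_lines
        if line.strip()
        and not line.startswith("#")
        and line.split("\t", 1)[0] in missing
    ]
    return list(converted_lines) + appended
```

Only chromosomes that are entirely absent are restored. Individual records that failed conversion on an otherwise present chromosome are left out.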
- `--diploid`
  - Default: `false`
  - Comment: Create a diploid VCI file.
- `--keep_fails`
  - Default: `false`
  - Comment: Keep track of VCF lines that could not be converted to the VCI file.
- `--pass_only`
  - Default: `false`
  - Comment: Use only VCF lines that have PASS as the filter value.
- `--quality_filter`
  - Default: `NULL`
  - Comment: Filter on quality (e.g., 'FI=PASS').
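
For orientation, the effect of `--pass_only` can be approximated by the sketch below, which keeps header lines and only those records whose FILTER column (the seventh VCF column) is `PASS`. The function name `keep_pass_only` is hypothetical, and the pipeline's own filtering may handle edge cases differently.

```python
def keep_pass_only(vcf_lines):
    """Yield VCF header lines and records whose FILTER column equals PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if len(fields) > 6 and fields[6] == "PASS":
            yield line
```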
- `--region`
  - Default: `NULL`
  - Comment: A region used in extraction, in the format `seqid:start-end`. This option cannot be used together with the `--bed` option.
- `--bed`
  - Default: `NULL`
  - Comment: A BED file with regions for extraction. This option cannot be used together with the `--region` option.
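
The `seqid:start-end` format and the rule that `--region` and `--bed` are mutually exclusive can be checked with a small validator like the one below. This is a sketch under assumptions: `Region`, `parse_region`, and `check_extraction_options` are hypothetical names, and no claim is made about whether the pipeline treats coordinates as 0-based or 1-based.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    seqid: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse 'seqid:start-end' into a Region, raising ValueError if malformed."""
    # Split on the last colon so sequence names containing ':' still parse.
    seqid, colon, span = text.rpartition(":")
    start_text, dash, end_text = span.partition("-")
    if not colon or not seqid or not dash:
        raise ValueError(f"expected seqid:start-end, got {text!r}")
    start, end = int(start_text), int(end_text)  # int() raises ValueError on junk
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return Region(seqid, start, end)


def check_extraction_options(region: Optional[str], bed: Optional[str]) -> None:
    """Reject the combination the documentation forbids."""
    if region is not None and bed is not None:
        raise ValueError("--region and --bed cannot be used together")
```

Validating these options before launching a run gives an immediate error instead of a failure partway through extraction.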
NOTE: `*` represents a wildcard, a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
|---|---|
| `generate_pseudoreference_report.html` | Nextflow autogenerated report |
| `trace.txt` | Nextflow trace of processes |
| `Mus_musculus.GRCm39.105.filtered.gtf` | Optional: filtered GTF file, present if `gtf_biotype_include` is specified. Terms provided in that list are included; all other biotype terms are excluded |
| `*/g2gtools/*.fa*` | SNP/InDEL patched and transformed pseudo-reference FASTA and FASTA index files generated by g2gtools |
| `*/g2gtools/*.vci.gz*` | VCI file that contains all SNP and InDEL coordinate conversions between the strain and the primary reference. Chain file equivalent |
| `*/g2gtools/*.gtf` | GTF file in strain coordinates |
| `*/g2gtools/*.gtf.db` | GTF database file in g2gtools format. Used for extraction of strain-specific sequences (genes, transcripts, and exons) |
| `*/g2gtools/*.genes.fa` | FASTA-formatted strain-specific gene sequences |
| `*/g2gtools/*.transcripts.fa` | FASTA-formatted strain-specific transcript sequences |
| `*/g2gtools/*.exons.fa` | FASTA-formatted strain-specific exon sequences |

These outputs are saved only when `--keep_intermediate true` is specified.

| Naming Convention | Description (`--keep_intermediate true`) |
|---|---|
| `*/g2gtools/*.patched.fa` | FASTA with SNPs patched in, before InDELs are transformed |
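
To check a publish directory against the naming conventions above, the wildcard patterns can be matched with Python's `fnmatch` module, as in the sketch below. The short descriptions are paraphrased from the tables, the example path `A_J/g2gtools/A_J.39.genes.fa` is hypothetical, and the patterns are ordered from most to least specific because `*/g2gtools/*.fa*` also matches the gene, transcript, exon, and patched FASTA files.

```python
from fnmatch import fnmatchcase

# Ordered most-specific first; the first matching pattern wins.
OUTPUT_PATTERNS = [
    ("*/g2gtools/*.genes.fa", "strain-specific gene sequences"),
    ("*/g2gtools/*.transcripts.fa", "strain-specific transcript sequences"),
    ("*/g2gtools/*.exons.fa", "strain-specific exon sequences"),
    ("*/g2gtools/*.patched.fa", "SNP-patched FASTA (intermediate)"),
    ("*/g2gtools/*.vci.gz*", "VCI file"),
    ("*/g2gtools/*.gtf.db", "GTF database"),
    ("*/g2gtools/*.gtf", "GTF in strain coordinates"),
    ("*/g2gtools/*.fa*", "pseudo-reference FASTA or FASTA index"),
]


def describe_output(path):
    """Return the description of the first pattern matching `path`, or None."""
    for pattern, description in OUTPUT_PATTERNS:
        if fnmatchcase(path, pattern):
            return description
    return None


if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the patterns.
    print(describe_output("A_J/g2gtools/A_J.39.genes.fa"))
```

Note that `fnmatch` lets `*` match across `/`, so these patterns are looser than shell globbing.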