Generate Pseudoreference Pipeline ReadMe

MikeWLloyd edited this page Apr 11, 2024 · 7 revisions

Generate Pseudoreference Documentation

Generate-Pseudoreference Pipeline (--workflow generate_pseudoreference)

•   Step 1: Convert VCF to VCI (chain file equivalent)  
•   Step 2: Patch SNPs into reference  
•   Step 3: Transform InDELs into patched reference  
•   Step 4: Convert the reference GTF to strain specific GTF  
•   Step 5: Convert strain specific GTF to database format  
•   Step 6a: Extract sequence from strain specific genes in fasta format
•   Step 6b: Extract sequence from strain specific transcripts in fasta format
•   Step 6c: Extract sequence from strain specific exons in fasta format   
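
To make Steps 2 through 4 concrete, the toy sketch below patches SNPs into a short string, applies InDELs, and shifts a reference coordinate into strain coordinates. It is an illustration of the idea only and is not how g2gtools is implemented. The function names, the 0-based positions, and the assumption that variants never overlap are choices made for this example.

```python
def patch_snps(seq, snps):
    """Substitute single bases; the sequence length is unchanged (Step 2 idea)."""
    bases = list(seq)
    for pos, ref, alt in snps:
        if bases[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos] = alt
    return "".join(bases)


def transform_indels(seq, indels):
    """Apply non-overlapping insertions and deletions (Step 3 idea).

    Each InDEL is (pos, ref, alt) in reference coordinates, VCF-style,
    where the first base of ref and alt is the shared anchor base.
    """
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(indels):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        pieces.append(seq[cursor:pos])  # untouched sequence before the variant
        pieces.append(alt)              # variant allele replaces the ref allele
        cursor = pos + len(ref)
    pieces.append(seq[cursor:])
    return "".join(pieces)


def to_strain_coordinate(ref_pos, indels):
    """Shift a reference position by the net length change of upstream InDELs.

    Positions inside a deleted span have no strain equivalent; this sketch
    does not handle them.
    """
    shift = sum(len(alt) - len(ref)
                for pos, ref, alt in indels
                if pos + len(ref) <= ref_pos)
    return ref_pos + shift


reference = "ACGTACGT"
snps = [(1, "C", "T")]                       # C -> T at position 1
indels = [(3, "T", "TGG"), (5, "CG", "C")]   # 2 bp insertion, 1 bp deletion

patched = patch_snps(reference, snps)
strain = transform_indels(patched, indels)
```

In this example the final base of the reference (position 7) lands at position 8 in the strain sequence, because the upstream insertion adds two bases and the deletion removes one. A VCI or chain file records this kind of offset so that annotations can be converted between coordinate systems.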

Generate Pseudoreference Flowchart

flowchart TD
    p00((Base\nGenome\nGTF))
    p000((Base\nGenome))
    p0000((Strain Specific\nSNPs\nand InDELs))
    p0[FILTER_GTF]

    p1[G2GTOOLS_VCF2VCI]

    p2[G2GTOOLS_PATCH]

    p3[G2GTOOLS_TRANSFORM]
    p4[SAMTOOLS_FAIDX_G2GTOOLS]

    p5[G2GTOOLS_CONVERT]

    p6[APPEND_DROPPED_CHROMS]
    p7[G2GTOOLS_GTF2DB]

    p8[G2GTOOLS_EXTRACT_GENES]

    p9[G2GTOOLS_EXTRACT_TRANSCRIPTS]

    p10[G2GTOOLS_EXTRACT_EXONS]

    p11[SAMTOOLS_FAIDX]

    o1([Filtered GTF]):::output
    o2([SNP/InDEL Patched and Transformed FASTA]):::output
    o3([FASTA index]):::output
    o4([Strain Specific GTF]):::output
    o5([Strain Specific Exons]):::output
    o6([Strain Specific Transcripts]):::output
    o7([Strain Specific Genes]):::output
    o8([Base Genome Index]):::output

    p00 --> p0
    p0 --> o1
    p0 --> p1
    p000 --> p2
    p0000 --> p1
    p1 --> p2
    p2 --> p3
    p3 --> o2
    p3 --> p4
    p4 --> o3
    p3 --> p5
    p5 --> |Default:\nThough Optional| p6
    p6 --> o4
    p5 -..-> |If Append Not Used| o4
    o4 --> p7
    p7 --> p8
    p7 --> p9
    p7 --> p10

    p8 --> o7
    p9 --> o6
    p10 --> o5

    p11 --> o8

    
    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for Generate-Pseudoreference Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory where saved outputs are stored.
  • -w

    • Default: /<PATH>
    • Comment: The working directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --snp_vcf

    • Default: /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_snps.vcf.gz
    • Comment: VCF containing only SNPs for patching into primary reference fasta.
  • --indel_vcf

    • Default: /projects/compsci/omics_share/mouse/GRCm39/genome/annotation/snps_indels/rel_2112_v8/mgp_REL2021_indels.vcf.gz
    • Comment: VCF containing only InDELs for transforming into primary reference fasta.
  • --primary_reference_fasta

    • Default: /projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/v105/Mus_musculus.GRCm39.dna.primary_assembly.fa
    • Comment: The primary reference fasta file, where patched SNPs and transformed InDELs are applied.
  • --primary_reference_gtf

    • Default: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/annotation/ensembl/v105/Mus_musculus.GRCm39.105.gtf
    • Comment: The primary reference GTF file, used to patch and transform genes, transcripts, and exons.
  • --strain

    • Default: 129S1_SvImJ,A_J,CAST_EiJ,NOD_ShiLtJ,NZO_HlLtJ,PWK_PhJ,WSB_EiJ
    • Comment: A comma delimited string of strains/haplotypes. (e.g., 'A_J,CAST_EiJ,...')
  • --genome_version

    • Default: 39
    • Comment: A genome ID string (e.g., 39)
  • --gtf_biotype_include

    • Default: protein_coding,lncRNA,IG_C_gene,IG_D_gene,IG_J_gene,IG_LV_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene
    • Comment: A comma delimited list of terms to include from the full GTF. All other biotype terms will be excluded.
  • --append_chromosomes

    • Default: true
    • Comment: Add back any full chromosomes that are dropped because they have no variants in the SNP or InDEL file.
      Example: No variants are called on chrM, but chrM should be present in the GTF file for downstream EMASE/GBRS use.
      With no variants present, the G2Gtools convert step would drop chrM into the 'unmapped' file.
      If append_chromosomes == true, then all fully missing chromosomes are added back to the GTF in the
      convert step. The appended annotations in the GTF remain in the source genome coordinates,
      because no SNPs/InDELs were present.
  • --diploid

    • Default: false
    • Comment: Create diploid VCI file
  • --keep_fails

    • Default: false
    • Comment: Keep track of VCF lines that could not be converted to the VCI file
  • --pass_only

    • Default: false
    • Comment: Use only VCF lines that have a PASS for the filter value
  • --quality_filter

    • Default: NULL
    • Comment: Filter on quality (e.g., 'FI=PASS')
  • --region

    • Default: NULL
    • Comment: A region used in extraction, in the format seqid:start-end. This option cannot be used together with the bed option.
  • --bed

    • Default: NULL
    • Comment: A BED file with regions for extraction. This option cannot be used with the region flag.
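
The constraints described above (a comma-delimited strain list, a seqid:start-end region, and region and bed being mutually exclusive) can be checked before a run is launched. The sketch below is a hypothetical pre-flight helper, not part of the pipeline: the function names are invented for illustration, the start <= end check is an assumption, and the returned list is only the arguments that would follow the `nextflow run` command and the pipeline script.

```python
import re

# seqid:start-end, the format described for --region
REGION_PATTERN = re.compile(r"^(?P<seqid>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_strains(value):
    """Split a comma-delimited strain string into a list of names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_region(value):
    """Return (seqid, start, end) or raise if the string is malformed."""
    match = REGION_PATTERN.match(value)
    if match is None:
        raise ValueError(f"region must look like seqid:start-end, got {value!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:  # assumption: the pipeline expects start <= end
        raise ValueError(f"region start {start} is greater than end {end}")
    return match["seqid"], start, end


def build_args(params):
    """Validate a parameter dict and turn it into command-line arguments."""
    if params.get("region") and params.get("bed"):
        raise ValueError("--region and --bed cannot be used together")
    if params.get("region"):
        parse_region(params["region"])
    if "strain" in params and not parse_strains(params["strain"]):
        raise ValueError("--strain needs at least one strain name")

    args = ["--workflow", "generate_pseudoreference"]
    for name, value in params.items():
        if value is None:  # NULL defaults are simply left off
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        flag = "-w" if name == "w" else f"--{name}"  # -w is the work directory
        args.extend([flag, str(value)])
    return args


args = build_args({"strain": "A_J,CAST_EiJ", "region": "1:100-200", "diploid": False})
```

Failing early on a malformed region or a conflicting bed file avoids discovering the problem only after the VCI conversion and patching steps have already run.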

Pipeline Default Outputs

NOTE: * represents a wildcard, a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| generate_pseudoreference_report.html | Nextflow autogenerated report |
| trace.txt | Nextflow trace of processes |
| Mus_musculus.GRCm39.105.filtered.gtf | Optional: Filtered GTF file, present if gtf_biotype_include is specified. Terms provided in that list are included; all other biotype terms are excluded |
| */g2gtools/*.fa* | SNP/InDEL patched and transformed pseudo-reference fasta and fasta index files generated by g2gtools |
| */g2gtools/*.vci.gz* | VCI file that contains all SNP and InDEL coordinate conversions between the strain and the primary reference. Chain file equivalent |
| */g2gtools/*.gtf | GTF file in strain coordinates |
| */g2gtools/*.gtf.db | GTF database file in g2gtools format. Used for extraction of strain specific sequences (genes, transcripts, and exons) |
| */g2gtools/*.genes.fa | Fasta formatted strain specific gene sequences |
| */g2gtools/*.transcripts.fa | Fasta formatted strain specific transcript sequences |
| */g2gtools/*.exons.fa | Fasta formatted strain specific exon sequences |
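
Two of the GTF outputs listed above are shaped by parameters: the filtered GTF by --gtf_biotype_include, and the strain GTF by --append_chromosomes. The sketch below shows one way both behaviours could work on GTF lines held in memory. It is not the pipeline's code. It assumes the filter reads the Ensembl `gene_biotype` attribute (the pipeline may use a different attribute), and the helper names and example records are invented for illustration.

```python
import re

# Assumption: biotype is read from the Ensembl-style gene_biotype attribute.
BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def filter_gtf_lines(lines, include):
    """Keep header lines and records whose biotype is in the include list."""
    keep = set(include)
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        match = BIOTYPE.search(line)
        if match and match.group(1) in keep:
            kept.append(line)
    return kept


def chromosomes(lines):
    """Return the set of sequence names (GTF column 1) present in the records."""
    return {line.split("\t", 1)[0] for line in lines if not line.startswith("#")}


def append_dropped_chromosomes(reference_lines, converted_lines):
    """Add back records for chromosomes that are entirely absent after conversion."""
    missing = chromosomes(reference_lines) - chromosomes(converted_lines)
    restored = [line for line in reference_lines
                if not line.startswith("#") and line.split("\t", 1)[0] in missing]
    return converted_lines + restored  # restored records keep reference coordinates


def gtf_line(chrom, start, end, gene_id, biotype):
    """Build a minimal tab-separated GTF gene record for the example."""
    attrs = f'gene_id "{gene_id}"; gene_biotype "{biotype}";'
    return "\t".join([chrom, "ensembl", "gene", str(start), str(end), ".", "+", ".", attrs])


reference = [
    gtf_line("1", 100, 500, "geneA", "protein_coding"),
    gtf_line("1", 900, 950, "geneB", "snoRNA"),
    gtf_line("MT", 10, 80, "geneC", "protein_coding"),  # MT: Ensembl name for chrM
]
filtered = filter_gtf_lines(reference, ["protein_coding", "lncRNA"])

# Pretend conversion shifted geneA and dropped MT because it had no variants.
converted = [gtf_line("1", 102, 502, "geneA", "protein_coding")]
final = append_dropped_chromosomes(filtered, converted)
```

Here the snoRNA record is removed by the filter, and the MT record is restored unchanged after conversion, which matches the statement that appended annotations stay in source genome coordinates.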

Pipeline Optional Outputs

These outputs are saved only when --keep_intermediate true is specified.

| Naming Convention | Description (--keep_intermediate true) |
| --- | --- |
| */g2gtools/*.patched.fa | Fasta formatted reference with SNPs patched in, without InDELs |