Prepare DO GBRS Input Pipeline ReadMe

MikeWLloyd edited this page Apr 11, 2024 · 7 revisions

Prepare Diversity Outbred GBRS Input Files Documentation

NOTE: This workflow is hard-coded to generate files in GRCm39 coordinates, for two reasons:

  1. Gene and marker positions must be converted from base-pair locations to centimorgan positions, and the R package used for that conversion supports only GRCm39.
  2. The pseudo-marker generation step is based on telomeric positions in GRCm39 coordinates.

Prepare DO GBRS Input Pipeline (--workflow prep_do_gbrs_inputs)

•   Step 1: Generate transition probabilities  
•   Step 2: Parse transition probabilities to NPZ format  
•   Step 3: Parse gene position file to NPZ format  
•   Step 4: Generate pseudo-marker grid    
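
Steps 2 and 3 write their results in NPZ format, which is NumPy's archive format: a zip file holding one .npy member per array. As a minimal sketch, the array names in such a file can be listed with the Python standard library alone. The helper name `list_npz_arrays` and the demo member name are hypothetical; this page does not document which array names the pipeline's NPZ files actually contain.

```python
import io
import zipfile


def list_npz_arrays(npz_file):
    """Return the array names stored in an NPZ archive without loading any data.

    npz_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(npz_file) as archive:
        return [
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        ]


# Demo on an in-memory stand-in archive. The member is an empty placeholder,
# not a real array; only the archive layout is being illustrated.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("example_array.npy", b"")

demo_names = list_npz_arrays(demo)
print(demo_names)
```

For real use, pass the path of one of the pipeline's .npz outputs instead of the in-memory demo archive.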

Prepare DO GBRS Input Flowchart

flowchart TD
    i1((Generation\nCount))

    p0[DO_TRANSITION_PROBABILITIES]

    p1[PARSE_TRANSITION_PROBABILITIES_FEMALE]

    p2[PARSE_TRANSITION_PROBABILITIES_MALE]

    p3[PARSE_GENE_POSITONS]

    p4[GENERATE_GRID_FILE]

    o1([Female Transition Probabilities\nBy Generation]):::output
    o2([Male Transition Probabilities\nBy Generation]):::output
    o3([Gene Position NPZ File]):::output
    o4([Pseudo Genotype Grid Reference]):::output

    i1 --> p0
    p0 --> p1
    p0 --> p2
    p0 --> p3

    p1 --> o1
    p2 --> o2
    p3 --> o3
    
    p4 --> o4

    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for Prepare DO GBRS Input Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory in which the saved outputs will be stored.
  • -w

    • Default: /<PATH>
    • Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --num_generations

    • Default: 100
    • Comment: The number of generations for which to calculate transition probabilities.
  • --ensembl_build

    • Default: 105
    • Comment: The Ensembl build number used to extract gene names and locations with the R package biomaRt.
  • --emase_gene2transcript

    • Default: /<PATH>
    • Comment: A file containing all gene-to-transcript ID translations. NOTE: These IDs must not contain haplotype IDs. The file can be obtained from the prepare_emase workflow (emase.gene2transcripts.tsv).
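
As a rough illustration of how the parameters above fit together, the sketch below assembles a command line for one run. The flags and the defaults of 100 and 105 are taken from this page. Everything else is an assumption: the entry script name `main.nf`, the placeholder paths, and the helper name `build_command`. A real run may also need site-specific options that are not documented here.

```python
import shlex


def build_command(pubdir, workdir, emase_gene2transcript,
                  num_generations=100, ensembl_build=105,
                  entry_script="main.nf"):
    """Return the argument list for one run of the prep_do_gbrs_inputs workflow."""
    return [
        "nextflow", "run", entry_script,  # entry_script is an assumption
        "--workflow", "prep_do_gbrs_inputs",
        "--pubdir", str(pubdir),
        "-w", str(workdir),               # Nextflow work directory
        "--num_generations", str(num_generations),
        "--ensembl_build", str(ensembl_build),
        "--emase_gene2transcript", str(emase_gene2transcript),
    ]


# Placeholder paths, to be replaced with real locations.
cmd = build_command(
    pubdir="/path/to/pubdir",
    workdir="/path/to/workdir",
    emase_gene2transcript="/path/to/emase.gene2transcripts.tsv",
)
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps paths that contain spaces correctly quoted when the line is pasted into a shell.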

Pipeline Default Outputs

NOTE: * represents a wildcard, a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| `prep_do_gbrs_inputs_report.html` | Nextflow autogenerated report |
| `trace.txt` | Nextflow trace of processes |
| `tranprob.DO.G*.*.npz` | Transition probability files for each generation and sex |
| `gene_list_ensemblBuild_*.tsv` | Tab delimited file of genes and gene positions in relevant build coordinates |
| `ref.gene_pos.ordered_ensBuild_*.npz` | List of genes and gene positions in relevant build coordinates in NPZ format |
| `ref.genome_grid.GRCm39.tsv` | Pseudo-marker grid file |
| `transprob_matrix_plots/*.pdf` | Validation plots showing the decay of probabilities among haplotypes by generation |
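
The naming conventions above can be used to check which expected outputs are present in a finished run. The sketch below matches file names against those wildcard patterns with the standard `fnmatch` module. The patterns and descriptions are copied from the list above; the helper name `describe_output` and the sample file names (including the `F` sex code) are hypothetical and only illustrate how the wildcards are filled.

```python
from fnmatch import fnmatchcase

# Wildcard patterns and descriptions as listed in the outputs section.
OUTPUT_PATTERNS = {
    "prep_do_gbrs_inputs_report.html": "Nextflow autogenerated report",
    "trace.txt": "Nextflow trace of processes",
    "tranprob.DO.G*.*.npz": "Transition probability files for each generation and sex",
    "gene_list_ensemblBuild_*.tsv": "Tab delimited file of genes and gene positions in relevant build coordinates",
    "ref.gene_pos.ordered_ensBuild_*.npz": "List of genes and gene positions in relevant build coordinates in NPZ format",
    "ref.genome_grid.GRCm39.tsv": "Pseudo-marker grid file",
    "transprob_matrix_plots/*.pdf": "Validation plots showing the decay of probabilities among haplotypes by generation",
}


def describe_output(relative_path):
    """Return the description of the first matching pattern, or None if nothing matches."""
    for pattern, description in OUTPUT_PATTERNS.items():
        if fnmatchcase(relative_path, pattern):
            return description
    return None


# Hypothetical file names, used only to exercise the patterns.
for name in ("tranprob.DO.G5.F.npz", "gene_list_ensemblBuild_105.tsv", "notes.txt"):
    print(name, "->", describe_output(name))
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.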

Pipeline Optional Outputs

There are no optional outputs for this workflow.
