Prepare DO GBRS Input Pipeline ReadMe

Prepare Diversity Outbred GBRS Input Files Documentation

NOTE: This workflow is hard-coded to generate files in GRCm39 coordinates, for two reasons:

•   Gene and marker positions must be converted from base-pair locations to centimorgan positions, and the R package used for this conversion supports only GRCm39.
•   The pseudo-marker generation step is based on telomeric positions in GRCm39 coordinates.

Prepare DO GBRS Input Pipeline (--workflow prep_do_gbrs_inputs)

•   Step 1: Generate transition probabilities  
•   Step 2: Parse transition probabilities to NPZ format  
•   Step 3: Parse gene position file to NPZ format  
•   Step 4: Generate pseudo-marker grid
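
Steps 2 and 3 write NPZ files. NPZ is NumPy's archive format: a ZIP container holding one `.npy` member per stored array. The sketch below lists the array names in such a file using only the Python standard library, so it makes no assumption about which keys the pipeline writes. The function name `list_npz_arrays` is hypothetical and not part of the pipeline.

```python
import zipfile


def list_npz_arrays(npz_path):
    """Return the names of the arrays stored in an NPZ archive, in archive order."""
    suffix = ".npy"
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as a member named "<array name>.npy";
        # dropping the suffix recovers the array name.
        return [
            member[: -len(suffix)]
            for member in archive.namelist()
            if member.endswith(suffix)
        ]
```

Calling `list_npz_arrays` on one of the generated `.npz` outputs shows which arrays it contains before loading them with NumPy.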

Prepare DO GBRS Input Flowchart

flowchart TD
    i1((Generation\nCount))

    p0[DO_TRANSITION_PROBABILITIES]

    p1[PARSE_TRANSITION_PROBABILITIES_FEMALE]

    p2[PARSE_TRANSITION_PROBABILITIES_MALE]

    p3[PARSE_GENE_POSITONS]

    p4[GENERATE_GRID_FILE]

    o1([Female Transition Probabilities\nBy Generation]):::output
    o2([Male Transition Probabilities\nBy Generation]):::output
    o3([Gene Position NPZ File]):::output
    o4([Pseudo Genotype Grid Reference]):::output

    i1 --> p0
    p0 --> p1
    p0 --> p2
    p0 --> p3

    p1 --> o1
    p2 --> o2
    p3 --> o3
    
    p4 --> o4

    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for Prepare DO GBRS Input Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory where saved outputs will be stored.
-w
- Default: /<PATH>
- Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be on /fastscratch or another location with ample storage.
--num_generations
- Default: 100
- Comment: The number of generations for which to calculate transition probabilities.
--ensembl_build
- Default: 105
- Comment: The Ensembl build number used to extract gene names and locations with the R package biomaRt.
--emase_gene2transcript
- Default: /<PATH>
- Comment: A file containing all gene-to-transcript ID translations. NOTE: These IDs must not contain haplotype IDs. The file can be obtained from the prepare_emase workflow (emase.gene2transcripts.tsv).
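
As a rough illustration of how these parameters fit together, the sketch below assembles a `nextflow run` argument list. The flags and defaults are taken from the list above. The entry script name `main.nf`, the helper name `build_command`, and the example paths are assumptions. Site-specific options such as `-profile` are omitted.

```python
import shlex


def build_command(pubdir, workdir, emase_gene2transcript,
                  num_generations=100, ensembl_build=105,
                  entry_script="main.nf"):
    """Assemble the argument list for a prep_do_gbrs_inputs run."""
    return [
        "nextflow", "run", entry_script,  # entry_script is an assumed name
        "--workflow", "prep_do_gbrs_inputs",
        "--pubdir", str(pubdir),
        "-w", str(workdir),  # Nextflow work directory (can become large)
        "--num_generations", str(num_generations),
        "--ensembl_build", str(ensembl_build),
        "--emase_gene2transcript", str(emase_gene2transcript),
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations before running.
    cmd = build_command("/path/to/pubdir", "/path/to/workdir",
                        "/path/to/emase.gene2transcripts.tsv")
    print(shlex.join(cmd))
```

Building the command as a list keeps each value a single argument even if a path contains spaces. `shlex.join` is used only to print a copy-pasteable form.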

Pipeline Default Outputs

NOTE: `*` represents a wildcard, a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| `prep_do_gbrs_inputs_report.html` | Nextflow autogenerated report |
| `trace.txt` | Nextflow trace of processes |
| `tranprob.DO.G*.*.npz` | Transition probability files for each generation and sex |
| `gene_list_ensemblBuild_*.tsv` | Tab-delimited file of genes and gene positions in relevant build coordinates |
| `ref.gene_pos.ordered_ensBuild_*.npz` | List of genes and gene positions in relevant build coordinates in NPZ format |
| `ref.genome_grid.GRCm39.tsv` | Pseudo-marker grid file |
| `transprob_matrix_plots/*.pdf` | Validation plots showing the decay of probabilities among haplotypes by generation |
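
To check a finished run against these naming conventions, the wildcard patterns can be matched with Python's `fnmatch` module. This is a minimal sketch: the helper name `classify_output` and the short labels are hypothetical, and the transition probability pattern (`tranprob.DO.G*.*.npz`, read as generation followed by sex) is an assumed reading of the naming convention.

```python
from fnmatch import fnmatchcase

# Patterns follow the naming conventions listed above. The transition
# probability pattern is an assumption: generation, then sex, as wildcards.
OUTPUT_PATTERNS = {
    "prep_do_gbrs_inputs_report.html": "Nextflow report",
    "trace.txt": "Nextflow trace",
    "tranprob.DO.G*.*.npz": "transition probabilities",
    "gene_list_ensemblBuild_*.tsv": "gene list",
    "ref.gene_pos.ordered_ensBuild_*.npz": "gene positions (NPZ)",
    "ref.genome_grid.GRCm39.tsv": "pseudo-marker grid",
    "transprob_matrix_plots/*.pdf": "validation plot",
}


def classify_output(relative_path):
    """Return the label of the first matching pattern, or None if none match."""
    for pattern, label in OUTPUT_PATTERNS.items():
        # fnmatchcase is case-sensitive on every platform.
        if fnmatchcase(relative_path, pattern):
            return label
    return None
```

Paths are given relative to the `--pubdir` directory. A return value of `None` flags a file that does not follow any listed convention.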
