Prepare DO GBRS Input Pipeline ReadMe
MikeWLloyd edited this page Apr 11, 2024 · 7 revisions
NOTE: This workflow is hard-coded to generate files in GRCm39 coordinates, for two reasons:
- Gene and marker positions must be converted from base-pair positions to centimorgan positions, and the R package used for that conversion supports only GRCm39.
- The pseudo-marker generation step is based on telomeric positions in GRCm39 coordinates.
• Step 1: Generate transition probabilities
• Step 2: Parse transition probabilities to NPZ format
• Step 3: Parse gene position file to NPZ format
• Step 4: Generate pseudo-marker grid
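Steps 2 and 3 write their results in NPZ format. Assuming these are standard NumPy `.npz` archives (a zip file holding one `.npy` member per array), the array names can be listed with the Python standard library alone. The sketch below is illustrative and not part of the pipeline; the function name `list_npz_arrays` is hypothetical.

```python
import zipfile


def list_npz_arrays(npz_path):
    """Return the sorted names of the arrays stored in an NPZ archive.

    Assumes a standard NumPy .npz file, in which each array is saved
    as a zip member named "<array name>.npy". Only the member names
    are read; no array data is loaded.
    """
    with zipfile.ZipFile(npz_path) as archive:
        return sorted(
            member[: -len(".npy")]
            for member in archive.namelist()
            if member.endswith(".npy")
        )
```

Calling `list_npz_arrays` on one of the NPZ outputs returns the array names it contains, which is a quick way to check that a file was written as expected before passing it on to GBRS.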
flowchart TD
i1((Generation\nCount))
p0[DO_TRANSITION_PROBABILITIES]
p1[PARSE_TRANSITION_PROBABILITIES_FEMALE]
p2[PARSE_TRANSITION_PROBABILITIES_MALE]
p3[PARSE_GENE_POSITONS]
p4[GENERATE_GRID_FILE]
o1([Female Transition Probabilities\nBy Generation]):::output
o2([Male Transition Probabilities\nBy Generation]):::output
o3([Gene Position NPZ File]):::output
o4([Pseudo Genotype Grid Reference]):::output
i1 --> p0
p0 --> p1
p0 --> p2
p0 --> p3
p1 --> o1
p2 --> o2
p3 --> o3
p4 --> o4
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
- `--pubdir`
  - Default: `/<PATH>`
  - Comment: The directory where saved outputs are stored.
- `-w`
  - Default: `/<PATH>`
  - Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be on /fastscratch or another location with ample storage.
- `--num_generations`
  - Default: `100`
  - Comment: The number of generations for which to calculate transition probabilities.
- `--ensembl_build`
  - Default: `105`
  - Comment: The Ensembl build number used to extract gene names and locations from the R package `biomaRt`.
- `--emase_gene2transcript`
  - Default: `/<PATH>`
  - Comment: A file containing all gene-to-transcript ID translations. NOTE: These IDs must not contain haplotype IDs. The file can be obtained from the `prepare_emase` workflow (emase.gene2transcripts.tsv).
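As a rough illustration of how the parameters above fit together on the command line, the following Python sketch assembles a `nextflow run` argument list. The entry point `main.nf` and the helper name `build_command` are hypothetical placeholders, and the real pipeline may require further options (for example, a workflow selector or profile) that are not listed on this page. The `/<PATH>` values are left as placeholders.

```python
import shlex


def build_command(params, work_dir, entry_point="main.nf"):
    """Assemble a `nextflow run` argument list from pipeline parameters.

    entry_point is a hypothetical placeholder; use the script or
    repository name given in the pipeline's own run instructions.
    """
    # -w is Nextflow's own work-directory option (single dash).
    command = ["nextflow", "run", entry_point, "-w", work_dir]
    # Pipeline parameters are passed with a double dash.
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


params = {
    "pubdir": "/<PATH>",
    "num_generations": 100,
    "ensembl_build": 105,
    "emase_gene2transcript": "/<PATH>",
}
command = build_command(params, work_dir="/<PATH>")
print(shlex.join(command))
```

Keeping the parameters in a dictionary makes it easy to record exactly which values were used for a given run alongside the outputs.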
NOTE: `*` represents a wildcard: a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
`prep_do_gbrs_inputs_report.html` | Nextflow autogenerated report |
`trace.txt` | Nextflow trace of processes |
`tranprob.DO.G*.*.npz` | Transition probability files for each generation and sex |
`gene_list_ensemblBuild_*.tsv` | Tab-delimited file of genes and gene positions in relevant build coordinates |
`ref.gene_pos.ordered_ensBuild_*.npz` | List of genes and gene positions in relevant build coordinates, in NPZ format |
`ref.genome_grid.GRCm39.tsv` | Pseudo-marker grid file |
`transprob_matrix_plots/*.pdf` | Validation plots showing the decay of probabilities among haplotypes by generation |
There are no optional outputs for this workflow.
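The naming conventions above can be used to check the contents of the output directory after a run. The sketch below is a minimal illustration: it matches a path relative to `--pubdir` against the wildcard patterns from the table, using shell-style matching from the Python standard library. The helper name `describe_output` and the shortened descriptions are hypothetical, and the sketch assumes that `*` in the table behaves like a shell wildcard.

```python
from fnmatch import fnmatchcase

# Patterns copied from the naming-convention table; descriptions shortened.
OUTPUT_PATTERNS = {
    "prep_do_gbrs_inputs_report.html": "Nextflow autogenerated report",
    "trace.txt": "Nextflow trace of processes",
    "tranprob.DO.G*.*.npz": "Transition probability file",
    "gene_list_ensemblBuild_*.tsv": "Gene list (TSV)",
    "ref.gene_pos.ordered_ensBuild_*.npz": "Gene positions (NPZ)",
    "ref.genome_grid.GRCm39.tsv": "Pseudo-marker grid file",
    "transprob_matrix_plots/*.pdf": "Validation plot",
}


def describe_output(relative_path):
    """Return the description of the first matching pattern, or None."""
    for pattern, description in OUTPUT_PATTERNS.items():
        # fnmatchcase is case-sensitive on every platform.
        if fnmatchcase(relative_path, pattern):
            return description
    return None
```

A path that returns `None` does not follow any of the listed conventions, which may point to a leftover file or a misnamed output.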