pretzel-input-generator is a Nextflow pipeline for generating input for pretzel from annotated and (mostly) contiguous genome assemblies.
The pipeline requires approximately 70 cpu-hours to process wheat and related genomes, but as many processes can run independently, the real run-time is much shorter if suitable compute resources are available.
Designed for EnsemblPlants and similarly formatted data.
Requires Nextflow and either Singularity or Docker.
nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,singularity --max_cpus 2 --max_memory 2.GB
This will pull and process the data sets specified in conf/microsporidia.config.
A mix of local and remote files can be specified: compare conf/microsporidia.config, where all the inputs are remote files, with conf/test-data.config, where the same input files are expected on a local file system.
There are several paths through the pipeline which are executed depending on input specification and availability of various input file types, e.g.
- genome assembly index file
- protein sequences (required if pipeline is to generate aliases)
- marker sequences
- genome assembly fasta (required if pipeline is to place marker sequences on assemblies)
Different paths through the pipeline rely on partly different inputs:
- Generation of genome blocks requires a genome assembly index file. Only the lengths of the pseudo-chromosomes are needed, so a two-column .tsv file with chromosome names and their lengths will suffice. If a genome assembly FASTA file is specified, the index will be generated by the pipeline.
- Placement of gene features on the generated genome blocks and generation of aliases between features requires
  - gene annotations (either GTF or GFF3)
  - matching protein sequences (presumably for the representative isoform)
  If GTF/GFF3 is not available, the ID and description lines of the protein sequence FASTA must be formatted to contain information as per the following example:
  >AT1G24405.1 pep chromosome:TAIR10:1:8654945:8655662:1 gene:AT1G24405
  This follows how protein sequences are annotated on Ensembl Plants, although not all the information in the description line is currently used. The complete version is:
  >AT1G24405.1 pep chromosome:TAIR10:1:8654945:8655662:1 gene:AT1G24405 transcript:AT1G24405.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:F21J9.7 [Source:UniProtKB/TrEMBL;Acc:Q9FYM2]
- Marker placement requires the full reference FASTA file.
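To check whether protein FASTA headers carry the fields shown in the example above before running the pipeline, something like the following Python sketch may help. It is not part of the pipeline; the function name `parse_pep_header` and the regular expression are illustrative assumptions based only on the example header, and the pipeline's own parsing may be stricter or more lenient.

```python
import re

# Assumed layout, taken from the example header in this README:
# >ID pep COORD_SYSTEM:ASSEMBLY:CHROMOSOME:START:END:STRAND gene:GENE_ID [...]
HEADER_RE = re.compile(
    r"^>(?P<id>\S+)\s+pep\s+"
    r"(?P<coord_system>[^:\s]+):(?P<assembly>[^:\s]+):(?P<chrom>[^:\s]+):"
    r"(?P<start>\d+):(?P<end>\d+):(?P<strand>-?1)"
    r"\s+gene:(?P<gene>\S+)"
)


def parse_pep_header(line):
    """Return a dict of header fields, or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("start", "end", "strand"):
        fields[key] = int(fields[key])
    return fields


example = ">AT1G24405.1 pep chromosome:TAIR10:1:8654945:8655662:1 gene:AT1G24405"
record = parse_pep_header(example)
print(record["id"], record["chrom"], record["start"], record["end"], record["gene"])
```

Any trailing fields (transcript, biotypes, description) are ignored by this sketch, so both the short and the complete header forms are accepted.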
Wherever possible the assembly files are used as input for the pipeline in their original form - as downloaded from their respective sources. This is however not always possible due to inconsistencies in formatting and varying levels of adherence to standards and conventions. We try to capture additional steps needed to prepare these input data sets for the inclusion in this pipeline in doc/format_local.md.
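As one example of such preparation, the two-column .tsv of chromosome names and lengths mentioned above could be derived from a FASTA file with a few lines of Python. This is a minimal sketch, not code from the pipeline: the function names are hypothetical, the toy sequences are made up, and it assumes the file should contain no header row.

```python
import io


def chromosome_lengths(fasta_handle):
    """Yield (name, length) for each record in a FASTA text stream."""
    name, length = None, 0
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The sequence name is the first whitespace-delimited token.
            tokens = line[1:].split()
            name, length = (tokens[0] if tokens else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def write_lengths_tsv(fasta_handle, out_handle):
    """Write one tab-separated 'name<TAB>length' line per sequence."""
    for name, length in chromosome_lengths(fasta_handle):
        out_handle.write(f"{name}\t{length}\n")


# Toy input standing in for a real assembly file.
toy_fasta = io.StringIO(">chr1 toy record\nACGTAC\nGT\n>chr2\nACG\n")
out = io.StringIO()
write_lengths_tsv(toy_fasta, out)
print(out.getvalue(), end="")
```

For a real assembly, the `io.StringIO` objects would be replaced with opened files. If the assembly FASTA itself is passed to the pipeline, this step is unnecessary because the index is generated automatically.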
- Nextflow
- One of the following:
  - Singularity
  - Docker
  - Required software installed. In addition to standard Linux tools, this includes:
    - FASTX-Toolkit
    - MMSeqs2 - if generating aliases
    - Minimap2 - if placing markers
    - jq
    - groovy interpreter
    - possibly other tools, so try to stick to either Docker or Singularity
When using Singularity or Docker, the required containers are specified in conf/containers.conf and pulled by Nextflow as required. If Singularity fails when trying to pull multiple container images simultaneously, run
nextflow run pull_containers.nf -profile singularity
which will pull the container images sequentially.
We provide several execution profiles; "locally" may mean a designated server or an interactive session on a cluster. By appending e.g. -revision v2.0 to your command you can specify a release tag to run a specific revision of the pipeline. When re-running the pipeline after errors or changes, use -resume to ensure only the necessary processes are re-run.
Run locally with docker
nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,docker
Run locally with singularity
nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,singularity
Dispatch on a SLURM cluster with singularity
nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,slurm,singularity
All JSON files generated by the pipeline are output to results/JSON.
- For each of the input genome assemblies, these include:
  - *_genome.json - dataset (genome) definitions specifying outer coordinates of blocks (chromosomes)
  - *_annotation.json.gz - specifications of coordinates of features (genes) within blocks
- In addition, for each (lexicographically ordered) pair of genome assemblies, the pipeline generates:
  - *_aliases.json.gz - links between features of the two genomes
- *_{markers,transcripts,cds,genomic}.json.gz - placement of marker or other sequences as features within blocks
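To get a feel for how many alias files to expect, the lexicographically ordered pairing described above can be sketched in Python as follows. This is an illustration only: the assembly names are placeholders, the function name is hypothetical, and the sketch assumes that only pairs of distinct assemblies are formed, which this README does not state explicitly.

```python
from itertools import combinations


def assembly_pairs(names):
    """Return all pairs of distinct assemblies, each pair in lexicographic order."""
    # Sorting first guarantees that within every pair the first name
    # sorts before the second, and that the pairs come out in order.
    return list(combinations(sorted(names), 2))


# Placeholder assembly names, not real data set identifiers.
assemblies = ["genomeC", "genomeA", "genomeB"]
pairs = assembly_pairs(assemblies)
for first, second in pairs:
    print(first, second)
print(len(pairs), "pairs")
```

Under this assumption, n assemblies give n(n-1)/2 pairs, so the number of pairwise comparisons grows quadratically with the number of input genomes.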
The output files are intended to conform to the requirements of the pretzel data structure.
The results/flowinfo directory contains summaries of pipeline execution, and results/downloads includes the files downloaded from Ensembl Plants.
results
├── downloads
├── flowinfo
├── summary
└── JSON
To upload the generated data to your instance of pretzel, follow these instructions.
nextflow run pretzel-input-generator/main.nf -profile TRITICEAE,singularity,slurm -resume
N E X T F L O W ~ version 20.01.0
Launching `pretzel-input-generator/main.nf` [intergalactic_saha] - revision: a6228acc01
[62/a89524] process > alignToGenome [100%] 9 of 9, cached: 8 ✔
[e7/98631c] process > generateFeaturesFromSeqAlignmentsJSON [100%] 9 of 9, cached: 8 ✔
[- ] process > faidxAssembly -
[d5/8e5071] process > generateGenomeBlocksJSON [100%] 13 of 13, cached: 12 ✔
[30/e61191] process > filterForRepresentativePeps [100%] 8 of 8, cached: 8 ✔
[a9/7b88d4] process > convertReprFasta2EnsemblPep [100%] 13 of 13, cached: 11 ✔
[9c/fad139] process > generateFeaturesJSON [100%] 21 of 21, cached: 19 ✔
[be/3f4613] process > pairProteins [100%] 231 of 231, cached: 157 ✔
[32/8c206e] process > generateAliasesJSON [100%] 231 of 231, cached: 157 ✔
[db/645b66] process > stats [100%] 1 of 1 ✔
[9b/d97986] process > pack [100%] 1 of 1 ✔
Completed at: 20-Mar-2020 22:47:45
Duration : 19m 39s
CPU hours : 67.9 (38.7% cached)
Succeeded : 157
Cached : 380