plantinformatics/pretzel-input-generator

Pipeline overview

pretzel-input-generator is a nextflow pipeline for generating input for pretzel from annotated and (mostly) contiguous genome assemblies. The pipeline requires approximately 70 CPU-hours to process wheat and related genomes, but because many processes can run independently, the wall-clock time is much shorter if suitable compute resources are available.

Default pipeline

Designed for Ensembl Plants and similarly formatted data.

Pipeline DAG: doc/dag.png

Quick start example using microsporidia data

Requires nextflow and either Singularity or Docker.

nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,singularity --max_cpus 2 --max_memory 2.GB 

This will pull and process the data sets specified in conf/microsporidia.config.

Input specification

A mix of local and remote files can be specified - compare conf/microsporidia.config where all the inputs are remote files and conf/test-data.config where the same input files are expected on a local file system.

There are several paths through the pipeline, executed depending on the input specification and the availability of various input file types, e.g.

  • genome assembly index file
  • protein sequences (required if pipeline is to generate aliases)
  • marker sequences
  • genome assembly fasta (required if pipeline is to place marker sequences on assemblies)

Different paths through the pipeline rely on partly different inputs:

  1. Generation of genome blocks requires a genome assembly index file - all we really need are the lengths of the pseudo-chromosomes, so a two-column .tsv file with chromosome names and their lengths will suffice (see the sketch after this list). Alternatively, if a genome assembly FASTA file is specified, the index will be generated by the pipeline.

  2. Placement of gene features on the generated genome blocks and generation of aliases between features require:

  • gene annotations (either GTF or GFF3)
  • matching protein sequences (presumably for representative isoform)

If GTF/GFF3 is not available, the protein sequence FASTA ID and description lines must be formatted to contain the relevant information, as in the following example:

>AT1G24405.1 pep chromosome:TAIR10:1:8654945:8655662:1 gene:AT1G24405

This follows how protein sequences are annotated on Ensembl Plants, although we do not currently use all the information in the description line, the complete version of which is:

>AT1G24405.1 pep chromosome:TAIR10:1:8654945:8655662:1 gene:AT1G24405 transcript:AT1G24405.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:F21J9.7 [Source:UniProtKB/TrEMBL;Acc:Q9FYM2]

  3. Marker placement requires the full reference FASTA file.
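
As a sketch for item 1 above, assuming samtools is available and assembly.fasta stands in for your assembly FASTA, the first two columns of a FASTA index already contain the chromosome names and their lengths:

samtools faidx assembly.fasta
cut -f1,2 assembly.fasta.fai > assembly_index.tsv   # keep only name and length columns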

Disparate Triticeae datasets

Wherever possible, the assembly files are used as input for the pipeline in their original form - as downloaded from their respective sources. This is, however, not always possible due to inconsistencies in formatting and varying levels of adherence to standards and conventions. We try to capture the additional steps needed to prepare these input data sets for inclusion in this pipeline in doc/format_local.md.

Dependencies

  • nextflow
  • Either of the following:
    • Singularity
    • Docker
    • Required software installed. In addition to standard Linux tools, these include:
      • FASTX-Toolkit
      • MMSeqs2 - if generating aliases
      • Minimap2 - if placing markers
      • jq
      • groovy interpreter
      • and possibly other tools - to be safe, stick to either Docker or Singularity

When using Singularity or Docker, the required containers are specified in conf/containers.conf and pulled by Nextflow as required. If Singularity fails when trying to pull multiple container images simultaneously, run

nextflow run pull_containers.nf -profile singularity 

which will pull the container images sequentially.

Execution

We provide several execution profiles; "locally" may mean a designated server or an interactive session on a cluster. By appending e.g. -revision v2.0 to your command you can run a specific release tag of the pipeline. When re-running the pipeline after errors or changes, use -resume to ensure that only the necessary processes are re-run.
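
For example, to run the quick-start profile at a specific release tag while reusing any cached results (tag and profile shown only as an illustration):

nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,singularity -revision v2.0 -resume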

Run locally with docker

nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,docker 

Run locally with singularity

nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,singularity 

Dispatch on a SLURM cluster with singularity

nextflow run plantinformatics/pretzel-input-generator \
-profile MICROSPORIDIA,slurm,singularity

Output

All JSON files generated by the pipeline are output to results/JSON.

  • For each of the input genome assemblies, these include:
    • *_genome.json - dataset (genome) definitions specifying outer coordinates of blocks (chromosomes)
    • *_annotation.json.gz - specifications of coordinates of features (genes) within blocks
  • In addition, for each (lexicographically ordered) pair of genome assemblies, the pipeline generates:
    • *_aliases.json.gz - links between features of the two genomes
  • *_{markers,transcripts,cds,genomic}.json.gz - placement of marker or other sequences as features within blocks

The output files (hopefully) conform to the requirements of the pretzel data structure.
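
To take a quick look at any of the generated (gzipped) JSON files, you can, for example, decompress on the fly and pretty-print with jq (assuming zcat and jq are available on your system):

zcat results/JSON/*_annotation.json.gz | jq '.' | head -n 40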

The results/flowinfo directory contains summaries of pipeline execution and results/downloads includes the files downloaded from Ensembl Plants.

results
├── downloads
├── flowinfo
├── summary
└── JSON

To upload the generated data to your instance of pretzel, follow these instructions.

Example

nextflow run pretzel-input-generator/main.nf -profile TRITICEAE,singularity,slurm -resume
N E X T F L O W  ~  version 20.01.0
Launching `pretzel-input-generator/main.nf` [intergalactic_saha] - revision: a6228acc01
[62/a89524] process > alignToGenome                         [100%] 9 of 9, cached: 8 ✔
[e7/98631c] process > generateFeaturesFromSeqAlignmentsJSON [100%] 9 of 9, cached: 8 ✔
[-        ] process > faidxAssembly                         -
[d5/8e5071] process > generateGenomeBlocksJSON              [100%] 13 of 13, cached: 12 ✔
[30/e61191] process > filterForRepresentativePeps           [100%] 8 of 8, cached: 8 ✔
[a9/7b88d4] process > convertReprFasta2EnsemblPep           [100%] 13 of 13, cached: 11 ✔
[9c/fad139] process > generateFeaturesJSON                  [100%] 21 of 21, cached: 19 ✔
[be/3f4613] process > pairProteins                          [100%] 231 of 231, cached: 157 ✔
[32/8c206e] process > generateAliasesJSON                   [100%] 231 of 231, cached: 157 ✔
[db/645b66] process > stats                                 [100%] 1 of 1 ✔
[9b/d97986] process > pack                                  [100%] 1 of 1 ✔
Completed at: 20-Mar-2020 22:47:45
Duration    : 19m 39s
CPU hours   : 67.9 (38.7% cached)
Succeeded   : 157
Cached      : 380