Automated assembly pipeline for PacBio long read datasets.
Be sure to accept the option to automatically modify your .bashrc, or manually
update it so that your installed miniconda packages are in your PATH. You may
also need to copy what the conda installer adds to your .bashrc into your
.bash_profile, or source your .bashrc from your .bash_profile.
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
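If you choose to source your .bashrc from your .bash_profile, the following is a minimal sketch of one way to do it. The helper name ensure_bashrc_sourced is hypothetical (it is not part of conda or auto-asm); it appends a guarded source line to the given file only if that exact line is not already present.

```shell
# Hypothetical helper, not part of conda or auto-asm.
# Appends a guarded "source ~/.bashrc" line to the given profile file,
# unless that exact line is already there.
ensure_bashrc_sourced() {
    profile="$1"
    line='if [ -f "$HOME/.bashrc" ]; then . "$HOME/.bashrc"; fi'
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if ! grep -qxF -- "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

Calling ensure_bashrc_sourced "$HOME/.bash_profile" once and then opening a new login shell should be enough; running it again leaves the file unchanged.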
Download and install SMRT Tools (smrtcmds) from PacBio. SMRT Link is not
required; the installer offers a SMRT Tools-only installation option. Do not
forget to put the path to the smrtcmds/bin folder in your configuration.
auto-asm has been tested with PacBio SMRT Tools 5.1.
Clone the repository and copy the sample configuration file to a new file. A sensible location would be the root of the desired auto-asm working/output directory.
Note: auto-asm was originally developed to assemble A. thaliana genomes and the default resource configuration (cores/memory) is set with this in mind. If you are assembling significantly smaller or larger genomes, you will have to adjust these values. See the Configuration Guide below.
$ git clone https://github.com/weigelworld/auto-asm
$ cd auto-asm
$ cp config.yaml.sample config.yaml
$ vim config.yaml
$ conda env create --name "auto-asm" --file envs/auto-asm.yaml
auto-asm comes with a sample bash script (auto-asm.sh.sample) for running the
pipeline on an SGE cluster. You may need to edit the cluster commands in the
sample script to work in your cluster environment. Also review cluster.json
and make sure that it is appropriate for your cluster setup, especially the
cluster output and error file naming patterns.
Copy the auto-asm.sh.sample script and edit the working directory and
config.yaml location as needed. Run the bash script to start the auto-asm
pipeline.
$ cp auto-asm.sh.sample auto-asm.sh
$ vim auto-asm.sh
$ chmod +x auto-asm.sh
$ source activate auto-asm
$ ./auto-asm.sh
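Before launching, it can help to confirm that the pieces described above are in place. The following is a minimal sketch of such a check, not something shipped with auto-asm: the function name preflight and its two arguments are hypothetical, and it assumes the conda environment was created with the name auto-asm as shown earlier. It relies on CONDA_DEFAULT_ENV, which conda sets to the name of the active environment.

```shell
# Hypothetical pre-flight check; not part of auto-asm.
# Usage: preflight CONFIG_PATH SMRTCMDS_BIN_DIR
preflight() {
    config="$1"
    smrt_bin="$2"
    status=0
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    if [ "${CONDA_DEFAULT_ENV:-}" != "auto-asm" ]; then
        echo "error: conda environment 'auto-asm' is not active" >&2
        status=1
    fi
    if [ ! -f "$config" ]; then
        echo "error: configuration file not found: $config" >&2
        status=1
    fi
    if [ ! -d "$smrt_bin" ]; then
        echo "error: smrtcmds bin directory not found: $smrt_bin" >&2
        status=1
    fi
    return "$status"
}
```

For example, preflight config.yaml /path/to/smrtcmds/bin && ./auto-asm.sh would only start the pipeline if all three checks pass; the smrtcmds path here is a placeholder for your own installation.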
If you copied the sample configuration file, the required rule parameters are already set to their defaults. You should only need to replace the assemblies and the smrtcmds_bin path, but you can also change the resources and other parameters to suit your use case. For those unfamiliar with YAML, the sample configuration is a good guide to how the configuration should be formatted.
For users that understand JSON schemas, the configuration schema is located at
schemas/config.schema.yaml. Configuration files are validated against this
schema at runtime.
The 'assemblies' property is a list of mappings between alphanumeric assembly
ids and their configuration objects. Each assembly configuration must, at
minimum, have genome_size, eukaryotic, and long_read_paths defined.
Assembly configuration options:
- name
  - optional
  - nicely-formatted assembly name (e.g. Col-0)
- reference
  - optional
  - path to reference genome in FASTA format
- genome_size
  - required
  - genome size in bp, with optional k, m, or g suffixes (e.g. 120m)
- eukaryotic
  - required
  - whether the organism is eukaryotic or not
- long_read_paths
  - required
  - list of paths to PacBio long read BAM files
- paired_short_read_paths
  - optional
  - list of pairs of paths to FASTQ files of a paired-end short read dataset
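To make the genome_size suffixes concrete, here is a small sketch that expands a value such as 120m into base pairs. It illustrates the notation only and is not how auto-asm parses the value internally. It assumes that k, m, and g are decimal multipliers (10^3, 10^6, 10^9), that either letter case is accepted, and that the numeric part is a whole number; none of this is confirmed by the configuration schema. The function name genome_size_to_bp is hypothetical.

```shell
# Hypothetical helper illustrating the genome_size notation.
# Assumes decimal multipliers and whole numbers only.
genome_size_to_bp() {
    size="$1"
    case "$size" in
        *[kK]) mult=1000 ;;
        *[mM]) mult=1000000 ;;
        *[gG]) mult=1000000000 ;;
        *)     mult=1 ;;
    esac
    # Strip a single trailing suffix letter, if present.
    num="${size%[kKmMgG]}"
    # Reject empty values, non-digits, and leading zeros
    # (which shell arithmetic would read as octal).
    case "$num" in
        ''|*[!0-9]*|0?*)
            echo "invalid genome size: $size" >&2
            return 1
            ;;
    esac
    echo $((num * mult))
}
```

Under these assumptions, genome_size_to_bp 120m prints 120000000, which is the value you would otherwise write out in full.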