#Oases
De novo transcriptome assembler based on the Velvet genome assembler core
##Requirements
Oases should function on any standard 64-bit Unix environment with a C compiler. A good amount of physical memory (12GB to start with; more is no luxury) is recommended.
##Installing Oases
> git clone --recursive https://github.com/dzerbino/oases
> cd oases
> make
> ./oases --version
0.2.09
##Compilation options
There are various compile time options that can be used:
Option | Default | Description |
---|---|---|
CATEGORIES | 2 | Maximum number of different DNA libraries |
MAXKMERLENGTH | 64 | Maximum k-mer value supported |
OPENMP | 0 | Enable OpenMP multithreading support |
BIGASSEMBLY | 0 | Support assemblies with more than 2.2 billion reads (as in Velvet) |
VBIGASSEMBLY | 0 | Support even larger assemblies (implies BIGASSEMBLY) |
LONGSEQUENCES | 0 | Support sequences longer than 32kb (as in Velvet) |
SINGLE_COV_CAT | 0 | Store a single coverage category per node to save memory |
BUNDLEDZLIB | 0 | Use the bundled zlib 1.2.3 instead of the system one |
You can apply them as in this example, which changes three options:
> make OPENMP=1 CATEGORIES=4 MAXKMERLENGTH=192
#Running Oases
If you are not familiar with using Velvet, read Velvet’s manual first, as many references below will point to that document.
##For impatient people
# hash reads for k=31 and k=23
> velveth directory 31,23 data/test_reads.fa
# assemble the k=31 reads
> velvetg directory_31 -read_trkg yes
> oases directory_31
# assemble the k=23 reads
> velvetg directory_23 -read_trkg yes
> oases directory_23
# merge the above assemblies with a nominal k=23
> velveth mergedAssembly 23 -long directory*/transcripts.fa
> velvetg mergedAssembly -read_trkg yes -conserveLong yes
> oases mergedAssembly -merge yes
Or use the python script oases_pipeline.py:
> python scripts/oases_pipeline.py -m 21 -M 23 -d 'data/test_reads.fa'
##For patient people
To run Oases, it is recommended to run an array of single-k assemblies, for different values of k (i.e., the hash length). These assemblies are then merged into a final assembly.
###Single-k assemblies
Each single-k assembly consists of a simple Velvet run, followed by an Oases run. As in Velvet, the hash length k is an odd integer. Velveth is run normally and velvetg is run with only one option:
> velveth directory_k k reads.fa
> velvetg directory_k -read_trkg yes
If you have strand-specific reads, then you have to type:
> velveth directory_k k -strand_specific reads.fa
And if they are also FASTQ files compressed with gzip, use the regular Velvet options to change the format:
> velveth directory_k k -strand_specific -short -fastq.gz reads.fastq.gz
Oases is then run on that directory:
> oases directory_k
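Putting these steps together, here is a minimal sketch that runs such an array of single-k assemblies as a shell loop (the range of odd k values and the file name reads.fa are illustrative):
> for k in 21 23 25 27 29 31
> do
>   velveth directory_$k $k reads.fa
>   velvetg directory_$k -read_trkg yes
>   oases directory_$k
> done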
##Oases options
In case of doubt, you can always print a help message by typing (interchangeably):
> oases
> oases --help
If using paired-end reads, Oases needs to know the expected insert length (it cannot be estimated reliably automatically). The syntax to provide insert lengths is identical to that used in Velvet:
> oases directory -ins_length 500
Don't forget to also ensure the reads were hashed into a paired category at the earlier velveth stage!
> velveth directory_k k -shortPaired -fastq.gz -separate R1.fq.gz R2.fq.gz
Just as in Velvet, low-coverage contigs are removed from the assembly. Because genes span the whole spectrum of expression levels, and therefore of coverage depths, the coverage cutoff is not based on an expected or median coverage. Instead, it is set to remove all the contigs whose coverage is so low that de novo assembly would be impossible anyway (by default it is set at 3x). If you wish to set it to another value, simply use the Velvet-like option:
> oases directory -cov_cutoff 5
To allow more efficient error removal at high coverage, Oases integrates a dynamic correction method: if at a given node, an edge’s coverage represents less than a given percentage of the sum of coverages of edges coming out of that node, it is removed.
By default this percentage is 10% (0.1), but you can set it manually:
> oases directory -edgeFractionCutoff 0.01
By default, the distance estimate between two long contigs is discarded as unreliable if it is supported by fewer than a given number (by default 4) of read pairs or bridging reads. If you wish to change this cutoff:
> oases directory -min_pair_count 5
By setting the minimum transfrag length (by default 100bp), the user can control what is being output in the results files:
> oases directory -min_trans_lgth 200
###Merging single-k assemblies
After running the previous process for different values of k, it is necessary to merge the results of all these assemblies (contained in the transcripts.fa files) into a single, non-redundant assembly. To realize this merge, it is necessary to choose a value of K. Experiments have shown that K=27 works nicely for most organisms. Higher values of K should be used if problems with misassemblies are observed.
Assume we want to create a merged assembly in a folder called MergedAssembly:
> velveth MergedAssembly/ K -long directory*/transcripts.fa
> velvetg MergedAssembly/ -read_trkg yes -conserveLong yes
> oases MergedAssembly/ -merge yes
Note that the transcripts.fa files need to be given as the -long category.
###Using the oases_pipeline.py script
With version 0.2 of Oases comes a new python script that runs the individual single-k assemblies and the new merge module. We highly recommend using this script for your analyses, but please read the previous sections for the Oases options that you need to supply to the script. As the script also runs Velvet, please consult the Velvet manual for more insight. First, note the options of the script below:
> python scripts/oases_pipeline.py
Options:
-h, --help show this help message and exit
-d FILES, --data=FILES
Velveth file descriptors
-p OPTIONS, --options=OPTIONS
Oases options that are passed to the command line,
e.g., -cov_cutoff 5 -ins_length 300
-m KMIN, --kmin=KMIN Minimum k
-M KMAX, --kmax=KMAX Maximum k
-s KSTEP, --kstep=KSTEP
Steps in k
-g KMERGE, --merge=KMERGE
Merge k
-o NAME, --output=NAME
Output directory prefix
-r, --mergeOnly Only do the merge
-c, --clean Clean temp files
Note that the script checks that at least Oases 0.2.01 and Velvet 1.1.7 are installed on your path; otherwise it will not work.
We will illustrate the usage of the script with the two most common scenarios: single-end and paired-end reads.
First, assume we want to compute a merged assembly with k values from 21 to 35 with Oases on a single-end data set that is strand-specific, and retain transcripts of minimum length 100. We want to save the assemblies in folders that start with singleEnd:
> python scripts/oases_pipeline.py -m 21 -M 35 -o singleEnd \
    -d ' -strand_specific data/test_reads.fa ' \
    -p ' -min_trans_lgth 100 '
The script creates a folder named singleEnd_k for each single-k assembly and one folder named singleEndMerged that contains the merged transcripts from all single-k assemblies in the transcripts.fa file.
Now assume we have paired-end reads with insert length 300 and we want to compute a merged assembly with k values from 17 to 40. We want to save the assemblies in folders that start with pairedEnd:
> python scripts/oases_pipeline.py -m 17 -M 40 -o pairedEnd \
    -d ' -shortPaired data/test_reads.fa ' \
    -p ' -ins_length 300 '
Now say we found that k=17 was a bit too low and we want to look at a different merged assembly range. Instead of redoing all the time-consuming assemblies, we can re-run just the merge module on a different range, from 21 to 40, like this:
> python scripts/oases_pipeline.py -m 21 -M 40 -r -o pairedEnd
Note that when you re-run the merged assembly, it is essential for the script to function properly that you give the exact same output prefix with the -o parameter.
##Output files
Oases produces two output files. The predicted transcript sequences can be found in transcripts.fa, and the file contig-ordering.txt explains the composition of these transcripts, see below.
- transcripts.fa is a FASTA file containing the transcripts imputed directly from trivial clusters of contigs (loci with less than two transcripts and Confidence Values = 1) and the highly expressed transcripts imputed by dynamic programming (loci with more than 2 transcripts and Confidence Values < 1).
- contig-ordering.txt is a hybrid file which describes the contigs contained in each locus in FASTA format, interlaced with one-line summaries of the transcripts generated by dynamic programming. Each line is a string of atoms defined as: contig_id:cumulative_length-(distance_to_next_contig)->next_item
Here the cumulative length is the total length of the transcript assembly from its 5’ end to the 3’ end of that contig. This allows you to locate the contig sequence within the transcript sequence.
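For illustration, a hypothetical transcript line (contig IDs, lengths and distances invented) could look like:
3:120-(45)->7:312-(0)->12:480
i.e., the transcript starts with contig 3 (cumulative length 120bp at its 3' end), which is followed after a 45bp gap by contig 7, which in turn directly abuts contig 12, for a total transcript length of 480bp.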
##FAQ
###Velvet and Oases use too much memory, what can I do?
Depending on the complexity of the dataset, Velvet and Oases may start to use a large amount of memory. In general, the assembly graphs for transcriptome data are smaller than those for genome data. However, because coverage is not even roughly uniform, how much memory is used depends on the sample being sequenced. Below we give examples of pre-processing steps that decrease memory consumption, sorted in order of importance.
In our experience, if insert lengths are so short that paired reads overlap on their 3' ends, i.e., if the insert length is less than twice the read length, this prevents the early filtering of low-coverage reads. This means that even small datasets may require huge amounts of memory. A way to avoid the problem is to join paired-end reads that overlap. One tool to do this is PEAR.
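A minimal PEAR invocation could look like this (file names are placeholders; see the PEAR documentation for its options):
> pear -f reads_R1.fastq -r reads_R2.fastq -o merged
The joined reads are written to merged.assembled.fastq and can then be given to velveth as a -short dataset.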
Reads with many errors create new nodes in the de Bruijn graph, all of which are later discarded by the error removal steps. Removing the errors in the reads a priori is a good way to decrease memory consumption. A very common approach is to remove bad-quality bases from the ends of reads, or to remove entire reads; numerous packages exist to do that, but they differ in functionality and in their definition of bad quality. One tool to do this is Trimmomatic.
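As a sketch, a paired-end Trimmomatic run with quality clipping might look like this (jar path, version and thresholds are illustrative):
> java -jar trimmomatic-0.39.jar PE reads_R1.fq.gz reads_R2.fq.gz \
    R1_paired.fq.gz R1_unpaired.fq.gz R2_paired.fq.gz R2_unpaired.fq.gz \
    SLIDINGWINDOW:4:20 MINLEN:36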
Note that when you use paired-end sequences and reads are discarded from the file, you need to make sure that the fasta/fastq file that you give to velveth contains the paired-end reads in correct pairing. Also, if the read clipper retains single-end reads from a pair, it is advisable to give them as a second dataset of type -short to avoid data loss, as in the sketch below.
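Assuming the clipped outputs from the Trimmomatic sketch above (file names are placeholders), the surviving pairs and the orphaned single reads can be hashed together like this:
> velveth directory_k k -shortPaired -fastq.gz -separate R1_paired.fq.gz R2_paired.fq.gz -short -fastq.gz R1_unpaired.fq.gz R2_unpaired.fq.gz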
Depending on your sequencing experiment, some of the reads may contain adapters that were used, e.g., during library preparation. These adapters should be trimmed. Tools like Trimmomatic can also do this.
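With Trimmomatic, for instance, the ILLUMINACLIP step removes adapters (the adapter FASTA file is a placeholder to be replaced with the adapters of your library):
> java -jar trimmomatic-0.39.jar SE reads.fq.gz trimmed.fq.gz ILLUMINACLIP:adapters.fa:2:30:10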
Especially for paired-end datasets, it can happen that the exact same read sequence is found a large number of times. Removing such duplicate reads also reduces memory consumption; one tool to do this is Picard.
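Note that Picard's MarkDuplicates operates on SAM/BAM input, so the reads must first be aligned (for instance against a preliminary assembly); a sketch with placeholder file names:
> java -jar picard.jar MarkDuplicates I=aligned.bam O=dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true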
###How do I get the expression levels of the assembled transcripts?
In general, you need to align the reads to the transcripts with your favorite aligner, one that allows for multi-mappings! Although few results exist about the difficulties and problems of quantifying the expression levels of de novo transcripts, here are a few pointers to specialized tools:
- eXpress - http://bio.math.berkeley.edu/eXpress/ (command line, Windows support)
- RSEM - http://deweylab.biostat.wisc.edu/rsem/ (command line)
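As a sketch of such a workflow (index and file names are placeholders), one can report multiple alignments per read with bowtie2 and pass them to eXpress:
> bowtie2-build transcripts.fa transcripts_idx
> bowtie2 -a -x transcripts_idx -U reads.fq -S hits.sam
> express transcripts.fa hits.sam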
###How do I find the sense strand of the assembled transcripts?
All the transcripts reported in a locus should have the same directionality, but that is only true if there is no contamination. Standard approaches are to use blastp and similar tools. Alternatively, use gene prediction tools like GlimmerHMM, SNAP or AUGUSTUS.
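As an illustration (the database name is a placeholder), the transcripts, being nucleotide sequences, can be compared against a protein database with blastx; the frame of the best hit then indicates the sense strand:
> blastx -query transcripts.fa -db uniprot_sprot -outfmt 6 -max_target_seqs 1 -out hits.tsv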
##Mailing list
For questions/requests/etc., you can subscribe to the users' mailing list at the oases-users listserver.
##Paper
Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012 Apr 15;28(8):1086-92. PubMed
##Authors
- Daniel Zerbino [email protected]
- Marcel Schulz [email protected]