Yet Another Amplicon denoising Pipeline (YAAP) is a pipeline to analyse metabarcoding amplicon data. It performs QC, adapter removal, denoising, and ZOTU table construction.
This pipeline makes use of the following programs:
- cutadapt (https://cutadapt.readthedocs.io/en/stable/index.html)
- vsearch (https://github.com/torognes/vsearch)
- seqkit (https://bioinf.shenwei.me/seqkit/)
- pear (https://cme.h-its.org/exelixis/web/software/pear/)
- usearch (https://www.drive5.com/usearch/)
- fastqc (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
This repository contains the binaries (executables) of pear, seqkit, and usearch; however, vsearch and cutadapt have to be installed locally.
All the instructions here have only been tested on Linux systems. You might have to adjust parts of them for other operating systems.
I recently added a bash script to install all dependencies for you. It might
not work on all systems, but it might be worth giving it a try. This script assumes
that you have python3 in your environment and that it can be executed as python. If you
are on a Compute Canada cluster, you need to load the scipy-stack/2018b
module. So let's assume that you want me to walk you through the whole
installation. Then you need to:
- Clone this repository (you need to have git installed):
git clone https://github.com/CristescuLab/YAAP.git
- Get into the folder:
cd YAAP
- Because of the usearch licence, I am not allowed to distribute the binary. You will need to visit https://drive5.com/usearch/download.html, download the latest Linux version from the email you receive, and place it in this folder (the YAAP folder).
- Execute the install_dependencies.sh script:
bash install_dependencies.sh <name of usearch binary>
You will need to replace <name of usearch binary> with the actual name of the
binary you downloaded from usearch.
- Test that all dependencies work:
cutadapt -h
vsearch -h
pear -h
seqkit -h
usearch
If none of the above commands gives you an error, you are good to go! If you still have errors, try:
source ~/.bashrc
And then, try again.
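If you prefer a single check over running each command by hand, the test above can be sketched as a small loop. This is only an illustration: check_tools is a hypothetical helper that is not part of YAAP, and it only verifies that each program can be found through your PATH, not that it runs correctly.

```shell
# check_tools is a hypothetical helper (not shipped with YAAP).
# It reports whether each named program can be found through PATH
# and returns the number of programs that were not found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# The five programs tested in the step above.
check_tools cutadapt vsearch pear seqkit usearch \
    || echo "Some dependencies are not on your PATH yet"
```

A tool reported as missing usually means that it was not installed or that your PATH has not been reloaded yet.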
- cutadapt: Follow the instructions at https://cutadapt.readthedocs.io/en/stable/installation.html. If you are using a computer cluster, I advise you AGAINST installing cutadapt with conda.
- vsearch: Follow the instructions at https://github.com/torognes/vsearch
For the rest of the dependencies, you need to point your path to the executable directory or copy/link the executables to your bin directory. Alternatively, you can install them from scratch in your system.
If you don't have administrator privileges, go to the next section.
If you do, you can just copy (cp) or symbolically link (ln -s) the executables to your bin directory.
For example, let's say that you cloned YAAP to your home directory and let's
assume that your username is user1. If you are on a Linux system you can just
type:
cd /home/user1/YAAP/executables_linux_64/
sudo ln -s "$PWD"/* /usr/local/bin
Then pear, seqkit, and usearch are available to the pipeline.
If you don't have administrator privileges, you can tell the system to look
for the executables in the appropriate folder. As before, let's assume that
your username is user1 and that you cloned the repository in your home directory. You can
export the path by typing:
cd /home/user1/YAAP/executables_linux_64/
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc
source ~/.bashrc
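Note that running the echo line above more than once adds a duplicate entry to ~/.bashrc each time. If that bothers you, a guard like the following could be used instead. It is a minimal sketch: add_to_path is a hypothetical helper (not part of YAAP), and the directory is the one assumed in the example above.

```shell
# add_to_path is a hypothetical helper (not part of YAAP).
# It appends a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, do nothing
        *) PATH="$PATH:$1"; export PATH ;;    # append and export
    esac
}

# Assumes YAAP was cloned into your home directory, as in the example above.
add_to_path "$HOME/YAAP/executables_linux_64"
```

As written, this only changes the current shell session. To make it permanent, you would place the function and the call in your ~/.bashrc.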
YAAP has a single bash script called ASV_pipeline.sh. It requires 9 arguments:
- File with a list of fileset prefixes (one per line)
- Forward primer sequence
- Reverse primer sequence
- Primer name
- Prefix of the output
- Number of cpus to use
- Minimum amplicon length
- Maximum amplicon length
- Unoise minimum abundance parameter
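Since the arguments are positional, it is easy to swap two of them by accident. A pre-flight check along these lines might help. It is only a sketch: check_yaap_args is a hypothetical helper (not part of YAAP), and it assumes that the arguments are given in the order listed above and that the cpu, length, and abundance values are whole numbers.

```shell
# check_yaap_args is a hypothetical pre-flight check (not part of YAAP).
# It assumes the argument order of the list above.
check_yaap_args() {
    if [ "$#" -ne 9 ]; then
        echo "expected 9 arguments, got $#" >&2
        return 1
    fi
    file_list=$1; cpus=$6; min_len=$7; max_len=$8; min_abund=$9
    if [ ! -r "$file_list" ]; then
        echo "cannot read file list: $file_list" >&2
        return 1
    fi
    # Arguments 6 to 9 are assumed to be whole numbers.
    for n in "$cpus" "$min_len" "$max_len" "$min_abund"; do
        case $n in
            ''|*[!0-9]*) echo "not a whole number: $n" >&2; return 1 ;;
        esac
    done
    if [ "$min_len" -gt "$max_len" ]; then
        echo "minimum length is larger than maximum length" >&2
        return 1
    fi
    return 0
}

# Example call with placeholder values (AAA and TTT stand in for real primers).
check_yaap_args file_list.txt AAA TTT COI test 28 293 333 4 \
    || echo "Fix the arguments before running ASV_pipeline.sh"
```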
This pipeline assumes that your files are named fileset_prefix_R1.fastq.gz and fileset_prefix_R2.fastq.gz for all the samples. It also assumes that you have demultiplexed your samples.
As an example, let's assume that we have two samples named MC1 and MC2. Your
demultiplexed fastq files are MI.M03555_0320.001.N701N517.MC1_R1.fastq.gz,
MI.M03555_0320.001.N701N517.MC1_R2.fastq.gz,
MI.M03555_0320.001.N702N517.MC2_R1.fastq.gz, and
MI.M03555_0320.001.N702N517.MC2_R2.fastq.gz.
You will need to create a file that looks like this:
MI.M03555_0320.001.N701N517.MC1
MI.M03555_0320.001.N702N517.MC2
Alternatively, each line of the file can include the path to where the file set is located.
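With many samples, typing the prefixes by hand gets tedious. The list could be generated from the R1 file names, roughly as follows. This is a sketch under the naming assumption stated above (files ending in _R1.fastq.gz and _R2.fastq.gz); make_file_list is a hypothetical helper and not part of YAAP.

```shell
# make_file_list is a hypothetical helper (not part of YAAP).
# It prints one fileset prefix per line for every *_R1.fastq.gz file
# in the given directory that also has a matching *_R2.fastq.gz file.
make_file_list() {
    dir=${1:-.}
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # the glob matched nothing
        prefix=${r1%_R1.fastq.gz}                # strip the R1 suffix
        if [ ! -e "${prefix}_R2.fastq.gz" ]; then
            echo "warning: no R2 file for $prefix" >&2
            continue
        fi
        basename "$prefix"                       # prefix without the path
    done
    return 0
}

# Print the prefixes found in the current directory.
make_file_list .
```

To save the result, redirect the output to a file, for example make_file_list . > file_list.txt.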
Let's assume that you called this file file_list.txt. Let's also assume that
you are working with the Leray fragments flanked by GGWACWGGWTGAACWGTWTAYCCYC
as forward primer and TAAACTTCAGGGTGACCAAAAAATCA as reverse primer, on a
machine that has 28 cpus. You can run the pipeline as:
bash ~/YAAP/ASV_pipeline.sh file_list.txt \
GGWACWGGWTGAACWGTWTAYCCYC TAAACTTCAGGGTGACCAAAAAATCA \
COI test 28 293 333 4
This call of the pipeline will focus on reads that contain the primer
sequences, that are longer than 126 bp in either direction (forward and
reverse; the threshold is (min_len/2) - 20), and whose
assembly is between 293 bp and 333 bp (around the expected length of the Leray
amplicon). The denoising will filter out reads with abundances lower than 4.
It will create an output folder named after the primer (COI in
the example) and add test to the names of the output files. In the example, the output
folder will be called test_outputCOI.
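The numbers in this example can be reproduced with shell arithmetic. This is a sketch and not the pipeline's actual code: it assumes integer division, which matches the value of 126 quoted above, and the folder name pattern is inferred from this single example.

```shell
# Values from the example call above.
min_len=293
primer=COI
prefix=test

# Integer arithmetic: 293 / 2 = 146, and 146 - 20 = 126.
min_read_len=$(( min_len / 2 - 20 ))
echo "minimum read length: $min_read_len"

# Folder name pattern inferred from the example (an assumption).
echo "output folder: ${prefix}_output${primer}"
```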
If you find any bugs related to the pipeline, please open an issue in the GitHub repo and add some traceback (error) information.