Tapioca

Tapioca is a post-processing pipeline for Illumina Casava 1.8 output from Genome Analyzer/HiSeq instruments. Main features:

contaminant filtering
fastq statistical summary
collating/binning of casava chunks

(Why "tapioca"? "In Brazil, the plant (cassava) is named 'mandioca', while its starch is called 'tapioca'." https://en.wikipedia.org/wiki/Tapioca )

Depends on

Casava 1.8 http://support.illumina.com/sequencing/sequencing_software/casava.ilmn
Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/
fqutils https://github.com/crowja/fqutils
tpipe http://www.eurogaran.com/index.php/es/component/remository/tpipe/ (also see Unix Power Tools http://shop.oreilly.com/product/9780596003302.do)
Make, gzip, and other standard Linux utilities
Perl5 http://perl.org
Perl modules (many of which are already in your Perl distribution): Bio::SeqReader::Fastq; Cwd; File::Basename; File::Path; File::Spec; File::Which; Getopt::Long; IO::File; IO::Uncompress::AnyUncompress; IO::Uncompress::Gunzip; Readonly; Term::ANSIColor; XML::Simple
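
A quick way to confirm the dependencies above before starting is to probe PATH and the Perl include path. The sketch below is only an illustration: the helper names check_tools and check_perl_modules are made up here, the command name tpipe is assumed to match the tool name, and the fqutils executables are not checked because their command names are not listed in this README.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tapioca): report missing dependencies.

# Print the name of every command that is not found on PATH.
check_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print the name of every Perl module that fails to load.
check_perl_modules() {
    for mod in "$@"; do
        perl -M"$mod" -e 1 >/dev/null 2>&1 || echo "$mod"
    done
}

# "tpipe" as a command name is an assumption; adjust to your installation.
missing_tools=$(check_tools configureBclToFastq.pl bowtie2 tpipe make gzip perl)
missing_mods=$(check_perl_modules Bio::SeqReader::Fastq Cwd File::Basename \
    File::Path File::Spec File::Which Getopt::Long IO::File \
    IO::Uncompress::AnyUncompress IO::Uncompress::Gunzip Readonly \
    Term::ANSIColor XML::Simple)

echo "missing tools: ${missing_tools:-none}"
echo "missing Perl modules: ${missing_mods:-none}"
```

Anything reported as missing can then be installed through your package manager or CPAN before running the walkthrough.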

Setup

In addition to the software dependencies, you'll need

A directory containing your Illumina sequencing instrument output.
Two Bowtie2 indexes for contaminant filtering. We created one called 'phix' and one called 'other' for adapters and primers.
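
Bowtie2 builds its indexes from FASTA files with bowtie2-build, which takes the reference FASTA and an index basename. A minimal sketch follows; the FASTA file names (phix.fa, adapters_primers.fa) and the helper build_contam_index are placeholders invented for illustration, while the index basenames are the ones used later in the walkthrough.

```shell
#!/bin/sh
# Hypothetical helper: build one Bowtie2 index from a FASTA file.
# Usage: build_contam_index <reference.fa> <index_basename>
build_contam_index() {
    fasta=$1
    base=$2
    if [ ! -s "$fasta" ]; then
        echo "FASTA not found or empty: $fasta" >&2
        return 1
    fi
    if ! command -v bowtie2-build >/dev/null 2>&1; then
        echo "bowtie2-build is not on PATH" >&2
        return 1
    fi
    # Create the directory that will hold the index files.
    mkdir -p "$(dirname "$base")"
    bowtie2-build "$fasta" "$base"
}

# Example calls (all paths are placeholders):
# build_contam_index phix.fa /your/contam_libs/tapioca_phix_contam
# build_contam_index adapters_primers.fa /your/contam_libs/tapioca_other_contam
```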

Walkthrough

Make a new directory for the Casava and Tapioca output. Don't work in the instrument's output directory.

mkdir tap-work
cd tap-work

Create the file samplesheet.csv, either with Illumina's Experiment Manager software or with a script that pulls data from your internal LIMS. The samplesheet.csv format is described in Illumina's documentation.
Run Casava 1.8 to generate an Unaligned/ directory and makefile. Example:

configureBclToFastq.pl \
 --input-dir /your/instrument/output/run_flowcell/Data/Intensities/BaseCalls/ \
 --output-dir ./Unaligned \
 --sample-sheet samplesheet.csv \
 --with-failed-reads

Note: using --with-failed-reads is recommended; Tapioca will later separate the reads that failed the chastity filter into their own file. See the Casava user's guide for other options, e.g. --use-bases-mask.
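
For reference, the samplesheet.csv passed to configureBclToFastq.pl is a CSV file with one header row followed by one row per sample and lane. The sketch below writes a one-sample sheet into a scratch directory and checks that every row has as many fields as the header. The column names are given as recalled from the Casava 1.8 user's guide and should be verified against Illumina's documentation; every value in the data row is made up.

```shell
#!/bin/sh
# Write a toy sample sheet into a scratch directory (all values are invented).
workdir=$(mktemp -d)
cat > "$workdir/samplesheet.csv" <<'EOF'
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
FC0001XX,1,sampleA,unknown,ACAGTG,hypothetical sample,N,NA,operator1,ProjectA
EOF

# Report the line number of any row whose field count differs from the header.
bad_rows=$(awk -F, 'NR == 1 { n = NF; next } NF != n { print NR }' \
    "$workdir/samplesheet.csv")

if [ -z "$bad_rows" ]; then
    echo "samplesheet.csv: field counts are consistent"
else
    echo "samplesheet.csv: inconsistent rows: $bad_rows" >&2
fi
```

A field-count check like this catches stray or missing commas early, before Casava is configured.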

Start Casava by changing into Unaligned and running make.

cd Unaligned
make 
# or make -j [cores]
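
When Casava was configured with --with-failed-reads, reads that fail the chastity filter stay in the FASTQ output and are marked in the header line: the second field of the header comment is Y for a filtered (failed) read and N otherwise. The sketch below counts both kinds. count_chastity is a made-up helper name and the two records are toy data, not Tapioca output.

```shell
#!/bin/sh
# Hypothetical helper: count reads by chastity-filter flag in Casava 1.8 FASTQ.
# Header lines look like: @<instrument>:...:<y> <read>:<Y|N>:<control>:<index>
count_chastity() {
    awk 'NR % 4 == 1 {
             split($2, flag, ":")
             if (flag[2] == "Y") failed++; else passed++
         }
         END { printf "%d passed, %d failed\n", passed, failed }' "$@"
}

# Two toy records: the first is flagged Y (failed), the second N (passed).
demo=$(mktemp)
cat > "$demo" <<'EOF'
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
ACGT
+
IIII
@EAS139:136:FC706VJ:2:2104:15344:197394 1:N:18:ATCACG
TTGA
+
IIII
EOF

count_chastity "$demo"   # prints: 1 passed, 1 failed
rm -f "$demo"
```

For a compressed chunk, the same helper can read from a pipe, e.g. gzip -dc on the file piped into count_chastity.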

After the Casava make finishes, configure Tapioca by running tap_configure_postprocessing. The last parameter is the Unaligned directory created by Casava 1.8. Like Casava, Tapioca uses Make for dependency tracking and job parallelism, so the configuration script outputs a makefile.

cd ..
export PATH=/your/tapioca/bin:$PATH
tap_configure_postprocessing \
 --contam-phix-index /your/contam_libs/tapioca_phix_contam \
 --contam-phix-pct 80 \
 --contam-other-index /your/contam_libs/tapioca_other_contam \
 --contam-other-pct 20 \
 --deployed /your/deployed/dir \
 ./Unaligned
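
The two --contam-*-index arguments appear to be Bowtie2 index basenames rather than file names, so a typo in a path is easy to miss. Assuming the indexes were built with bowtie2-build, which writes files such as <basename>.1.bt2 (or <basename>.1.bt2l for large indexes), a small check like the following can be run first. The helper name have_bowtie2_index is invented for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Bowtie2 index exists for the given basename.
have_bowtie2_index() {
    [ -e "$1.1.bt2" ] || [ -e "$1.1.bt2l" ]
}

# Placeholder basenames, matching the example command above.
for base in /your/contam_libs/tapioca_phix_contam \
            /your/contam_libs/tapioca_other_contam; do
    if have_bowtie2_index "$base"; then
        echo "ok: $base"
    else
        echo "no Bowtie2 index found at: $base" >&2
    fi
done
```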

The makefile has now been created. First run the precheck target; it sanity-checks the Casava run and prints warnings if it notices anything obviously wrong.

make precheck

Making the 'all' target will perform the contaminant filtering and summary reporting. Technically it is not 'all' because the deploy step is a separate target.

make all
# or make -j [cores] all

make -j 16 will run up to 16 jobs in parallel, using up to 16 cores. Alternatively, the qmake script could be submitted to an SGE cluster if more parallelism is required. Qmake job submission has not been tested.
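
To pick a job count without hard-coding it, the number of online processors can be queried from the system. This sketch uses getconf, which is available on typical Linux systems, and only prints the make command rather than running it.

```shell
#!/bin/sh
# Ask the system how many processors are online; fall back to 1 on failure.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "make -j $cores all"
```

On a shared machine it may be kinder to use fewer jobs than the reported count.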

Now check the results as necessary in the various ./Project directories. Run make deploy when ready.

make deploy

The deploy target collates all the chunks of data from the casava output into the --deployed directory. You could add more targets to the makefile to perform additional processing after the deploy is finished.
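
An alternative to editing the generated makefile is to chain an extra step after make deploy in the shell. As an illustration only, the sketch below writes an MD5 manifest for everything under the deployed directory. write_manifest and the manifest file name MANIFEST.md5 are invented here, and md5sum is assumed to be installed (it is part of GNU coreutils).

```shell
#!/bin/sh
# Hypothetical post-deploy step: record a checksum for every deployed file.
# Usage: write_manifest <deployed_dir>
write_manifest() {
    deployed=$1
    tmp=$(mktemp) || return 1
    # Checksum all files (except an older manifest), sorted by file name.
    if (cd "$deployed" &&
        find . -type f ! -name MANIFEST.md5 -exec md5sum {} + | sort -k 2) > "$tmp"
    then
        mv "$tmp" "$deployed/MANIFEST.md5"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Typical use, once the results have been checked (path is a placeholder):
# make deploy && write_manifest /your/deployed/dir
```

Keeping the step outside the makefile means it survives a re-run of tap_configure_postprocessing, at the cost of losing Make's dependency tracking for that step.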

Output

Look in the Deployed directory. It should be fairly self-explanatory how things are organized by subdirectory. Sorry this is not better documented.

Cleanup

There is no 'make clean' target. Be aware that the intermediate Project directories created by Tapioca contain uncompressed fastq files and so should not be left on disk long term. Delete the directories yourself.
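
Since there is no clean target, removal has to be scripted or done by hand. The sketch below defaults to a dry run and deletes only when asked explicitly. It assumes the intermediate directories sit directly in the working directory and have names starting with Project, which is inferred from the walkthrough rather than confirmed; check the dry-run listing before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: list or remove intermediate Project* directories.
# Usage: cleanup_projects <work_dir> [delete]
cleanup_projects() {
    workdir=$1
    mode=${2:-dry-run}
    for dir in "$workdir"/Project*; do
        # Skip the unexpanded glob and anything that is not a directory.
        [ -d "$dir" ] || continue
        if [ "$mode" = "delete" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "would remove $dir"
        fi
    done
}

# Example (run from the directory created at the start of the walkthrough):
# cleanup_projects .          # dry run, only lists directories
# cleanup_projects . delete   # actually removes them
```

Only top-level directories are matched, so the Casava output under Unaligned is left alone.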

Authors

John Crow https://github.com/crowja , Alex Rice ([email protected])

License

# Tapioca
# Copyright (C) 2013 National Center for Genome Resources - http://ncgr.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
