Tapioca is a pipeline for Illumina Casava 1.8 genome analyzer/hiseq data.

ncgr/tapioca

Tapioca

Tapioca is a pipeline for Illumina Casava 1.8 genome analyzer/hiseq data. Main features:

  • contaminant filtering
  • fastq statistical summary
  • collating/binning of casava chunks

(Why "tapioca"? "In Brazil, the plant (cassava) is named 'mandioca', while its starch is called 'tapioca'." Source: https://en.wikipedia.org/wiki/Tapioca)

Depends on

Setup

In addition to the software dependencies, you'll need

  • A directory containing your Illumina sequencing instrument output.
  • Two bowtie libraries for contaminant filtering. We created one called 'phix' and one called 'other'; the latter holds adapters and primers.
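The README does not say how the two contaminant libraries were built. A minimal sketch, assuming Bowtie 1 and its bowtie-build tool (Bowtie 2 would use bowtie2-build instead): the FASTA file names phix.fa and adapters_primers.fa are hypothetical, and the index names simply mirror the paths used in the walkthrough below. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: phix.fa and adapters_primers.fa are hypothetical FASTA names.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
CONTAM_DIR="${CONTAM_DIR:-/your/contam_libs}"

build_index() {
    # $1 = input FASTA, $2 = index name (prefix) created inside CONTAM_DIR
    if [ "$DRY_RUN" = "1" ]; then
        echo "bowtie-build $1 $CONTAM_DIR/$2"
    else
        bowtie-build "$1" "$CONTAM_DIR/$2"
    fi
}

build_index phix.fa tapioca_phix_contam
build_index adapters_primers.fa tapioca_other_contam
```

Set DRY_RUN=0 once the FASTA files and the output directory exist.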

Walkthrough

  • Make a new directory for the Casava & Tapioca output. Don't work in the instrument's output directory.
mkdir tap-work
cd tap-work
  • Create the file samplesheet.csv, either with Illumina's Experiment Manager software or with a script that pulls the data from your internal LIMS. The samplesheet.csv format is described in Illumina's documentation.
  • Run Casava 1.8 to generate an Unaligned/ directory and makefile. Example:
configureBclToFastq.pl \
 --input-dir /your/instrument/output/run_flowcell/Data/Intensities/BaseCalls/ \
 --output-dir ./Unaligned \
 --sample-sheet samplesheet.csv \
 --with-failed-reads

Note: the --with-failed-reads option is recommended; Tapioca will then separate the reads that failed the chastity filter into their own file. See the Casava user's guide for other options, such as --use-bases-mask.
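To see how many failed reads a file carries, recall that in Casava 1.8 FASTQ headers the second whitespace-separated field has the form read:is_filtered:control:index, where is_filtered is Y for reads that failed the filter. The sketch below only counts such reads; it is an illustration, not Tapioca's own separation code, and the sample records are made up.

```shell
#!/bin/sh
count_chastity() {
    # Reads FASTQ on stdin; prints "passed=<n> failed=<n>".
    awk 'NR % 4 == 1 {
             split($2, flags, ":")      # $2 looks like 1:N:0:ACGTAC
             if (flags[2] == "Y") failed++; else passed++
         }
         END { printf "passed=%d failed=%d\n", passed, failed }'
}

# Two made-up records: the first passed the filter, the second failed it.
printf '%s\n' \
    '@HWI-ST1:1:FCX:1:1101:1000:2000 1:N:0:ACGTAC' 'ACGT' '+' 'IIII' \
    '@HWI-ST1:1:FCX:1:1101:1001:2001 1:Y:0:ACGTAC' 'TTGA' '+' '####' |
    count_chastity
# prints: passed=1 failed=1
```

On real data the input would come from something like gzip -dc on one of the fastq.gz files under Unaligned/.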

  • Start Casava by changing into the Unaligned directory and running make.
cd Unaligned
make 
# or make -j [cores]
  • After the Casava make finishes, configure Tapioca by running tap_configure_postprocessing. The last parameter is the Unaligned directory created by Casava 1.8. Like Casava, Tapioca uses Make for dependency tracking and job parallelism, so the configuration script outputs a makefile.
cd ..
export PATH=/your/tapioca/bin:$PATH
tap_configure_postprocessing \
 --contam-phix-index /your/contam_libs/tapioca_phix_contam \
 --contam-phix-pct 80 \
 --contam-other-index /your/contam_libs/tapioca_other_contam \
 --contam-other-pct 20 \
 --deployed /your/deployed/dir \
 ./Unaligned
  • The makefile now exists. First run the precheck target, which sanity-checks the Casava run and prints warnings if it notices anything obviously wrong.
make precheck
  • Making the 'all' target performs the contaminant filtering and summary reporting. Strictly speaking it is not 'all', because the deploy step is a separate target.
make all
# or make -j [cores] all

make -j 16 will use 16 cores. Alternatively, the qmake script could be submitted to an SGE cluster if more parallelism is required. Qmake job submission has not been tested.
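If you would rather not hard-code the job count, one option is to detect the number of available cores first. This is a small sketch, not part of Tapioca; detect_cores is a hypothetical helper name, and the script only prints the make command so that it is safe to run outside a configured working directory.

```shell
#!/bin/sh
detect_cores() {
    # nproc comes with GNU coreutils; getconf is the fallback; default to 1.
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

cores=$(detect_cores)
# Printed rather than executed; run the command yourself from the
# directory that holds the makefile Tapioca generated.
echo "make precheck && make -j $cores all"
```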

  • Check the results as necessary in the various ./Project directories, then run make deploy when ready.
make deploy

The deploy target collates all the chunks of data from the casava output into the --deployed directory. You could add more targets to the makefile to perform additional processing after the deploy is finished.
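As an example of post-deploy processing, the sketch below writes a checksum manifest for everything under the deployed directory. It is run from the shell after make deploy rather than added as a makefile target, it assumes GNU md5sum is installed, and both the checksum_deployed helper and the MD5SUMS file name are hypothetical choices, not something Tapioca provides.

```shell
#!/bin/sh
checksum_deployed() {
    # $1 = the directory that was passed to --deployed
    # Writes $1/MD5SUMS covering every regular file except the manifest itself.
    ( cd "$1" && find . -type f ! -name MD5SUMS -exec md5sum {} + > MD5SUMS )
}

# Usage, once the deploy has finished:
#   make deploy && checksum_deployed /your/deployed/dir
```

Recipients can later verify the files by running md5sum -c MD5SUMS inside that directory.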

Output

Look in the directory you passed as --deployed. Its organization by subdirectory should be fairly self-explanatory. Sorry this is not better documented.

Cleanup

There is no 'make clean' target. Be aware that the intermediate Project directories created by Tapioca contain uncompressed fastq files, so they should not be left on disk long term. Delete these directories yourself.
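A cautious way to do that cleanup is sketched below. It assumes the intermediate directories sit directly inside the working directory and have names starting with "Project"; check that against your own run before deleting anything. The helper names are hypothetical, and nothing is removed unless CONFIRM=yes is set.

```shell
#!/bin/sh
list_project_dirs() {
    # $1 = Tapioca working directory (e.g. tap-work)
    find "$1" -mindepth 1 -maxdepth 1 -type d -name 'Project*' | sort
}

remove_project_dirs() {
    # Deletes only when CONFIRM=yes; otherwise prints what would be removed.
    list_project_dirs "$1" | while IFS= read -r dir; do
        if [ "${CONFIRM:-no}" = "yes" ]; then
            rm -rf -- "$dir"
        else
            echo "would remove: $dir"
        fi
    done
}

# Usage, after make deploy has succeeded and the deployed data is verified:
#   remove_project_dirs tap-work               # dry run
#   CONFIRM=yes remove_project_dirs tap-work   # actually delete
```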

Authors

John Crow https://github.com/crowja , Alex Rice ([email protected])

License

# Tapioca
# Copyright (C) 2013 National Center for Genome Resources - http://ncgr.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
