README

SWIPE
Smith-Waterman database searches with Inter-sequence Parallel Execution

Copyright (C) 2008-2021 Torbjorn Rognes, University of Oslo,
Oslo University Hospital and Sencel Bioinformatics AS

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Contact: Torbjorn Rognes <torognes@ifi.uio.no>, 
Department of Informatics, University of Oslo, 
PO Box 1080 Blindern, NO-0316 Oslo, Norway

--

SWIPE version 2 incorporates a number of new features compared to the initial
release, most notably hit statistics, alignments, as well as support for
nucleotide sequences, large databases and MPI.

The software is distributed from http://github.com/torognes/swipe
Older versions are also available from http://dna.uio.no/swipe/

Binary executables are available for Linux and Mac (both 64-bit).

SWIPE is distributed under the GNU Affero General Public License, version 3.
See the file COPYING for details. SWIPE includes some public domain code for
computing alignment statistics extracted from the BLAST software by the
National Center for Biotechnology Information (NCBI).

Please cite: Rognes T (2011) Faster Smith-Waterman database searches
with inter-sequence SIMD parallelisation. BMC Bioinformatics 12, 221.

Usage: ./[mpi]swipe [OPTIONS]
  -h, --help                 show help
  -d, --db=FILE              sequence database base name (required)
  -i, --query=FILE           query sequence filename (stdin)
  -M, --matrix=NAME/FILE     score matrix name or filename (BLOSUM62)
  -q, --penalty=NUM          penalty for nucleotide mismatch (-3)
  -r, --reward=NUM           reward for nucleotide match (1)
  -G, --gapopen=NUM          gap open penalty (11)
  -E, --gapextend=NUM        gap extension penalty (1)
  -v, --num_descriptions=NUM sequence descriptions to show (250)
  -b, --num_alignments=NUM   sequence alignments to show (100)
  -e, --evalue=REAL          maximum expect value of sequences to show (10.0)
  -k, --minevalue=REAL       minimum expect value of sequences to show (0.0)
  -c, --min_score=NUM        minimum score of sequences to show (1)
  -u, --max_score=NUM        maximum score of sequences to show (inf.)
  -a, --num_threads=NUM      number of threads to use [1-256] (1)
  -m, --outfmt=NUM           output format [0,7-9=plain,xml,tsv,tsv+] (0)
  -I, --show_gis             show gi numbers in results (no)
  -p, --symtype=NAME/NUM     symbol type/translation [0-4] (1)
  -S, --strand=NAME/NUM      query strands to search [1-3] (3)
  -Q, --query_gencode=NUM    query genetic code [1-23] (1)
  -D, --db_gencode=NUM       database genetic code [1-23] (1)
  -x, --taxidlist=FILE       taxid list filename (none)
  -N, --dump=NUM             dump database [0-2=no,yes,split headers] (0)
  -H, --show_taxid           show taxid etc in results (no)
  -o, --out=FILE             output file (stdout)
  -z, --dbsize=NUM           set effective database size (0)

Defaults are indicated in parentheses in the option list above. NUM means an
integer, while REAL means a floating point number (possibly in scientific
notation). NAME is a string or integer, while FILE is a filename. For databases,
FILE is the base name of the database files.

Most features and options are similar and compatible with NCBI
BLAST.

All scores (e.g. for options -c and -u) are raw alignment scores
(not bit scores) unless otherwise noted.

Please note that SWIPE does not employ composition based statistics (BLAST
option "-C"), but uses the standard Karin-Altschul statistics. Also, sequence
filtering of low complexity regions (BLAST option "-F") is not implemented.
Please pre-filter your query. SWIPE does not use the new sequence length
adjustment statistics (FSC) introduced in BLAST 2.2.26.

The BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM30, PAM70 and PAM250
matrices are "built-in". Other matrices may be specified with a file name.

The query and database sequence type is specified with the "-p" option using
either a numeric or text argument according to the table below:

Numeric String  Query  Database Comparisons
0       blastn  NT     NT       Direct + reverse complementary
1       blastp  AA     AA       Direct
2       blastx  NT     AA       Translated query (6 frames)
3       tblastn AA     NT       Translated database (6 frames)
4       tblastx NT     NT       Translated query and database (6x6 frames)

A plain text file with a list of taxid numbers, one per line, may be
specified with the "-x" option to search only sequences from the
specified organisms.

SWIPE accepts only databases prepared using the formatdb or makeblastdb
tools that are distributed together with NCBI BLAST and BLAST+. They
can be downloaded at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/
from NCBI. For makeblastdb, please use the "-blastdb_version 4" option
to make databases that are compatible with SWIPE.

The SWIPE distribution includes executable binaries for 64-bit Linux and
Mac. SWIPE may be compiled from sources using either the GNU g++ compiler,
the LLVM Clang compiler or the Intel icpc compiler. SWIPE is marginally
faster when compiled with the Intel compiler than with the GNU compiler.

SWIPE will only run on processors with the SSE2 instruction set and
runs best on processors with the SSSE3 instruction set. Please see
http://en.wikipedia.org/wiki/SSSE3 for a list of processors with this
instruction set.

The enclosed file scoring.pdf contains a table with the scoring system
parameters that will enable E-value statistics to be computed.

Examples of how to run SWIPE:

To run SWIPE with one or more protein sequence queries in the
FASTA-formatted file named "query.fsa" against the protein
database with base name "swissprot" using 8 threads and an expect
value threshold of 0.001:

./swipe -p 1 -i query.fsa -d swissprot -e 0.001 -a 8

To run SWIPE with a nucleotide sequence query in the file "primer.fsa"
against the nucleotide database file with base name "ecoli" using 1
thread and the default expect value of 10:

./swipe -p 0 -i primer.fsa -d ecoli

To search only the Human entries in the "est" database using a
translated search with the protein query sequence in the file
"prot.fsa":

echo 9606 > human.tax
./swipe -p 3 -i prot.fsa -d est -x human.tax