Skip to content

Latest commit

 

History

History
104 lines (77 loc) · 5.17 KB

README.md

File metadata and controls

104 lines (77 loc) · 5.17 KB

PaSGAL

MIT License

PaSGAL (Parallel Sequence to Graph Aligner) is designed to accelerate local sequence alignment of sequences to directed acyclic sequence graphs (DAGs), e.g., variation graphs, splicing graphs. The underlying algorithm is a parallelization of dynamic programming procedure for sequence to DAG alignment. With computing exact alignments being compute intensive, PaSGAL uses Advanced Vector Extensions (AVX) SIMD instructions and OpenMP to achieve high alignment performance on CPUs equipped with multiple cores and wide SIMD width. Given a set of query sequences (e.g., long PacBio/ONT or short Illumina reads) and a reference DAG, PaSGAL produces an highest scoring optimal local alignment for each query sequence along a path in the graph. Details about the algorithm and performance are available in our paper below.

Dependencies

  • cmake version >= 3.1
  • A C++ compiler with c++14 support, e.g., GNU g++ (version 5+) or Intel icpc (version 17+)
  • Google Protobuf library, also available using conda

Download and compile

The repository and external submodules can be downloaded using the recursive clone.

git clone --recursive <GITHUB_URL>

Next, compile the code using cmake utility:

mkdir build_directory && cd build_directory
cmake <OPTIONS> ../PaSGAL
make -j4

OPTIONS:

  1. -DPROTOBUF_DIR=<path> should provide absolute path to installation directory of google protobuf library.
  2. If avx512 feature is not available on the CPU being used, -DSIMD_SUPPORT=<avx512/avx2/none> should be specified accordingly.
  3. Cmake will automatically look for default C/C++ compilers. To modify the default selection if needed, users can set the two variables -DCMAKE_CXX_COMPILER=<path to C++ compiler> and -DCMAKE_C_COMPILER=<path to C compiler>.

After the compilation completes, expect an executable PaSGAL in your build_directory.

Usage

  • Produce help page
PaSGAL -h
  • Align a set of query sequences against a reference DAG (in .vg format):
PaSGAL -m vg -r graph.vg -q reads.fq -o outputfile -t 24
  • Align a set of query sequences against a reference DAG (in .txt format):
PaSGAL -m txt -r graph.txt -q reads.fq -o outputfile -t 24

Output file format: The output is tab-delimited with each line consisting of query id, query length, 0-based start offset, end offset, strand, reference graph start, reference graph end, alignment score and cigar string. The reference offsets are indicated as tuples of the corresponding vertex id and character offset in it.

Graph input format

PaSGAL currently accepts a DAG in two input formats: .vg and .txt. .vg is a protobuf serialized graph format, defined by VG tool developers here. .txt is a simple human readable format. The first line indicates the count of total vertices (say n). Each subsequent line contains information of vertex i, 0 <= i < n. The information in a single line conveys its zero or more out-neighbor vertex ids, followed by its non-empty DNA sequence (either space or tab separated). For example, the following graph is a directed chain of four vertices: AC (id:0) -> GT (id:1) -> GCCGT (id:2) -> CT (id:3)

4
1 AC
2 GT
3 GCCTG
CT

The first line above specifies the count of vertices as 4. The second line specifies that vertex 0 has an outgoing edge to vertex 1, and the label of vertex 0 is "AC". Similarly, the third line specifies that vertex 1 has an outgoing edge to vertex 2, and its label is "GT". The last vertex has no outgoing edge, so we just have its label.

An example run

Sample input is available in data folder to do a quick test run. Larger data sets which we used for benchmarking in the paper can be accessed here. Expect output log in the following format during execution:

$ PaSGAL -r data/BRCA1.vg -m "vg" -q data/reads.fa -t 36 -o output.txt
--------
Assert() checks     ON
AVX SIMD support    ON (AVX512)
VTUNE profiling     OFF
--------

INFO, psgl::parseandSave, reference file = data/BRCA1.vg (in vg format)
INFO, psgl::parseandSave, query file = data/reads.fa
INFO, psgl::parseandSave, output file = output.txt
INFO, psgl::parseandSave, thread count = 36
INFO, psgl::parseandSave, scoring scheme = [ match:1 mismatch:1 ins:1 del:1 ]
....
....
INFO, psgl::main, run finished

TODOs

  • Support semi-global alignment mode
  • Support affine gap penalty
  • Support .gfa input format for graphs
  • Support intra-task parallelization
  • Extend algorithm to cyclic graphs

Publication

  • Chirag Jain, Sanchit Misra, Haowen Zhang, Alexander Dilthey and Srinivas Aluru. "Accelerating Sequence Alignment to Graphs". IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019.