-
Notifications
You must be signed in to change notification settings - Fork 17
PeptideMapper
PeptideMapper is a lightweight and efficient application for the mapping of amino acid sequences onto protein databases in the FASTA format. It takes advantage of the latest advances in string pattern matching to create a self-contained index that can be rapidly queried.
We have implemented queries for peptide sequences as well as sequence tags (alternating series of amino acids and mass gaps as produced by de novo sequencing algorithms).
PeptideMapper can be used standalone in command line or integrated as a dependency in other applications.
For command line options and format specifications, please refer to PeptideMapperCLI. For use as a dependency, please refer to Usage As Dependency.
The first step is to prepare the protein sequences by creating one string with the pattern: /protein1/protein2/…/proteinn/ where proteini is the ith protein sequence in the given database and '/' a separation character preventing the mapping of peptides on two consecutive proteins. Accordingly, we compute the suffix array (SA) [1] of the complete proteome using divsufsort, chosen for its performance. Subsequently, the Burrows and Wheeler transform (BWT) [2] is computed and stored in a wavelet tree (WT) data structure [3] for fast rank queries.
Only the 22 amino acid characters, the four characters for amino acid combinations and '/' complete the alphabet Ε for the WT. To save memory, the WT is constructed according to a corresponding entropy-coded tree [4] built from the number of occurrences of every character. Additionally, the accession numbers extracted from the header are stored for every protein.
We implemented an efficient function querying all character occurrences within a substring in O(σ) where σ_=|Ε|. The SA is necessary to retrieve the starting positions of the mapped peptides within a protein. To save memory, we sample the SA and can retrieve the missing entries by using the last-to-front (LF)-mapping function [5] on the BWT.
For both peptide and sequence tag mapping, we utilize the backward search [5] which finds all occurrences of a pattern P in a text T in O(p) where p = |P|.
To manage ambiguous amino acids and amino acids with indistinguishable mass, we first split peptides into their constituent amino acids and create a set of possible amino acids at each position. E.g., ATDWIRK => {A,X}{T,X}{D,B,X}{W,X}{I,L,J,X}{R,X}{K,X}. Theoretically, for this short peptide, we would need to map all possible 384 combinations, but using a dynamic programming approach we omit mapping of combinations containing prefixes we previously discarded.
The result is a set of ranges within the suffix array indicating matches that can be back calculated to the actual positions in the corresponding proteins. Having the position, we retrieve the accession number that completes all necessary data for further processing.
Sequence Tags are defined as alternating series of masses and sequences of amino acids. They can be inferred from spectra using sequence tag algorithms or de novo sequencing tools. A simple example is a tag of the following pattern m1-s-m2, where mi is a mass and s a partial peptide sequence. Having mapped s as described previously, we try to map the mass m1. Therefore, we extend s with all possible preceding sequences in the database creating a set of sequences s'.
For every sequence in s', we store a nominal mass based on the individual amino acid masses and fixed modifications provided. We compute alternative possible masses based on the variable modifications. A sequence in s' is kept when reaching m1 making a new set of sequences s", and discarded if the nominal mass exceeds m1 plus a given tolerance mass. Subsequently, we extend every sequence in s" with its possible following amino acids as done previously until reaching m2. Since this extension is conducted in the opposite direction, we also compute an index for the reversed protein sequences. Having extended s to all sequences matching the tag, all proteins it appears in as well as the corresponding starting positions are returned.
For the following benchmark tests, we are using an Intel(R) Xeon(R) 2.80 GHz quad core desktop computer with 16 GB RAM. Note that only one core is used for the performance tests, since the task of mapping independent peptides against a sequence database can be heavily parallelized, the computation time can almost be divided by the number of cores used. We compared both index methods the prior ProteinTree with the new adapted FM-Index.
Yeast (May 2016):
- size: 6,721 Proteins, 3,025,143 residues, 0 X residues
- fixed modifications: none
- variable modifications: none
- fragment tolerance: 0.02 Da
Mouse (May 2016):
- size: 16,813 Proteins, 9,474,758 residues, 79 X residues
- fixed modifications: Carbamidomethylation of C
- variable modifications: none
- fragment tolerance: 0.02 Da
Human (July 2015):
- size: 20,207 Proteins, 11,326,153 residues, 670 X residues
- fixed modifications: Carbamidomethylation of C, Oxidation of M
- variable modifications: none
- fragment tolerance: 0.02 Da
Proteogenomics (March 2015) [6]:
- size: 83,721 Proteins, 13,851,427 residues, 0 X residues
- fixed modifications: Carbamidomethylation of C, Oxidation of M, Acetylation of K
- variable modifications: Acetylation of peptide N-term, Pyrolidone from Q, Pyrolidone from E
- fragment tolerance: 0.5 Da
Metaproteomics (January 2013) [7], [8]:
- size 55,152 Proteins, 100,955,085 residues, 2,561,698 X residues
- fixed modifications: Carbamidomethylation of C
- variable modifications: Oxidation of M, Pyrolidone from Q, Acetylation of peptide N-term
- fragment tolerance: 0.2 Da
All UniProt proteomes concatenated (July 2016):
- size: 551,705 Proteins, 197,114,987 residues, 8,027 X residues
- fixed modifications: Carbamidomethylation of C, Oxidation of M
- variable modifications: Acetylation of K
- fragment tolerance: 0.02 Da
For every database listed above, benchmark data sets D = {D1, D2, D3, D4} of different sizes (|D1| = 1,000, |D2| = 10,000, |D3| = 100,000, |D4| = 1,000,000) were created by extracting peptides and tags at random positions in the database. All benchmark data (proteomes, data sets, parameter files) are available here.
Database | time [s] | size [MB] | ||
---|---|---|---|---|
ProteinTree | FM-Index | ProteinTree | FM-Index | |
Yeast | 65.73 | 2.55 | 184.30 | 7.02 |
Mouse | 212.90 | 6.57 | 601.50 | 22.02 |
Human | 252.00 | 7.49 | 537.90 | 26.33 |
Proteogenomics | 526.20 | 8.81 | 1211.00 | 32.38 |
Metaproteomics | 2956 | 58 | 5423 | 238 |
All Proteomes | 7924 | 122 | 8653 | 458 |
Database | D1 [s] | D2 [s] | D3 [s] | D4 [s] | ||||
---|---|---|---|---|---|---|---|---|
ProteinTree | FM-Index | ProteinTree | FM-Index | ProteinTree | FM-Index | ProteinTree | FM-Index | |
Yeast | 6.067 | 0.041 | 16.960 | 0.206 | 22.450 | 1.385 | 72.010 | 12.018 |
Mouse | 10.170 | 0.057 | 22.480 | 0.313 | 35.930 | 2.291 | 90.230 | 19.490 |
Human | 8.260 | 0.056 | 25.850 | 0.403 | 123.100 | 2.297 | o.o.m. | 21.350 |
Proteogenomics | 8.910 | 0.058 | 39.260 | 0.262 | 259.200 | 2.094 | o.o.m. | 17.710 |
Metaproteomics | 70.720 | 0.224 | 1183.800 | 1.547 | 9015.000 | 13.490 | o.o.m. | 134.500 |
All Proteomes | o.o.m. | 0.119 | o.o.m. | 0.643 | o.o.m. | 6.270 | o.o.m. | 62.200 |
Database | D1 [s] | D2 [s] | D3 [s] | D4 [s] | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
ProteinTree | TagRecon | FM-Index | ProteinTree | TagRecon | FM-Index | ProteinTree | TagRecon | FM-Index | ProteinTree | TagRecon | FM-Index | |
Yeast | 4.670 | 238.000 | 0.096 | 20.640 | 279.000 | 0.590 | 38.120 | 281.000 | 3.863 | 205.900 | 288.000 | 33.596 |
Mouse | 7.140 | 915.000 | 0.167 | 29.480 | 874.000 | 1.196 | 107.300 | 1002.000 | 9.910 | 871.500 | 1226.000 | 97.100 |
Human | 8.030 | 1063.000 | 0.178 | 31.760 | 1151.000 | 1.352 | 146.000 | 1328.000 | 11.670 | o.o.m. | 1608.000 | 114.400 |
Proteogenomics | 9.850 | PTMs not supported | 0.297 | 40.110 | PTMs not supported | 1.546 | 234.700 | PTMs not supported | 12.160 | o.o.m. | PTMs not supported | 119.100 |
Metaproteomics | 63.930 | PTMs not supported | 6.281 | o.o.m. | PTMs not supported | 61.370 | o.o.m. | PTMs not supported | 609.300 | o.o.m. | PTMs not supported | 6205.000 |
All Proteomes | o.o.m. | 21981.000 | 1.417 | o.o.m. | 37791.000 | 12.850 | o.o.m. | 37353.000 | 112.800 | o.o.m. | 55171.000 | 1145.000 |
To run PeptideMapper, download the packed zip archive of the latest compomics-utilities library, version 4.9.0 o newer.
wget http://genesis.ugent.be/maven2/com/compomics/utilities/4.9.0/utilities-4.9.0.jar
unzip utilities-4.9.0.jar
cd utilities-4.9.0
java -cp utilities-4.9.0.jar com.compomics.util.experiment.identification.protein_inference.executable.PeptideMapping -p exampleFiles/PeptideMapping/yeast.fasta exampleFiles/PeptideMapping/yeast-pep-1k.csv results.csv
A detailed description of the comand line instructions is available here.
Should you encounter any issue with the usage of PeptideMapper, please create a new entry in the issue tracker.
[1] Manber and Myers, 1st Annual ACM-SIAM Symposium on Discrete Algorithms, 1990
[2] Burrows and Wheeler, Technical report, Digital Equipment
Corporation, 1994
[3] Grossi et al., Proceedings of the Fourteenth Annual
ACM-SIAM Symposium on Discrete Algorithms, 2003
[4] Huffman, Proceedings of the Institute of Radio Engineers, 1952
[5] Ferragina and Manzini, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000
[6] Tanca et al, PLoS ONE, 2013
[7] Menschaert et al, Molecular & Cellular Proteomics, 2013
[8] Crappé et al, Nucleic Acids Research, 2014