check uniprot mapping #2

mbsimonovic · 2017-09-18T15:13:30Z

reported by a user:

I used to work with PaxDB datasets from v3 and recently I switched to v4.
Since I mostly use the Reference Proteomes from Uniprot, I had to convert the PaxDB-STRING identifiers to Uniprot.
I had some issues with the mapping files provided that I would like to share with you.

Inconsistencies between uniprot mapping files for PaxDB v3 and v4
I noticed a significant change between the two versions when I tried to use the paxdb-uniprot mapping files for at least 2 species (cerevisiae, pombe).
pombe
4936 lines in 4896-uniprot-paxdb.v3.map
58 lines in 4896-uniprot-paxdb.v4.map
cerevisiae
6483 lines in 4932-uniprot-paxdb.v3.map
1208 lines in 4932-uniprot-paxdb.v4.map

Mapping PaxDB v4 IDs to official uniprot mapping file
So, I tried to scan all PaxDB-STRING IDs for those species in the official Uniprot mapping file (from the Uniprot FTP, see below for links).
I mostly used Perl for this:

Loading the STRING IDs into a hash (after trimming the prefix, which corresponds to the Taxon ID)
Checking, for each line of the mapping file, whether any field contains a value already present in the hash.
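
The two steps above could be sketched in Python roughly as follows (the report used Perl; this is only an illustration, not the user's script). It assumes the STRING IDs look like `4932.Q0045` and that the Uniprot idmapping file is tab-separated with three columns (accession, ID type, ID value). The function names `trim_taxon_prefix` and `scan_idmapping` are hypothetical.

```python
def trim_taxon_prefix(string_ids, taxon_id):
    """Return the set of STRING IDs with the leading '<taxon_id>.' removed."""
    prefix = f"{taxon_id}."
    trimmed = set()
    for string_id in string_ids:
        string_id = string_id.strip()
        if string_id.startswith(prefix):
            string_id = string_id[len(prefix):]
        if string_id:
            trimmed.add(string_id)
    return trimmed


def scan_idmapping(lines, keys):
    """Yield (matched_key, accession, id_type, id_value) for each matching line.

    A line matches as soon as any of its tab-separated fields is in `keys`;
    each line is reported at most once.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for field in fields:
            if field in keys:
                yield (field, *fields)
                break
```

For the real files, `lines` could be the handle returned by `gzip.open(path, "rt")` on the downloaded idmapping file, so the file is streamed rather than loaded into memory.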
Surprisingly, I found many more matches with Uniprot than version 4 of PaxDB seems to map:
26304 lines in PaxDB.sc_idmapping_010717.dat.scan.matched (cerevisiae)
4643 lines in PaxDB.sp_idmapping_010717.dat.scan.matched (pombe)
Obviously, multiple records from the Uniprot mapping files can match a single STRING ID, but after removing the redundant STRING-Uniprot ID pairs, I got:
for cerevisiae => 6440 PaxDB IDs corresponding to 6538 Uniprot ACs
for pombe => 4579 PaxDB IDs corresponding to 4571 Uniprot ACs
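
The de-duplication step might look like the sketch below. It assumes the `.matched` files are whitespace-separated with the STRING ID in the first column and the Uniprot AC in the second, as in the preview at the end of this report; `summarize_matches` is a hypothetical helper, not part of any PaxDB or STRING tooling.

```python
def summarize_matches(lines):
    """Count distinct (STRING ID, Uniprot AC) pairs and the distinct IDs on each side.

    Returns (n_pairs, n_string_ids, n_accessions).
    """
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.add((fields[0], fields[1]))
    string_ids = {string_id for string_id, _ in pairs}
    accessions = {accession for _, accession in pairs}
    return len(pairs), len(string_ids), len(accessions)
```

When the two distinct counts differ, as they do for both species above, some STRING IDs match more than one accession or the other way round.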

Checking whether STRING has same problems of mapping
I understand that PaxDB relies on STRING, which essentially ran BLAST against the full Uniprot to generate a mapping file.
I've checked the mapping done by STRING, and it seems fine for those species (https://string-db.org/mapping_files/uniprot_mappings/).
5339 lines in 4896_reviewed_uniprot_2_string.04_2015.tsv
9818 lines in 4932_reviewed_uniprot_2_string.04_2015.tsv

Additionally, I noticed that STRING had a similar problem mapping to Uniprot for at least one species (drosophila), whereas PaxDB v4 had no issues:
3486 lines in 7227_reviewed_uniprot_2_string.04_2015.tsv
36390 lines in 7227-paxdb_uniprot.txt

I am also using eggNOG, and this Uniprot mapping problem seems to have propagated there as well (at least for cerevisiae and pombe).

I still very much appreciate the great deal of work you've put into all those projects (PaxDB, STRING, eggNOG).
I hope this can contribute to fixing bugs and perhaps make it accessible to a broader community.
Thanks.

PS:
Here is a preview of the matched records between the PaxDB IDs from the v4 datasets and the Uniprot mapping file (the first column is the STRING ID that was used as the hash key in Perl):
==> PaxDB.sc.sc_idmapping_060717.dat.scan.matched <==
Q0045 P00401 Gene_OrderedLocusName Q0045
Q0045 P00401 EnsemblGenome Q0045
Q0045 P00401 EnsemblGenome_TRS Q0045
Q0045 P00401 EnsemblGenome_PRO Q0045
Q0050 P03875 Gene_OrderedLocusName Q0050
Q0050 P03875 EnsemblGenome Q0050
Q0050 P03875 EnsemblGenome_TRS Q0050
Q0050 P03875 EnsemblGenome_PRO Q0050
Q0055 P03876 Gene_OrderedLocusName Q0055
Q0055 P03876 EnsemblGenome Q0055

==> PaxDB.sp.sp_idmapping_060717.dat.scan.matched <==
SPAC1002.01.1 Q9US57 EnsemblGenome_TRS SPAC1002.01.1
SPAC1002.02.1 Q9US56 EnsemblGenome_TRS SPAC1002.02.1
SPAC1002.03c.1 Q9US55 EnsemblGenome_TRS SPAC1002.03c.1
SPAC1002.04c.1 Q9US54 EnsemblGenome_TRS SPAC1002.04c.1
SPAC1002.05c.1 Q9US53 EnsemblGenome_TRS SPAC1002.05c.1
SPAC1002.06c.1 Q9US52 EnsemblGenome_TRS SPAC1002.06c.1
SPAC1002.07c.1 P79081 EnsemblGenome_TRS SPAC1002.07c.1
SPAC1002.08c.1 Q9US51 EnsemblGenome_TRS SPAC1002.08c.1
SPAC1002.09c.1 O00087 EnsemblGenome_TRS SPAC1002.09c.1
SPAC1002.10c.1 Q9US49 EnsemblGenome_TRS SPAC1002.10c.1

Mapping files from the Uniprot FTP:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/SCHPO_284812_idmapping.dat.gz
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/YEAST_559292_idmapping.dat.gz

mbsimonovic added the question label Sep 18, 2017
