Duplicated Ensembl IDs #23

Closed
floatingpurr opened this issue Oct 26, 2018 · 5 comments

Comments

@floatingpurr

Hi guys! I am opening this issue to report a potential problem that I found in the data.

According to this query:

SELECT ?item ?itemLabel ?item2 ?item2Label 
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item2 wdt:P594 ?ensg .
  FILTER (str(?item) > str(?item2))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!
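For anyone who wants to rerun this check outside the web UI, here is a minimal sketch that only builds the request URL. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql with its `query` and `format` parameters; the function and constant names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The duplicate-ID query from this comment, unchanged.
DUPLICATE_ENSEMBL_QUERY = """
SELECT ?item ?itemLabel ?item2 ?item2Label
WHERE
{
  ?item wdt:P594 ?ensg .
  ?item2 wdt:P594 ?ensg .
  FILTER (str(?item) > str(?item2))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""


def build_query_url(query, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results (hypothetical helper)."""
    params = {"query": query.strip(), "format": "json"}
    return endpoint + "?" + urlencode(params)


url = build_query_url(DUPLICATE_ENSEMBL_QUERY)
print(url)
```

The URL can then be fetched with any HTTP client. The sketch deliberately stops before the network call so it stays runnable offline.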

Some Ensembl IDs are reused across items, which seems pretty strange.

For example, Q413766 is the Fibronectin 1 protein, and Q14819473 is the gene that encodes it. Both items carry the statement
?item wdt:P594 'ENSG00000115414'. AFAIK, ENSG* identifiers should be reserved for genes.
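The same check can be sketched offline on a list of (item, Ensembl ID) pairs, for example from a dump. This is only an illustration: the first two pairs come from the example above, the third entry is a made-up control, and `find_shared_ids` is a hypothetical helper name.

```python
from collections import defaultdict


def find_shared_ids(statements):
    """Group items by Ensembl gene ID and keep the IDs used by more than one item."""
    items_by_id = defaultdict(set)
    for item, ensembl_id in statements:
        items_by_id[ensembl_id].add(item)
    return {
        ensembl_id: sorted(items)
        for ensembl_id, items in items_by_id.items()
        if len(items) > 1
    }


# (item, P594 value) pairs.
statements = [
    ("Q413766", "ENSG00000115414"),    # Fibronectin 1 protein (from this issue)
    ("Q14819473", "ENSG00000115414"),  # its encoding gene (from this issue)
    ("Q_EXAMPLE", "ENSG00000000001"),  # made-up control entry, not real data
]

shared = find_shared_ids(statements)
print(shared)
```

Unlike the SPARQL query, which returns one row per pair of items, this groups all items that share an ID into a single entry.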

Is there something to check in the data loading process?

PS: guys at SuLab, please don't hate me too much for my issue submissions 😃

@stuppie
Contributor

stuppie commented Oct 29, 2018

No problem Andrea, thanks for pointing these out!
I haven't looked at all the instances, but in this case at least, it looks like the statement was added by Tobias1984 in 2013 (link).
There are 54 cases, so I'm guessing it's a combination of merges and old statements. I will take a look at the others.

@stuppie
Contributor

stuppie commented Oct 29, 2018

Looks like some of the others are cases in which a gene has two different Entrez IDs but Ensembl calls it the same gene.
Example: https://www.wikidata.org/wiki/Q21821399 https://www.wikidata.org/wiki/Q27107877

@floatingpurr
Author

I see. The latter is the same case as #19.

Regarding Fibronectin-like cases, namely proteins with an ENSG* identifier, I tried to get them all with the following query:

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item wdt:P31|wdt:P279 wd:Q8054 .
  FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!

It turned out that there are 2 proteins (Fibronectin 1 and Myoglobin) that each have both an ENSG* and an ENSMUS* identifier as Ensembl Gene ID. The user you mentioned inserted those statements in 2013. The solution is to remove those 4 statements.
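The filter in the query above (has an Ensembl gene ID, is typed as protein, is not typed as gene) can be mirrored on local data. This is a rough sketch under assumptions: the dictionary layout and the helper name are hypothetical, types are only checked one level deep as in the query, and the third item is invented to show the exclusion.

```python
PROTEIN = "Q8054"  # Wikidata item for "protein"
GENE = "Q7187"     # Wikidata item for "gene"


def proteins_with_gene_ids(items):
    """Return QIDs that carry an Ensembl gene ID and are typed as protein but not as gene."""
    return sorted(
        qid
        for qid, info in items.items()
        if info["ensembl_gene_ids"]
        and PROTEIN in info["types"]
        and GENE not in info["types"]
    )


# "types" stands for the union of P31 and P279 values, as in the query.
items = {
    "Q413766": {"types": {PROTEIN}, "ensembl_gene_ids": ["ENSG00000115414"]},
    "Q14819473": {"types": {GENE}, "ensembl_gene_ids": ["ENSG00000115414"]},
    # Made-up item typed as both protein and gene; the filter excludes it.
    "Q_EXAMPLE": {"types": {PROTEIN, GENE}, "ensembl_gene_ids": ["ENSG00000000001"]},
}

suspects = proteins_with_gene_ids(items)
print(suspects)
```

Each returned item still needs a manual look, since the fix differs per case: either the Ensembl statements are wrong or the "instance of protein" statement is.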

Regarding CRIP1, someone at 213.96.40.12 marked the item as a protein, but that does not look correct to me, since this is a gene. The solution is to remove the statement "instance of protein".

If you agree, I'd proceed as suggested.

@stuppie
Contributor

stuppie commented Oct 30, 2018

Looks good to me, thanks

@floatingpurr
Author

You are welcome!

I've just updated those 3 items. Strangely, this query:

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item wdt:P31|wdt:P279 wd:Q8054 .
  FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Try it!

still returns the old results. Probably something similar to #22 is going on.

I'm going to close this issue and report the stale-data problem on Phabricator.

Thanks! 🤙

2 participants