DataTransform not joining some documents that belong together #127

Open
ravila4 opened this issue Jan 11, 2022 · 1 comment

ravila4 commented Jan 11, 2022

I have found several documents from aeolus, unii, and ginas that belong together with documents from chembl/pubchem via primary key. For example, http://mychem.info/v1/chem/22T8Z09XAK and http://mychem.info/v1/chem/XNCKCDBPEMSUFA-UHFFFAOYSA-N both refer to the same entity and should be joined.

I think that the datatransform graph is missing some important links. This is the current graph of connections provided by MyChem's keylookup module. Note that links are missing for the drugcentral and rxnorm nodes.

[Figure: mychem_graph]

In the example above, the two documents could be linked via aeolus.unii, aeolus.rxnorm, or unii.unii to drugcentral.unii or drugcentral.rxnorm.
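To illustrate why a single missing link keeps documents apart, here is a minimal sketch using a plain adjacency list and a breadth-first search. The node names echo fields mentioned in this issue, but the edges are hypothetical and do not reproduce MyChem's actual keylookup graph or the datatransform module's API.

```python
from collections import deque

# Hypothetical adjacency list: each key is an identifier type, each value lists
# the identifier types it can be converted to. The edges are illustrative only,
# not MyChem's real graph.
graph = {
    "aeolus.unii": ["unii.unii"],
    "unii.unii": [],
    "drugcentral.unii": ["inchikey"],
    "chembl": ["inchikey"],
    "pubchem": ["inchikey"],
    "inchikey": [],
}


def find_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Without a link into drugcentral, the UNII cannot reach an InChIKey.
before = find_path(graph, "aeolus.unii", "inchikey")

# Adding the (assumed) missing edge makes a conversion path available.
graph["unii.unii"].append("drugcentral.unii")
after = find_path(graph, "aeolus.unii", "inchikey")
```

In this toy graph, `before` is `None` and `after` is a four-node path ending at `inchikey`, which is the kind of change the proposed links would make.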

Additionally, parsers such as Drugcentral's, which perform ID resolution themselves, could benefit from offloading this step to the datatransform module. For example, this is the current code that Drugcentral uses to determine the primary ID for documents without an inchikey:

def xrefs_2_inchikey(xrefs_dict):
    # Keyword list is ordered by search priority
    xrefs_key_list = ['umlscui', 'chembl_id', 'pubchem_cid', 'chebi', 'drugbank_id', 'unii']
    mychem_field_dict = {
        'umlscui': 'umls.cui:"',
        'chembl_id': 'chembl.molecule_chembl_id:"',
        'pubchem_cid': 'pubchem.cid:"CID',
        'chebi': 'chebi.chebi_id:"',
        'drugbank_id': 'drugbank.accession_number:"',
        'unii': 'unii.unii:"'
    }
    mychem_query = 'http://mychem.info/v1/query?q='
    results_dict = {}
    results = []
    for _key in xrefs_key_list:
        if _key in xrefs_dict:
            for _xrefs in to_list(xrefs_dict[_key]):
                query_url = mychem_query + mychem_field_dict[_key] + _xrefs + '"'
                logging.info("Querying mychem.info: {}".format(query_url))
                json_doc = requests.get(query_url).json()
                if 'hits' in json_doc and json_doc['hits']:
                    for hit in json_doc['hits']:
                        logging.info("Hit: {}".format(hit['_id']))
                        results.append(hit['_id'])
    return list(set(results))

In the code above, the parser is running requests against the live MyChem database. It would be better to deal with resolution without depending on external requests.
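As a rough sketch of what resolution without external requests could look like, the function below looks identifiers up in a local table instead of querying the live API. `resolve_offline` and `local_index` are hypothetical names introduced here, not part of the parser or the datatransform module. The priority list is copied from the code above, and the sample entry uses the two IDs from the example at the top of this issue.

```python
# Search priority copied from the parser quoted above.
XREFS_PRIORITY = ["umlscui", "chembl_id", "pubchem_cid", "chebi", "drugbank_id", "unii"]


def resolve_offline(xrefs_dict, local_index):
    """Return primary IDs for the highest-priority xref type with a local match.

    `local_index` is a hypothetical mapping {(xref_type, value): [primary_id, ...]}
    built ahead of time from source data, so no HTTP request is made.
    """
    for key in XREFS_PRIORITY:
        values = xrefs_dict.get(key)
        if values is None:
            continue
        # Accept either a single value or a list of values.
        if not isinstance(values, (list, tuple)):
            values = [values]
        hits = []
        for value in values:
            hits.extend(local_index.get((key, str(value)), []))
        if hits:
            return sorted(set(hits))
    return []


# Sample entry: the issue states these two IDs refer to the same entity.
local_index = {("unii", "22T8Z09XAK"): ["XNCKCDBPEMSUFA-UHFFFAOYSA-N"]}
resolved = resolve_offline({"unii": "22T8Z09XAK"}, local_index)
unresolved = resolve_offline({"chebi": "0"}, local_index)
```

This sketch stops at the first identifier type that yields a match, which is one reading of "ordered by search priority". Collecting hits across all types would be an equally plausible variant.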


ravila4 commented Mar 24, 2022

Additional examples:
http://mychem.info/v1/query?q=apadamtase%20alfa - all these documents belong together, and could be joined by mapping drugname (fda_orphan_drug.generic_name) to unii.
