DataTransform not joining some documents that belong together #127

ravila4 · 2022-01-11T21:02:03Z

I have found several documents from aeolus, unii, and ginas that belong together with documents from chembl/pubchem via primary key. For example, http://mychem.info/v1/chem/22T8Z09XAK and http://mychem.info/v1/chem/XNCKCDBPEMSUFA-UHFFFAOYSA-N both refer to the same entity and should be joined.

I think that the datatransform graph is missing some important links. This is the current graph of connections provided by MyChem's keylookup module. Note that links are missing for the drugcentral and rxnorm nodes.

[Image: graph of the identifier connections provided by MyChem's keylookup module]

In the example above, the two documents could be linked via aeolus.unii, aeolus.rxnorm, or unii.unii to drugcentral.unii or drugcentral.rxnorm.
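To illustrate why the missing links matter, here is a minimal sketch of a path search over a graph of identifier types. The node names and edges are assumptions made for this example; they are not the actual keylookup graph, and `find_path` is a hypothetical helper rather than part of the datatransform module.

```python
from collections import deque

# Hypothetical edges between identifier types (assumed for illustration).
# The two marked edges are the proposed additions, not existing links.
EDGES = {
    "chembl": ["inchikey"],
    "pubchem": ["inchikey"],
    "drugcentral": ["inchikey"],
    "unii": ["drugcentral"],    # proposed: unii -> drugcentral.unii
    "rxnorm": ["drugcentral"],  # proposed: rxnorm -> drugcentral.rxnorm
}


def find_path(edges, start, goal):
    """Breadth-first search; return the shortest node path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path(EDGES, "unii", "inchikey"))
```

With the two proposed edges removed, the same search from `unii` finds no route to `inchikey`, which is the situation that leaves the documents unjoined.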

Additionally, parsers that perform ID resolution themselves, such as DrugCentral's, could benefit from offloading this step to the datatransform module. For example, this is the code that DrugCentral currently uses to determine the primary ID for documents without an InChIKey:

mychem.info/src/hub/dataload/sources/drugcentral/drugcentral_parser.py

Lines 161 to 185 in e7c3247

    
    def xrefs_2_inchikey(xrefs_dict):
        # Keyword list is ordered by search priority
        xrefs_key_list = ['umlscui', 'chembl_id', 'pubchem_cid', 'chebi', 'drugbank_id', 'unii']
        mychem_field_dict = {
            'umlscui': 'umls.cui:"',
            'chembl_id': 'chembl.molecule_chembl_id:"',
            'pubchem_cid': 'pubchem.cid:"CID',
            'chebi': 'chebi.chebi_id:"',
            'drugbank_id': 'drugbank.accession_number:"',
            'unii': 'unii.unii:"'
            }
        mychem_query = 'http://mychem.info/v1/query?q='
        results_dict = {}
        results = []
        for _key in xrefs_key_list:
            if _key in xrefs_dict:
                for _xrefs in to_list(xrefs_dict[_key]):
                    query_url = mychem_query + mychem_field_dict[_key] + _xrefs + '"'
                    logging.info("Querying mychem.info: {}".format(query_url))
                    json_doc = requests.get(query_url).json()
                    if 'hits' in json_doc and json_doc['hits']:
                        for hit in json_doc['hits']:
                            logging.info("Hit: {}".format(hit['_id']))
                            results.append(hit['_id'])
        return list(set(results))

In the code above, the parser is running requests against the live MyChem database. It would be better to deal with resolution without depending on external requests.
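As a rough sketch of what resolution without external requests could look like, the example below builds an in-memory index from locally available records and looks up xrefs against it. `build_index` and `resolve_offline` are hypothetical names, and the `(primary_id, xrefs_dict)` record shape is an assumption; neither is an existing part of the hub. Unlike the parser code above, this sketch stops at the first xref type that produces a match instead of collecting hits across all types.

```python
from collections import defaultdict

# Same priority order as the parser's xrefs_key_list.
XREF_PRIORITY = ["umlscui", "chembl_id", "pubchem_cid", "chebi", "drugbank_id", "unii"]


def _as_list(value):
    """Wrap a scalar in a list so single and multiple xrefs are handled alike."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def build_index(records):
    """Map (xref_type, xref_value) -> set of primary ids.

    `records` is an iterable of (primary_id, xrefs_dict) pairs; how these
    are obtained (for example from a local dump) is left open here.
    """
    index = defaultdict(set)
    for primary_id, xrefs in records:
        for xref_type, values in xrefs.items():
            for value in _as_list(values):
                index[(xref_type, str(value))].add(primary_id)
    return index


def resolve_offline(xrefs, index, priority=XREF_PRIORITY):
    """Return sorted ids for the highest-priority xref type that matches."""
    for xref_type in priority:
        hits = set()
        for value in _as_list(xrefs.get(xref_type, [])):
            hits |= index.get((xref_type, str(value)), set())
        if hits:
            return sorted(hits)
    return []


# Usage with the pair of identifiers from the example at the top of this issue.
index = build_index([("XNCKCDBPEMSUFA-UHFFFAOYSA-N", {"unii": "22T8Z09XAK"})])
print(resolve_offline({"unii": "22T8Z09XAK"}, index))
```

A lookup like this depends only on data the hub already has, so a parser run would not be affected by the availability or current contents of the live API.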

ravila4 · 2022-03-24T20:50:01Z

Additional examples:
http://mychem.info/v1/query?q=apadamtase%20alfa - all these documents belong together, and could be joined by mapping drugname (fda_orphan_drug.generic_name) to unii.
