BioThings repoDB parser changes #169

Closed
colleenXu opened this issue Jan 10, 2024 · 1 comment
Labels
On CI Match https://github.com/biothings/biothings_explorer/labels

colleenXu commented Jan 10, 2024

PRIORITY: medium. It'd be useful to have for the upcoming biolink-model refactor ("treats"). Higher priority than #170.

While writing the SmartAPI yaml w/ x-bte annotation for BioThings repoDB, I noticed some issues.

After discussion with Andrew yesterday, we agreed that these changes should be made:

  1. Changing the parser to create association-centric data (unique combos of drug-disease-status) rather than drug-centric data (the current structure) would be helpful, particularly for the upcoming "treats" refactor. I wrote more about the problems with the current data structure in the linked issue Data source: repoDB #77 (comment).
Mockup of what association-centric data may look like

Right now, there's 1 record for the drug Rituximab.

It'd be transformed into multiple records, 1 for each combo of rituximab + unique disease + unique status.

So for rituximab + "Lymphoma, Non-Hodgkin" C0024305, there'd be 3 records (3 different statuses). I didn't include all the info for the "Terminated" record since there are currently 18 objects/clinical-trials in the data.

[
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Approved"
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Terminated",
    "clinical_trial_info": [
      {
        "NCT": "NCT00057343",
        "phase": "Phase 3"
      },
      {
        "NCT": "NCT00057447",
        "detailed_status": "administrative reasons",
        "phase": "Phase 1/Phase 2"
      },
      ....
    ]
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Withdrawn",
    "clinical_trial_info": [
      {
        "NCT": "NCT02408042",
        "phase": "Phase 1/Phase 2"
      }
    ]
  }
]
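
The regrouping shown in the mockup could be sketched roughly as follows. This is only an illustration, not the actual parser: the input layout (`drugbank_id`, `name`, and an `indications` list of flat rows) is a hypothetical stand-in for the current drug-centric record, and `to_association_records` is a made-up helper name. Only the output field names follow the mockup.

```python
def to_association_records(drug_doc):
    """Split one drug-centric doc into one record per (disease, status) combo.

    ASSUMPTION: the input layout used here is hypothetical; the real
    drug-centric repoDB record may be structured differently.
    """
    grouped = {}
    for ind in drug_doc.get("indications", []):
        key = (ind["umls"], ind["status"])
        if key not in grouped:
            grouped[key] = {
                "drug_drugbank_id": drug_doc["drugbank_id"],
                "drug_name": drug_doc["name"],
                "indication_umls": ind["umls"],
                "indication_name": ind["name"],
                "status": ind["status"],
            }
        # Keep only the trial-level fields that are present on this row.
        trial = {k: ind[k] for k in ("NCT", "detailed_status", "phase") if k in ind}
        if trial:
            grouped[key].setdefault("clinical_trial_info", []).append(trial)
    # Dicts preserve insertion order, so records come out in first-seen order.
    return list(grouped.values())


# Hypothetical drug-centric input built from the values in the mockup above.
nhl = {"umls": "C0024305", "name": "Lymphoma, Non-Hodgkin"}
sample = {
    "drugbank_id": "DB00073",
    "name": "rituximab",
    "indications": [
        {**nhl, "status": "Approved"},
        {**nhl, "status": "Terminated", "NCT": "NCT00057343", "phase": "Phase 3"},
        {**nhl, "status": "Terminated", "NCT": "NCT00057447",
         "detailed_status": "administrative reasons", "phase": "Phase 1/Phase 2"},
        {**nhl, "status": "Withdrawn", "NCT": "NCT02408042", "phase": "Phase 1/Phase 2"},
    ],
}
records = to_association_records(sample)
```

With this sample, `records` holds three entries (Approved, Terminated, Withdrawn), and only the latter two carry `clinical_trial_info`.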

  2. Figure out what the field value "NA" means. If it basically means "not available/applicable", I'd find it helpful if the parser removed the fields with "NA" values. That way BTE would be able to use these fields without post-processing to remove "NA".
"NA" is a common value for these fields

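If "NA" does turn out to mean "not available/applicable", the cleanup could look something like the sketch below. `drop_na` is a hypothetical helper, not part of the existing parser, and treating the literal string "NA" as a missing value is exactly the assumption that still needs to be confirmed.

```python
def drop_na(value, na_values=("NA",)):
    """Recursively drop dict fields and list items equal to an NA marker.

    ASSUMPTION: "NA" means "not available/applicable"; this still needs
    to be confirmed against the repoDB data.
    """
    def is_na(v):
        return isinstance(v, str) and v in na_values

    if isinstance(value, dict):
        return {k: drop_na(v, na_values) for k, v in value.items() if not is_na(v)}
    if isinstance(value, list):
        return [drop_na(v, na_values) for v in value if not is_na(v)]
    return value


# A trial object where one field holds the "NA" placeholder.
trial = {"NCT": "NCT00057343", "detailed_status": "NA", "phase": "Phase 3"}
cleaned = drop_na(trial)
```

Here `cleaned` keeps `NCT` and `phase` and no longer has a `detailed_status` key, so BTE would not need to filter "NA" afterwards.
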
  3. Double-check whether this API is using the latest data from repoDB (v2.1, 2023-06-15, listed in the version history section of the repoDB website). Based on the metadata endpoint, it might be using the latest data, but the original development and deployment was in 2022, before that data release.
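
One way to make the version check repeatable is a small comparison against the release date above. This is a sketch under loud assumptions: it assumes the metadata response is a dict with a `src` section keyed by source name, and that the source's `version` field is an ISO date string. Neither is confirmed here, and `source_is_current` is a hypothetical helper.

```python
from datetime import date

# repoDB v2.1 release date, as stated in the issue text.
LATEST_REPODB_RELEASE = date(2023, 6, 15)


def source_is_current(metadata, source="repodb", latest=LATEST_REPODB_RELEASE):
    """Return True if the source version date is on or after the latest release.

    ASSUMPTION: metadata["src"][source]["version"] exists and is an ISO
    date string (YYYY-MM-DD); the real metadata layout may differ.
    """
    version = metadata["src"][source]["version"]
    return date.fromisoformat(version) >= latest


# Hypothetical metadata payloads, for illustration only.
up_to_date = source_is_current({"src": {"repodb": {"version": "2023-06-15"}}})
outdated = source_is_current({"src": {"repodb": {"version": "2022-03-01"}}})
```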
everaldorodrigo added the On CI Match https://github.com/biothings/biothings_explorer/labels label Jan 31, 2024
colleenXu (Author) commented

I think this issue has been addressed, so I'm closing it. I noted that all instances of the APIs were updated here. There were also detailed discussions in the lab Slack (one thread here that ended with all changes agreed on and deployed to CI).

@everaldorodrigo I suggest adding links to the PRs/code changes related to this issue.
