Investigate PFOCR options (strict, synonyms, all) for BTE use #778

Closed
colleenXu opened this issue Jan 31, 2024 · 8 comments

Comments

@colleenXu
Collaborator

colleenXu commented Jan 31, 2024

See convo in biothings/pending.api#148

Currently, the new BioThings PFOCR API, which has a parameter for choosing between the strict, synonyms, and all options, is available on BioThings-Translator-CI: https://biothings.ci.transltr.io/pfocr

We'll want to choose what to use for BTE's regular use (x-bte annotation). And maybe think about the result-augmentation module?

Note: also consider adding a pfocrUrl field to the x-bte annotation. EDIT: see Alex Pico's advice on URLs in NCATS-Tangerine/translator-api-registry#132

@colleenXu
Collaborator Author

These PFOCR options were defined earlier:

The smallest file (suffix, "strict") only includes gene mentions that match a current, official ncbigene symbol. The next largest (suffix, "synonyms") additionally includes matches to known synonyms of individual genes. And the largest (suffix, "all") additionally includes matches to names of complexes, families and other less precise (yet common) references to one or more genes.

Depending on the use case, I might prefer one over the other. None is universally "better" than the others, imo. For BTE, I might recommend "all" since we want to maximize coverage. However, if the performance becomes an issue, the smaller sets will run faster.
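
To make the nesting described in that quote concrete, here is a minimal Python sketch. It is not the PFOCR pipeline: the three mentions and the level assigned to each are toy values chosen for illustration (an official symbol, a gene synonym, and a family name), and `mentions_for` is a hypothetical helper. It only shows that each option is a superset of the previous one.

```python
# Match levels, ordered from most to least precise (names taken from the quote above).
LEVELS = ["strict", "synonyms", "all"]

# Toy mentions for one hypothetical figure (illustrative data, not real PFOCR output):
# (text found in the figure, first level at which it would be included)
MENTIONS = [
    ("TP53", "strict"),    # official gene symbol
    ("p53", "synonyms"),   # synonym of a single gene
    ("PI3K", "all"),       # family name, less precise
]


def mentions_for(option):
    """Return the mentions that would be kept under the given option."""
    cutoff = LEVELS.index(option)
    return [text for text, level in MENTIONS if LEVELS.index(level) <= cutoff]


for option in LEVELS:
    print(option, mentions_for(option))
```

In this toy case "strict" keeps one mention and "all" keeps three, which is the coverage versus precision trade-off being discussed here.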

My previous opinion (lab Slack link) was to use strict for BTE one-hops and for any result-augmentation module:

  • we've previously had confusion over "these figures don't actually have these genes in them"
  • we want any PFOCR-related result augmenting / grouping effort to be very performant, since it may search 1000s of results against a good chunk of the API contents for each query

@AlexanderPico
Collaborator

I vote for using "all" to maximize coverage and possible aggregate-based, pathway-level insights.

As @colleenXu said above, if performance becomes an issue, then we can dial back to "synonyms" and then to "strict". But these are not preferred for biological research reasons.

@AlexanderPico
Collaborator

@ayushi-agrawal-gladstone @khanspers Please share your votes and reasoning on this issue...

@khanspers
Collaborator

Assuming we are voting on "BTE's regular use", I agree with Alex to use "all" unless performance is an issue. But for results-augmentation module, Colleen's point about "these figures don't actually have these genes in them" makes a lot of sense and would argue for "strict" in that usage.

@ayushi-agrawal-gladstone
Collaborator

I agree with Alex and Kristina and vote for using "all" provided there are no performance issues.

@andrewsu
Member

great, let's go with "all" then. @everaldorodrigo what is the default if no additional parameter is provided (e.g., https://biothings.transltr.io/pfocr/query?q=associatedWith.pmc:PMC3255783)?
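
For anyone who wants to reproduce that kind of query from a script, a minimal Python sketch of building the URL follows. The base URL and the `q` value come from the example above; `build_pfocr_query_url` and its `extra_params` argument are hypothetical helpers. The name of the parameter that selects strict / synonyms / all is not stated in this thread, so the sketch does not guess it and leaves any extra parameters to the caller.

```python
from urllib.parse import urlencode

# Query endpoint taken from the example URL in this thread.
BASE_URL = "https://biothings.transltr.io/pfocr/query"


def build_pfocr_query_url(q, extra_params=None):
    """Build a PFOCR query URL; extra_params is an optional dict supplied by the caller."""
    params = {"q": q}
    if extra_params:
        params.update(extra_params)
    # Keep ':' unescaped so the field:value query stays readable.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


# No additional parameter, i.e. the API's default behaviour applies.
url = build_pfocr_query_url("associatedWith.pmc:PMC3255783")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the sketch stops at URL construction so that it runs without network access.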

@everaldorodrigo

Hi @andrewsu and everybody!

The default is all.

Reference:
