Case-sensitive handling of stopwords and overrides #680
Comments
This would take some engineering to fix...
What I am suggesting is a bit more specific than making everything case sensitive. Namely, this would only apply in cases where there is ambiguity between two choices, i.e., a grounding entry and either a stopword or an explicit override with different capitalization. If there is no ambiguity with either of these entry types then case-insensitive matching would continue to apply.
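As a rough illustration of the rule described above, here is a minimal Python sketch. It is not Reach's actual code (Reach itself is not written in Python), and `is_stopword`, `stopwords`, and `grounding_keys` are hypothetical names chosen for this example. One assumption is made for a case the thread does not settle: when a string is ambiguous but matches neither spelling exactly (e.g. `Impact`), the sketch falls back to the current case-insensitive stopword behavior.

```python
def is_stopword(text, stopwords, grounding_keys):
    """Decide whether `text` should be dropped as a stopword.

    Matching stays case-insensitive unless the same string (ignoring case)
    also appears among the grounding entries; only then does exact
    capitalization decide. All names here are hypothetical.
    """
    key = text.lower()
    stop_hits = {s for s in stopwords if s.lower() == key}
    if not stop_hits:
        return False  # no stopword matches, even ignoring case

    ground_hits = {g for g in grounding_keys if g.lower() == key}
    if not ground_hits:
        return True  # no ambiguity: plain case-insensitive stopword match

    # Ambiguous: a stopword and a grounding entry differ only in case.
    if text in stop_hits:
        return True
    if text in ground_hits:
        return False
    # Assumption: neither spelling matches exactly, so keep current behavior.
    return True


stopwords = {"impact", "m2", "the"}
grounding_keys = {"IMPACT", "M2", "ERK"}

print(is_stopword("impact", stopwords, grounding_keys))
print(is_stopword("IMPACT", stopwords, grounding_keys))
print(is_stopword("The", stopwords, grounding_keys))
```

With these toy sets, `impact` and `m2` are filtered out, `IMPACT` and `M2` are kept for grounding, and `The` is still treated as a stopword because no grounding entry competes with it.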
I see. Let me think about this.
@bgyori: for the stop words matching, maybe we can improve the logic here? Note that this already handles "impact" vs. "IMPACT" correctly. That is, the former is not extracted, while the latter is marked as a GGP and grounded to Q9P2X3.
I see, you're right about
Let's discuss stop words first, since they may be simpler to address.
This is related to clulab/bioresources#32, and also clulab/bioresources#30. Currently, stopwords and overrides are used in a case-insensitive way with the following behavior:
- `M2` is a chemical name, whereas `m2` ought to be a stopword (mostly used to mean meter squared). Another example is `IMPACT`, which is a perfectly valid gene name if capitalized like this (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:20387), but `impact` (very common and never referring to this gene) ought to be a stopword.
- There are also strings such as `pLK` or `rSK1`, which we ought to apply case-sensitive overrides to, to correct for the fact that they get grounded as if they were `PLK` and `RSK1`, respectively.

(All this is true for grounding in Reach in general, where the priority order between grounding files defines which grounding is chosen, rather than which entry's case-sensitive match is closer to the string. But I am not sure that this should be changed.)
So the question is, would it be straightforward to change this behavior?
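To make the override case concrete, here is a small Python sketch of one possible lookup order. It is an illustration under stated assumptions, not how Reach resolves groundings today: `resolve`, the table layout, and the placeholder grounding values are all hypothetical. The sketch prefers an exact-case match across tables and only falls back to the existing priority-ordered, case-insensitive lookup when no exact match exists.

```python
def resolve(text, tables):
    """Look up `text` in `tables`, a list of dicts in priority order
    (overrides first). Hypothetical helper for illustration only.
    """
    # Pass 1: an exact-case match wins, checked in priority order.
    for table in tables:
        if text in table:
            return table[text]

    # Pass 2: fall back to case-insensitive matching in priority order,
    # which mirrors the current behavior described in this issue.
    key = text.lower()
    for table in tables:
        for entry, grounding in table.items():
            if entry.lower() == key:
                return grounding
    return None


# Placeholder values, not real database identifiers.
overrides = {"pLK": "override-entry-for-pLK"}
kb = {"PLK": "kb-entry-for-PLK"}

print(resolve("pLK", [overrides, kb]))
print(resolve("PLK", [overrides, kb]))
```

Here `pLK` picks up the override while `PLK` keeps its regular entry. A string like `plk`, which matches neither spelling exactly, still goes to the higher-priority table, so nothing changes where there is no exact-case evidence.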