Case-sensitive handling of stopwords and overrides #680

bgyori · 2020-04-24T01:44:39Z

This is related to clulab/bioresources#32, and also clulab/bioresources#30. Currently, stopwords and overrides are used in a case-insensitive way with the following behavior:

- If there is a grounding entry that matches a stopword in a case-insensitive way, the entity is extracted and the grounding is always produced. Example where this is a problem: M2 is a chemical name, whereas m2 ought to be a stopword (mostly used to mean square meters). Another example is IMPACT, which is a perfectly valid gene name if capitalized like this (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:20387), but impact (very common and never referring to this gene) ought to be a stopword.
- If there is a grounding entry that matches an override in a case-insensitive way, the override is always applied. Example where this is a problem: pLK or rSK1, to which we ought to apply case-sensitive overrides, to correct for the fact that they get grounded as if they were PLK and RSK1, respectively.

(All this is true for grounding in Reach in general where the priority order between grounding files defines which grounding is chosen rather than which entry's case-sensitive match is closer to the string. But I am not necessarily sure that that should be changed.)

So the question is, would it be straightforward to change this behavior?

MihaiSurdeanu · 2020-04-24T02:49:39Z

This would take some engineering to fix...
We made this decision early on, because people write gene/protein names inconsistently. Sometimes they are capitalized, sometimes they are not. So, in addition to the engineering work, I am afraid that this change may cause us to lose some valid entities...

bgyori · 2020-04-24T03:48:19Z

What I am suggesting is a bit more specific than making everything case sensitive. Namely, this would only apply in cases where there is ambiguity between two choices, i.e., a grounding entry and either a stopword or an explicit override with different capitalization. If there is no ambiguity with either of these entry types then case-insensitive matching would continue to apply.
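
A minimal sketch of the rule described above, written in Python purely for illustration (Reach itself is Scala, and this is not its code). The names `build_index` and `resolve`, the `placeholder:` identifiers, and the fallback priority order (override, then stopword, then grounding) are all assumptions made for the example:

```python
def build_index(entries):
    """Group original-case keys under their lowercased form."""
    index = {}
    for key, value in entries.items():
        index.setdefault(key.lower(), {})[key] = value
    return index


def resolve(text, overrides, stopwords, groundings):
    """Return (kind, value); kind is 'override', 'stopword', 'grounding' or 'none'."""
    # Assumed priority order; the thread does not specify one.
    tables = (("override", overrides), ("stopword", stopwords), ("grounding", groundings))
    folded = text.lower()
    # Pass 1: an exact-case match in any table settles the ambiguity.
    for kind, table in tables:
        variants = table.get(folded, {})
        if text in variants:
            return kind, variants[text]
    # Pass 2: no exact-case match anywhere, so case-insensitive matching
    # continues to apply, in priority order.
    for kind, table in tables:
        variants = table.get(folded)
        if variants:
            return kind, next(iter(variants.values()))
    return "none", None


# Hypothetical entries; the identifiers are placeholders, not real groundings.
overrides = build_index({"pLK": "placeholder:pLK"})
stopwords = build_index(dict.fromkeys(["m2"]))
groundings = build_index({"M2": "placeholder:M2", "PLK": "placeholder:PLK"})

for mention in ("m2", "M2", "pLK", "PLK", "Plk"):
    print(mention, resolve(mention, overrides, stopwords, groundings))
```

With these entries, m2 resolves as a stopword and M2 as a grounding, pLK takes the override while PLK keeps its own grounding, and Plk (no exact-case match anywhere) falls back to the case-insensitive behavior.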

MihaiSurdeanu · 2020-04-24T03:49:39Z

I see. Let me think about this.

MihaiSurdeanu · 2020-04-25T04:03:32Z

@bgyori: for the stop words matching, maybe we can improve the logic here?
https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85

Note that this already handles "impact" vs. "IMPACT" correctly. That is, the former is not extracted, while the latter is marked as a GGP and grounded to Q9P2X3.

bgyori · 2020-04-27T14:37:30Z

> @bgyori: for the stop words matching, maybe we can improve the logic here?
> https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85
>
> Note that this already handles "impact" vs. "IMPACT" correctly. That is, the former is not extracted, while the latter is marked as a GGP and grounded to Q9P2X3.

I see, you're right about impact, so that wasn't a valid example in my original comment. Maybe a better example for ignores is II, which should be ignored when written in all caps, whereas Ii is a valid protein synonym. As for overrides, there is no special logic like there is for stopwords, and the override is applied irrespective of capitalization, right?

MihaiSurdeanu · 2020-04-28T14:21:51Z

Let's discuss stop words first, since they may be simpler to address.
However, I am struggling to find a general solution for handling stop words. We found in the past that capitalization is a strong indicator that we are looking at valid protein names... I can think of maybe two solutions:

1. A simple fix: we now remove stop words that have an upper-case initial. We can refine this, and change it to removing them only if they follow punctuation. If not, they are probably valid names.
2. We could come up with two stop word lists: (a) one that is case sensitive, and (b) one that is not.
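
Both options can be sketched briefly. The following is an illustration in Python under stated assumptions, not the logic in BioNERPostProcessor.scala: the function names, the punctuation set, the sample list contents, and the choice to treat the first token of a sentence like a token that follows punctuation are all hypothetical.

```python
# Assumed set of punctuation tokens that can precede a sentence-initial word.
PUNCTUATION = {".", "!", "?", ":", ";"}


def is_upper_initial(token):
    """True for tokens like 'Impact': upper-case initial, lower-case rest."""
    return token[:1].isupper() and token[1:].islower()


def should_drop(tokens, i, stopwords):
    """Option 1: drop upper-initial stop words only after punctuation."""
    token = tokens[i]
    if token.lower() not in stopwords:
        return False
    if token.islower():
        return True  # plain lower-case stop word
    if is_upper_initial(token):
        # Assumption: position 0 counts as following punctuation.
        return i == 0 or tokens[i - 1] in PUNCTUATION
    return False  # other capitalizations (e.g. all caps) are kept


def is_stopword(token, case_sensitive, case_insensitive):
    """Option 2: one exact-case list and one case-folded list."""
    return token in case_sensitive or token.lower() in case_insensitive


# Hypothetical sample data.
tokens = ["Impact", "of", "IMPACT", ".", "Impact", "and", "Impact"]
print([should_drop(tokens, i, {"impact"}) for i in range(len(tokens))])

case_sensitive = {"II", "m2"}         # matched exactly as written
case_insensitive = {"the", "and"}     # matched after lower-casing
for token in ("II", "Ii", "m2", "M2", "The"):
    print(token, is_stopword(token, case_sensitive, case_insensitive))
```

Under option 2, II and m2 are stopped while Ii and M2 remain available for grounding, which covers the examples raised earlier in this thread.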
