Hi all, first thanks a lot for the great library you created, I really appreciate it!
When working with non-ASCII characters, I found a case where the span returned by the KeywordProcessor is wrong when case_sensitive=False.
Please find a sample below that reproduces the error:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keyword('Bay Area')
text = 'İ I love big Apple and Bay Area.' # added the "İ" non-ascii character
keywords_found = keyword_processor.extract_keywords(text, span_info=True)
for match in keywords_found:
    print(match)
    print(text[match[1]:match[2]])
Output:
('Bay Area', 24, 32)
ay Area. # the span is shifted by one
When looking into the error, I found that the length of “İ” changes from 1 (uppercase) to 2 (lowercase), which I believe causes the span shift (the span is only wrong when matching is not case-sensitive).
len("İ")
Out[39]: 1
len("İ".lower())
Out[40]: 2
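To make the length change concrete, here is a minimal sketch using only the standard library. It shows that lowercasing “İ” (U+0130) yields two code points, and that the keyword therefore sits one position later in the lowercased text than in the original. It does not call flashtext at all; that the library searches a lowercased copy of the text is my assumption, based on the behaviour above.

```python
import unicodedata

text = "İ I love big Apple and Bay Area."
lowered = text.lower()

# "İ" lowercases to "i" followed by a combining dot above (U+0307).
print([unicodedata.name(ch) for ch in "İ".lower()])

# The same keyword starts one position later in the lowercased text.
start_original = text.find("Bay Area")
start_lowered = lowered.find("bay area")
print(start_original, start_lowered)

# Applying the lowercased offsets to the original text gives the shifted span.
end_lowered = start_lowered + len("bay area")
print(text[start_lowered:end_lowered])
```

The last line prints the same shifted slice as in the output above, which is consistent with offsets computed on the lowercased text being applied to the original string.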
Could any of the authors comment on the issue and mention whether they intend to do something about it or whether it is out of scope?
Thanks a lot!
raphael0202 added a commit to openfoodfacts/robotoff that referenced this issue on Apr 27, 2023
Hey Mauro, it doesn't look like the repo is being actively maintained these days. As a pet project, I was going to go through the codebase and give it a revamp. Given that this issue is not especially common, non-ASCII characters or otherwise, what I've done to address it amounts to the following:
inserting some thoughtfully placed if statements to catch cases where the length of the text changes on lowercasing, and raising a ValueError in such cases.
ensuring appropriate text normalisation before the text is passed as an argument to functions that make use of lowercasing.
In such instances, the onus is usually on the user to make sure the text is normalised. This is fundamentally a text cleanliness issue rather than an issue with how the spans are calculated, which so far looks to be behaving as it should in this case. If the length of the string is modified partway through, I would consider raising an error to be sensible, since it prevents an incorrect span from being returned.
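The two points above could be sketched roughly as follows. This is not the actual change made to the codebase: `checked_lower` and `length_preserving_lower` are hypothetical helper names of mine, and the second function is just one possible way a caller might prepare text so that offsets stay aligned with the original.

```python
def checked_lower(text: str) -> str:
    """Hypothetical guard: lowercase text, refusing length-changing input."""
    lowered = text.lower()
    if len(lowered) != len(text):
        raise ValueError(
            f"lowercasing changed the text length from {len(text)} "
            f"to {len(lowered)}; spans would be shifted"
        )
    return lowered


def length_preserving_lower(text: str) -> str:
    """Hypothetical normalisation: lowercase character by character and keep
    any character whose lowercase form is not exactly one code point."""
    return "".join(
        ch.lower() if len(ch.lower()) == 1 else ch
        for ch in text
    )


text = "İ I love big Apple and Bay Area."

# The guard rejects the problematic input instead of returning a wrong span.
try:
    checked_lower(text)
except ValueError as err:
    print(err)

# The length-preserving variant keeps offsets valid for the original text.
safe = length_preserving_lower(text)
start = safe.find("bay area")
print(text[start:start + len("bay area")])
```

The trade-off is that a character left unlowered, such as “İ” here, will no longer match case-insensitively, so this only helps when such characters are not part of the keywords themselves.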