KeywordProcessor returns wrong span for text containing non-ascii characters when case_sensitive=False #119

MauroLuzzatto · 2020-11-15T16:30:53Z

Hi all, first thanks a lot for the great library you created, I really appreciate it!

When working with non-ASCII characters, I found a case where the span returned by the KeywordProcessor is wrong when case_sensitive=False.

Please find a sample below that reproduces the error:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keyword('Bay Area')

text = 'İ I love big Apple and Bay Area.'  # added the "İ" non-ascii character 

keywords_found = keyword_processor.extract_keywords(text, span_info=True)

for match in keywords_found:
    print(match)
    print(text[match[1]:match[2]])

Output:

('Bay Area', 24, 32)
ay Area. # the span is shifted by one

While looking into the error, I found that the length of “İ” changes from 1 (uppercase) to 2 (lowercase), which I believe causes the span shift (the span is only wrong when matching is not case sensitive).

len("İ")
Out[39]: 1

len("İ".lower())
Out[40]: 2
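
To make the mechanism concrete, here is a minimal sketch that uses only the Python standard library, not flashtext itself. It assumes the matcher searches a lowercased copy of the text and reports indices from that copy, which is consistent with the output above but is not confirmed from the library source here.

```python
import unicodedata

text = "İ I love big Apple and Bay Area."
lowered = text.lower()

# "İ" (U+0130) lowercases to two code points: "i" plus a combining dot above.
names = [unicodedata.name(ch) for ch in "İ".lower()]
print(names)

# The same keyword starts one position later in the lowercased copy.
start_in_original = text.find("Bay Area")
start_in_lowered = lowered.find("bay area")
print(start_in_original, start_in_lowered)

# Slicing the ORIGINAL text with an index taken from the lowercased copy
# reproduces the shifted result from the report.
shifted = text[start_in_lowered:start_in_lowered + len("Bay Area")]
print(shifted)
```

Every character after the "İ" sits one index further along in the lowercased copy, so any span computed there is off by one when applied to the original string.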

Could any of the authors comment on the issue and say whether they intend to do something about it or whether it is out of scope?

Thanks a lot!


NLPShenanigans · 2023-11-23T16:02:43Z

Hey Mauro, it doesn't look like the repo is being actively maintained these days. As a pet project, I am going through the codebase to give it a revamp. Given that this issue is not especially common, with non-ASCII characters or otherwise, what I've done to address it amounts to the following:

- Inserting some thoughtfully placed if statements to catch instances where the length changes on lowercasing, and raising a ValueError in such cases.
- Ensuring appropriate text normalisation before passing the text as an argument to functions that make use of lowercasing.
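
The two measures above could look roughly like the sketch below. The helper names `lower_checked` and `normalise` are hypothetical and are not part of flashtext, and the choice of NFD as the normalisation form is an assumption made for illustration, since the comment does not say which form is meant.

```python
import unicodedata


def lower_checked(text: str) -> str:
    """Lowercase text, refusing input whose length changes when lowercased."""
    lowered = text.lower()
    if len(lowered) != len(text):
        # Point at the first offending character so the caller can clean it.
        for index, ch in enumerate(text):
            if len(ch.lower()) != 1:
                raise ValueError(
                    f"character {ch!r} at index {index} changes length when lowercased"
                )
        raise ValueError("text changes length when lowercased")
    return lowered


def normalise(text: str) -> str:
    """Hypothetical pre-processing step: canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


raw = "İ I love big Apple and Bay Area."

try:
    lower_checked(raw)
    raised = False
except ValueError:
    raised = True

# After NFD, "İ" is already "I" plus a combining dot above, so lowercasing
# no longer changes the length of the string.
clean = normalise(raw)
lowered_clean = lower_checked(clean)
```

Note that spans computed on `clean` refer to the normalised string, not to `raw`, so the caller has to keep working with the normalised text for the indices to line up.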

In such instances, the onus is usually on the user to make sure the text is normalised. This is fundamentally a text cleanliness issue rather than an issue with calculating the spans, which so far looks to be behaving as it should in this case. If the length of the string changes part way through processing, I consider raising an error sensible, since it blocks an incorrect span from being returned.
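
For readers who would rather recover correct spans than raise, one alternative is to record where each lowercased character came from and map indices back. This is a sketch of that general technique, not something proposed in this thread and not how flashtext or the linked robotoff fix works. The function name `lower_with_offsets` is hypothetical.

```python
def lower_with_offsets(text: str):
    """Lowercase character by character and record source indices.

    Returns the lowercased string and a list where offsets[i] is the index
    in `text` of the character that produced lowered[i]. A final sentinel
    entry lets exclusive end indices be mapped as well.
    """
    parts = []
    offsets = []
    for index, ch in enumerate(text):
        low = ch.lower()  # per-character, so context-sensitive rules are skipped
        parts.append(low)
        offsets.extend([index] * len(low))
    offsets.append(len(text))
    return "".join(parts), offsets


text = "İ I love big Apple and Bay Area."
lowered, offsets = lower_with_offsets(text)

# Search in the lowercased copy, then translate the span back.
start = lowered.find("bay area")
end = start + len("bay area")
orig_start, orig_end = offsets[start], offsets[end]
print(text[orig_start:orig_end])
```

The trade-off is an extra list as long as the text. Lowercasing one character at a time can also differ from `str.lower()` on the whole string for context-dependent cases such as the Greek final sigma.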

raphael0202 added a commit to openfoodfacts/robotoff that referenced this issue Apr 27, 2023

fix: fix span offset issue when case_sensitive=False

e865c75

See vi3k6i5/flashtext#119

raphael0202 mentioned this issue Apr 27, 2023

fix: fix flashtext openfoodfacts/robotoff#1108

Merged

raphael0202 added a commit to openfoodfacts/robotoff that referenced this issue Apr 27, 2023

fix: fix span offset issue when case_sensitive=False

15eb2f8

See vi3k6i5/flashtext#119
