Hi.
First of all, I would like to thank you for creating such a wonderful library. Really helps me a lot.
I am trying to use this for Devanagari (the script used for Hindi), and that is where I am facing issues.
When I extract keywords from a string, words that merely contain a keyword as a substring are matched as well.
Example:
If I am searching for "Pam" I am also getting "Pamella".
From my rough understanding of the underlying algorithm, these cases ideally shouldn't occur.
So I am assuming this is something to do with the script of the text. Do we have a solution for this?
I came across a similar issue for Chinese, #43, where you mentioned that the absence of proper tokenization for the language is the cause. If that is the case here as well, I should be able to help in that regard.
For people who come to this issue looking for a solution: I am temporarily using a hack to get around this.
I use flashtext to extract the keywords, then use the regex library to search the text for only those extracted keywords. regex supports Unicode scripts, so expressions with word boundaries work for me.
flashtext reduces the search space, and regex gives good turnaround times on the few keywords that remain.
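For anyone who wants to see the shape of the second step, here is a rough sketch of the boundary check only. It is not my exact code: to keep it dependency-free it uses a hand-rolled check built on the standard `unicodedata` module instead of the regex library's word boundaries, and the names `is_word_char` and `find_whole_words` are made up for illustration. The `candidates` list stands in for whatever keywords flashtext returned in the first step. The assumption here is that letters, combining marks (which covers Devanagari vowel signs and the virama), decimal digits and `_` count as word characters.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Assumed definition of a word character: letter (L*), mark (M*), digit (Nd) or '_'."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd" or ch == "_"


def find_whole_words(text: str, candidates: list[str]) -> list[tuple[str, int, int]]:
    """Return (keyword, start, end) for every occurrence that is not part of a longer word."""
    hits = []
    for kw in candidates:
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            # Accept the match only if the characters on both sides are not word characters.
            before_ok = start == 0 or not is_word_char(text[start - 1])
            after_ok = end == len(text) or not is_word_char(text[end])
            if before_ok and after_ok:
                hits.append((kw, start, end))
            start = text.find(kw, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical candidates, standing in for keywords extracted by flashtext.
print(find_whole_words("Pam met Pamella", ["Pam"]))
print(find_whole_words("राम और रामायण", ["राम"]))
```

In both calls only the standalone word should be reported, because the character right after the longer word's prefix (a letter in "Pamella", a vowel sign in "रामायण") counts as a word character. Since the check runs only over the candidates, the cost stays small even on long texts.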