Support for Devanagri and Indian Languages #123

srnthsrdhrn · 2021-05-06T09:37:54Z

Hi.
First of all, I would like to thank you for creating such a wonderful library. Really helps me a lot.

I am trying to use this for Devanagari (the script used for Hindi) specifically, and that is where I am facing issues.

When I extract keywords from a string, words that merely contain a keyword as a substring are also matched.

Example:

If I am searching for "Pam" I am also getting "Pamella".

From my rough understanding of the underlying algorithm, these cases ideally shouldn't occur.

So I am assuming this is something to do with the script of the text. Do we have a solution for this?

I came across this issue with Chinese: #43

There you mentioned that the absence of proper tokenization for the language is the cause. If that is the case here too, I should be able to help in that regard.

For anyone who lands on this issue looking for a solution, I am temporarily using a hack to get around it.

I use flashtext to extract candidate keywords, then use the regex library to search the text for only those extracted keywords. The regex library supports Unicode scripts, so expressions with word boundaries work for me.
flashtext reduces the search space, and regex still gives good turnaround times on what is left.
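
In case it helps, here is a rough sketch of the second stage of that hack, the whole-word check applied to the candidates. It is not my exact code: to keep it self-contained it uses only the standard library (`unicodedata`) in place of the regex library, and it assumes the candidate list has already been produced by flashtext. The helper names (`is_word_char`, `whole_word_spans`, `confirm_keywords`) are made up for this example.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Treat letters (L*), combining marks (M*) and numbers (N*) as word characters.

    Marks matter for Devanagari: vowel signs such as U+093E are combining
    characters, so a check limited to letters would split words apart.
    """
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "M", "N")


def whole_word_spans(text: str, keyword: str) -> list:
    """Return (start, end) spans where keyword occurs as a whole word in text."""
    spans = []
    if not keyword:
        return spans
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # A match counts only if the characters on both sides are not word characters.
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            spans.append((start, end))
        start = text.find(keyword, start + 1)
    return spans


def confirm_keywords(text: str, candidates: list) -> list:
    """Keep only the candidates (assumed to come from flashtext) that occur as whole words."""
    return [kw for kw in candidates if whole_word_spans(text, kw)]


if __name__ == "__main__":
    print(whole_word_spans("Pamella met Pam", "Pam"))
    print(confirm_keywords("रामायण", ["राम"]))
```

The check runs only on the handful of candidates that the first pass returns, so the cost of scanning the text again stays small.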
