I noticed that this program uses jieba.cut to segment Chinese text, but it doesn't seem to work well in some cases.
For example, for the Chinese phrase 永永远远是龙的传人, jieba.cut produces 永永远远/是/龙的传人, but jieba.cut_for_search produces 永远/远远/永永远远/是/传人/龙的传人, which I think is better for index search.
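For reference, a minimal reproduction with the Python jieba package (the outputs shown are the ones reported above; the exact segmentation may vary with the dictionary version):

```python
import jieba

text = "永永远远是龙的传人"

# Default mode: coarse-grained segmentation.
print("/".join(jieba.cut(text)))
# 永永远远/是/龙的传人

# Search-engine mode: also emits the shorter words contained in long words.
print("/".join(jieba.cut_for_search(text)))
# 永远/远远/永永远远/是/传人/龙的传人
```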
Hello @sivdead,
you're right, using cut_for_search would increase the recall of Meilisearch by splitting words in different ways.
However, Meilisearch relies on word positions for queries, and jieba.cut_for_search doesn't give any clue about the position of each token; moreover, charabia does not support shifting tokens.
In order to support this kind of position-shifting behavior, the charabia output would have to be changed to a tree shape; for instance, 永永远远是龙的传人 would be shaped as:
永永远远 ──┬─► 是 ─┬─► 龙的传人
永远 ──────┤       └─► 传人
远远 ──────┘
This is not possible without a significant amount of work,
but I have to admit that it would significantly enhance the search recall.
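For illustration only, here is a rough sketch of the data relationships behind that tree: the overlapping tokens from search mode can be grouped under the positions of the coarse segmentation by using jieba.tokenize, which exposes character offsets. This uses the Python jieba API, while charabia relies on a Rust implementation, and it is not a proposal for charabia's actual output format:

```python
import jieba

text = "永永远远是龙的传人"

# Coarse segmentation with character offsets: one token per position.
coarse = list(jieba.tokenize(text))               # [(word, start, end), ...]

# Fine-grained segmentation: overlapping tokens from search mode.
fine = list(jieba.tokenize(text, mode="search"))

# Give each fine-grained token the position of the coarse token whose
# character span contains it, so 永远 / 远远 / 永永远远 share position 0
# and 传人 / 龙的传人 share position 2.
tokens_with_positions = []
for word, start, end in fine:
    for pos, (_, c_start, c_end) in enumerate(coarse):
        if start >= c_start and end <= c_end:
            tokens_with_positions.append((word, pos))
            break

print(tokens_with_positions)
# roughly [('永远', 0), ('远远', 0), ('永永远远', 0), ('是', 1), ('传人', 2), ('龙的传人', 2)]
```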
Thank you for your report, and sorry for the delay in answering.