Title search appears to be broken with Chinese characters (possibly all UTF-8 multibyte characters) #3587
Comments
Yes, this is a duplicate of openzim/libzim#794 |
As a Chinese user of kiwix-android v3.9.1, I'm afraid the title search is also broken, so this issue may not be an exact duplicate. For example, when I look up "毛泽东" (Mao Zedong in English, i.e. Chairman Mao of the PRC) in the Chinese Wikipedia (all-maxi version, 2023-09), there is no match. If I try again character by character, the first character "毛" will trigger a long list of matches. (I suppose "毛泽东" is listed there, but not among the top dozens.) On my phone the third match is "毛一公", so let's enter "一" after "毛". This time there is no match again. I believe this is still related to character encoding and text tokenization, as pointed out by @xiaoyifang. |
I think that any ZIMs containing CJK text that were created with a libzim older than 8.2.1 should all have this issue.
@xiaoyifang Thanks for the pointer! I wonder if this is related to the missing English characters issue in any way. It's still a big issue in the latest all-maxi Chinese Wikipedia zim. |
The ZIM I tested was created in December, whereas that PR was merged back in June. Do we know which libzim is currently being used in MWoffliner?
This is fixed, but MWoffliner, the scraper for Wikipedia, still uses an old version of libzim. Everything works fine here. We just need to complete openzim/mwoffliner#1702
But just to point out that this issue relates to title search not working on the Android app with UTF8 multibyte characters, rather than Xapian search, which is what was fixed by openzim/libzim#802. Or maybe the Android app doesn't have title search any more (which is a shame if so, and a problem for searching any ZIM that doesn't have a Xapian index -- surely that can't be the case)? |
I don't want to belabour the point, but I tested title search in |
@Jaifroid Our ZIM files, at Kiwix, have two title indexes, see https://wiki.openzim.org/wiki/Search_indexes. If the Xapian one is there, then it ignores the native ZIM one which is the thing to do. |
No problem, but to move forward we need to be very precise about what we do. For example, here you took a special non-public ZIM file (only for apps), which is not the one first reported. By doing so, you fundamentally change the scope of the bug report, and that does not really make things easier.
It does have "a Xapian index". One for the titles suggestions, but not a fulltext Xapian index. This is done on purpose because:
I believe I have already given the reason why it does not work. Which other apps have you tested with? Might one of them be an app that works with the native ZIM title index?
Thanks for the further explanations, they help pinpoint the potential scope of this issue. It's clearly a serious problem for Chinese users.
I tested with Kiwix Android, Kiwix Desktop (Windows) 2.3.1-2, and Kiwix PWA. The last two can do title search on the Chinese medicine ZIM. The Android app can't. I chose that ZIM to test because I thought it would narrow down the issue. However, since it does indeed contain a Xapian non-FT index (something I was unaware of), it seems likely it should work with the Android app once the fix is in production. If not, we can revisit after. I'm not sure I agree that ignoring binary search of the title index is good behaviour for an app. It should be the last fallback IMHO. At least for KJS apps, searching Xapian indices is very slow, so will always be secondary to binary title search unless we can speed things up.
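To illustrate what "binary search of the title index" means here, below is a minimal Python sketch. It is not libzim or Kiwix code: the `prefix_search` helper, the sample titles, and the assumption that titles are ordered bytewise by their UTF-8 encoding are all hypothetical and for illustration only.

```python
from bisect import bisect_left


def prefix_search(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles starting with `prefix`.

    `sorted_titles` is a list of UTF-8 encoded titles sorted bytewise
    (an assumption made for this sketch, not a statement about libzim).
    """
    key = prefix.encode("utf-8")
    # Binary search for the first title that is >= the encoded prefix.
    start = bisect_left(sorted_titles, key)
    matches = []
    for raw in sorted_titles[start:]:
        # Titles sharing the prefix are contiguous in a sorted list.
        if len(matches) >= limit or not raw.startswith(key):
            break
        matches.append(raw.decode("utf-8"))
    return matches


# Hypothetical sample titles, encoded and sorted bytewise.
titles = sorted(
    t.encode("utf-8")
    for t in ["毛一公", "毛泽东", "毛泽东思想", "中医", "Mao Zedong"]
)

print(prefix_search(titles, "毛"))
print(prefix_search(titles, "毛泽"))
```

Because the comparison is done on whole encoded strings, multibyte characters need no special handling in this kind of fallback, as long as the query and the index use the same encoding and ordering.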
I don't know about Kiwix Desktop 2.3.1-2, but I tested with the cutting-edge (dev) version of Kiwix Desktop and it does not work, which is expected (I have just checked, because your sentence worried me). You should test with a ZIM made with a recent version of libzim, like https://library.kiwix.org/viewer#gutenberg_zh_all_2023-12/ ... and then things work like they should.
OK, sorry to have worried you |
A user on Reddit has reported that search in Chinese text is no longer working on Android v3.8.1, in both the Google Play and the APK versions.
Assuming this issue can be reproduced (should be easy with a Chinese ZIM, and using search for one of the titles given by the Random button, if that works), then I would suspect that UTF-8 3-byte (most Chinese characters) and UTF-8 4-byte character codes are somehow not being catered for when reading the search field.
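For reference, a short Python sketch of the byte lengths involved. The `utf8_lengths` helper is hypothetical and only shows how the characters in a query such as "毛泽东" are encoded; it says nothing about how the Android app actually reads the search field.

```python
def utf8_lengths(text):
    """Return the number of UTF-8 bytes used by each character in `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


query = "毛泽东"
print(utf8_lengths(query))                      # [3, 3, 3]
print(len(query), len(query.encode("utf-8")))   # 3 9

# U+20000 lies outside the Basic Multilingual Plane, so it needs 4 bytes.
print(utf8_lengths("\U00020000"))               # [4]
```

Code that counts or slices by bytes while assuming one byte per character would treat this 3-character query as 9 units, which is the kind of mismatch suspected above.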