Panic when there is CJK character in the document #24

Closed
ghost opened this issue Oct 1, 2020 · 2 comments · Fixed by #26

Comments


ghost commented Oct 1, 2020

When trying the demo code with CJK characters,

writer.add_document(tantivy.Document(
    title=["老人与海"],
    body=[""" ..."""],
))

the thread panics with:

thread '' panicked at 'assertion failed: self.is_char_boundary(new_len)', /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9\src\libcore\macros\mod.rs:10:9

@fulmicoton
Contributor

The bug is confirmed.

The problem is in the implementation of Debug for Document. https://github.com/tantivy-search/tantivy-py/blob/master/src/document.rs#L102
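The assertion in the panic message, `self.is_char_boundary(new_len)`, is the check that Rust's `String::truncate` performs before cutting a string at a byte offset. As a minimal sketch, assuming the `Debug` implementation shortens field values by byte length (the linked source is not reproduced here, so that is an assumption), a boundary-safe truncation could look like the following. The helper name `truncate_at_char_boundary` is hypothetical and not part of tantivy-py.

```rust
/// Hypothetical helper, not part of tantivy-py: shorten `s` to at most
/// `max_bytes` bytes without splitting a multi-byte UTF-8 character.
fn truncate_at_char_boundary(s: &mut String, max_bytes: usize) {
    // Nothing to do if the string already fits.
    if s.len() <= max_bytes {
        return;
    }
    let mut end = max_bytes;
    // Walk back to the nearest char boundary. Index 0 is always a
    // boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    // `String::truncate` panics if `end` is not on a char boundary;
    // the loop above guarantees that it is.
    s.truncate(end);
}
```

Each character in "老人与海" takes 3 bytes in UTF-8, so a plain `truncate(10)` on that string would land inside the fourth character and panic, while `truncate_at_char_boundary(&mut title, 10)` backs off to byte 9 and leaves "老人与".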

@acc557 We will fix this swiftly. In the meantime, the workaround is fairly obvious.

You cannot call repr(doc), but everything else works.

print(doc["title"]), for instance, should work fine.

Also note that tantivy-py does not come with a Japanese tokenizer.
Tantivy has a good, maintained tokenizer called Lindera. To use it, you may have to compile your own version of tantivy-py, which requires some knowledge of Rust.

@ghost
Author

ghost commented Oct 1, 2020

Thank you for the really fast response!
I also created another issue #25 about tokenizers.

@poljar poljar closed this as completed in c86f0fc Oct 1, 2020
Sidhant29 pushed a commit to Sidhant29/tantivy-py that referenced this issue Jan 16, 2023
…ions/actions-rs/toolchain-16499b5e05bf2e26879000db0c1d13f7e13fa3af

build(deps): bump actions-rs/toolchain from 63eb9591781c46a70274cb3ebdf190fce92702e8 to 16499b5e05bf2e26879000db0c1d13f7e13fa3af