Supporting tokenizer register #25
Comments
I am trying to add LinderaTokenizer to https://github.com/tantivy-search/tantivy-py/blob/master/src/schemabuilder.rs#L85, but I couldn't figure out where it should go. Do you have any idea?
Anywhere, as long as it happens before you index your documents. Also make sure you have declared in the schema that you want to use the tokenizer named "lang_ja" for your Japanese fields.
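For context, a minimal sketch of that ordering in plain tantivy (Rust) might look like the following. The analyzer here is just a stand-in built from tantivy's own `SimpleTokenizer`; a real Japanese setup would construct it from a Lindera tokenizer instead, and exact builder APIs differ between tantivy versions.

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer};
use tantivy::Index;

fn main() -> Result<(), tantivy::TantivyError> {
    // 1. Declare in the schema that this field uses the tokenizer named "lang_ja".
    let mut schema_builder = Schema::builder();
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("lang_ja")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let title = schema_builder.add_text_field(
        "title",
        TextOptions::default().set_indexing_options(indexing).set_stored(),
    );
    let schema = schema_builder.build();

    // 2. Register a tokenizer under that name -- anywhere, as long as it is
    //    before any documents are indexed. (Placeholder analyzer; swap in a
    //    Lindera-backed one for Japanese.)
    let index = Index::create_in_ram(schema);
    let analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .build();
    index.tokenizers().register("lang_ja", analyzer);

    // 3. Only then create a writer and add documents.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(tantivy::doc!(title => "成田国際空港"))?;
    writer.commit()?;
    Ok(())
}
```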
Note that these tokenizers typically require shipping a dictionary that is several MB in size, so they will not be shipped by default.
What's the progress on adding support for configurable tokenizers like tantivy-jieba? This is badly needed for non-ASCII text indexing.
I don't have time to work on this, but any help is welcome.
Could you provide some directions or suggestions we could try? I am willing to work on this.
I think a useful approach might be to add optional features to this crate, which could be enabled when building it from source with Maturin to include additional tokenizers. I'm not sure how best to integrate this with pip's optional dependency support, though.
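As a hypothetical sketch of that feature-flag idea: the extra tokenizer would only be compiled in (and registered) when the crate is built with the corresponding Cargo feature, e.g. `maturin build --features lindera`. The function and feature names below are illustrative, not part of tantivy-py.

```rust
#[cfg(feature = "lindera")]
fn register_japanese_tokenizer(index: &tantivy::Index) {
    // Build a Lindera-backed TextAnalyzer here (constructor details depend on
    // the lindera-tantivy version, so they are omitted) and register it under
    // the name the schema refers to:
    // index.tokenizers().register("lang_ja", analyzer);
    let _ = index;
}

#[cfg(not(feature = "lindera"))]
fn register_japanese_tokenizer(_index: &tantivy::Index) {
    // No-op in default builds, so the multi-megabyte dictionary never gets
    // compiled into the standard wheel.
}
```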
I will look at this within the next two weeks or so.
For (my own) future reference, the upstream tantivy docs for custom tokenizers are here.
I've started working on this in a branch here (currently incomplete): https://github.com/cjrh/tantivy-py/tree/custom-tokenizer-support. I think it will be possible to add support via features as suggested. We could also consider making builds that include support, just to make it a bit easier for users who might not have, or want, a Rust toolchain. But we'll have to be careful about a combinatorial explosion of builds; perhaps we'll limit the platforms for the "big" build, for example.
I've done a bit more work and put up my PR in draft mode: #200. I will try to add tantivy-jieba in a similar way, under a feature flag, in the next batch of work I get to. The user will have to build the tantivy-py wheel with the additional feature enabled. I've added a small Python test that shows the "user API" for enabling Lindera. We could decide that if the build is a Lindera build, then it should not be necessary to manually register the Lindera tokenizer, as below:

```python
from tantivy import Document, Index, SchemaBuilder


def test_basic():
    sb = SchemaBuilder()
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()
```

What is the user expectation of whether or not something like `register_lindera_tokenizer()` should be required? Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?).
I really appreciate your ongoing work on this issue.
You may have already resolved this question in #200, where you mentioned configurable options. As a Japanese speaker, let me explain why these modes exist. In Japanese text, words are not separated by spaces (the same is true for Chinese and Thai). "成田国際空港" (Narita International Airport) is a proper noun representing a single facility, which the default mode keeps as one token; a decompose-style mode would instead split it into its component words 成田 / 国際 / 空港 (Narita / international / airport). Which mode to use depends on the search needs. Interestingly, some libraries even provide three tokenization modes.
Welcome @pokutuna 😄 Thanks for the information; it is helpful for me to know why the modes exist. As a status update, I am currently working through options to make these tokenizers installable at runtime, rather than making custom builds that include them.
Currently, the tokenizer is hard-coded to default; it would be better to include some configurable tokenizers for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder):
https://github.com/tantivy-search/tantivy-py/blob/4ecf7119ea2fc5b3660f38d91a37dfb9e71ece7d/src/schemabuilder.rs#L85
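To illustrate the kind of change being asked for, here is a hedged sketch (not the actual tantivy-py source) of how the hard-coded tokenizer at that spot could instead be driven by a name passed in from Python; the helper name is hypothetical, but the tantivy schema calls are standard.

```rust
use tantivy::schema::{IndexRecordOption, TextFieldIndexing, TextOptions};

// Hypothetical helper: build TextOptions from a tokenizer name supplied by the
// Python caller instead of a hard-coded "default".
fn text_field_options(stored: bool, tokenizer_name: &str) -> TextOptions {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(tokenizer_name) // e.g. "default", "lang_ja", "jieba"
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    if stored {
        options.set_stored()
    } else {
        options
    }
}
```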