
Supporting tokenizer register #25

Open · ghost opened this issue Oct 1, 2020 · 13 comments
Labels: enhancement (New feature or request)

@ghost commented Oct 1, 2020

Currently, the tokenizer is hard-coded to the default. It would be better to support configurable tokenizers for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder).

https://github.com/tantivy-search/tantivy-py/blob/4ecf7119ea2fc5b3660f38d91a37dfb9e71ece7d/src/schemabuilder.rs#L85

@ghost (Author) commented Oct 1, 2020

@fulmicoton

Also note that tantivy-py does not come with a Japanese tokenizer.
Tantivy has a good, maintained tokenizer called Lindera. If you know Rust, you may have to compile your own version of tantivy-py to use it.

I am trying to add LinderaTokenizer to https://github.com/tantivy-search/tantivy-py/blob/master/src/schemabuilder.rs#L85, but I can't figure out where

    index
        .tokenizers()
        .register("lang_ja", LinderaTokenizer::new("decompose", ""));

should go. Do you have any idea?

@fulmicoton (Contributor) commented:

Anywhere, as long as it happens before you index your documents.

Also make sure you declare in the schema that you want to use the tokenizer named "lang_ja" for your Japanese fields.
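
For illustration, here is a minimal sketch of that ordering when using tantivy directly from Rust. The LinderaTokenizer constructor is the one from the snippet above and may differ between lindera-tantivy versions; the rest assumes a recent tantivy API, so treat it as a sketch rather than copy-paste code.

    use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
    use tantivy::{doc, Index};
    use lindera_tantivy::tokenizer::LinderaTokenizer;

    fn main() -> tantivy::Result<()> {
        // 1. Declare in the schema that the "title" field uses the tokenizer
        //    registered under the name "lang_ja".
        let mut schema_builder = Schema::builder();
        let text_indexing = TextFieldIndexing::default()
            .set_tokenizer("lang_ja")
            .set_index_option(IndexRecordOption::WithFreqsAndPositions);
        let text_options = TextOptions::default()
            .set_indexing_options(text_indexing)
            .set_stored();
        let title = schema_builder.add_text_field("title", text_options);
        let schema = schema_builder.build();

        // 2. Register the tokenizer on the index before indexing anything.
        let index = Index::create_in_ram(schema);
        index
            .tokenizers()
            .register("lang_ja", LinderaTokenizer::new("decompose", ""));

        // 3. Only now create a writer and add documents.
        let mut writer = index.writer(50_000_000)?;
        writer.add_document(doc!(title => "成田国際空港"))?;
        writer.commit()?;
        Ok(())
    }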

@fulmicoton (Contributor) commented:

Note that these tokenizers typically require shipping a dictionary that is several MB in size, so they will not be shipped by default.
Ideally that should live in a different Python package, and registration of the tokenizer should be done by the user, as suggested by @acc557.

@zhangchunlin commented:

What's the progress on adding support for a configurable tokenizer like tantivy-jieba? This is badly needed for indexing non-ASCII text.

@fulmicoton (Contributor) commented:

I don't have time to work on this but any help is welcome.

@zhangchunlin commented:

Could you provide some directions or suggestions that we could try? I am willing to work on this.
Thank you~

@adamreichold (Collaborator) commented:

I think a useful approach might be to add optional features to this crate that can be enabled when building it from source with Maturin, in order to include additional tokenizers. I'm not sure how best to integrate this with pip's optional-dependency support, though...
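
For instance, a rough sketch of how a Cargo feature could gate an extra tokenizer on the Rust side; the feature name lindera and the helper function register_lindera are hypothetical, not an existing part of tantivy-py:

    // Compiled only when the crate is built with `--features lindera`.
    #[cfg(feature = "lindera")]
    pub fn register_lindera(index: &tantivy::Index) {
        use lindera_tantivy::tokenizer::LinderaTokenizer;

        index
            .tokenizers()
            .register("lang_ja", LinderaTokenizer::new("decompose", ""));
    }

    // Fallback when the optional dependency is not compiled in; the Python
    // binding could raise an informative error here instead.
    #[cfg(not(feature = "lindera"))]
    pub fn register_lindera(_index: &tantivy::Index) {}

Building from source with Maturin would then enable the feature explicitly, while a plain pip install would keep the default, dictionary-free wheel.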

@cjrh self-assigned this Sep 13, 2023

@cjrh (Collaborator) commented Sep 13, 2023

I will look at this within the next two weeks or so.

@cjrh (Collaborator) commented Jan 11, 2024

For (my own) future reference, the upstream tantivy docs for custom tokenizers are here.

@cjrh (Collaborator) commented Jan 21, 2024

I've started working on this in a branch here (currently incomplete): https://github.com/cjrh/tantivy-py/tree/custom-tokenizer-support

I think it will be possible to add support via features as suggested. We could also consider making builds that include support, just to make it a bit easier for users who might not have or want a Rust toolchain. But we'll have to be careful about a combinatorial explosion of builds. Perhaps we'll limit the platforms for the "big" build, for example.

@cjrh (Collaborator) commented Jan 29, 2024

I've done a bit more work and put up my PR in draft mode (#200). I will try to add tantivy-jieba in a similar way, behind a feature flag, in the next batch of work I get to.

The user will have to build the tantivy-py wheel with the additional build-args="--features=lindera" setting. (The tests demonstrate this.)

I've added a small Python test that shows the "user API" of enabling Lindera. We could decide that if the build is a Lindera build, then it should not be necessary to manually register the lindera tokenizer, as below:

from tantivy import Document, Index, SchemaBuilder

def test_basic():
    sb = SchemaBuilder()
    # Declare that the "title" field should be tokenized with the tokenizer
    # registered under the name "lang_ja".
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    # Only available when the wheel was built with --features=lindera (draft PR #200).
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()

What is the user expectation about whether something like register_lindera_tokenizer() should have to be called explicitly?

Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?). And finally, the examples in the README at https://github.com/lindera-morphology/lindera-tantivy show use of TextOptions, which means we probably need support for that in tantivy-py? (already done)

@pokutuna commented:

I really appreciate your ongoing work on this issue.

Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?).

You may have already resolved this question in #200, where you mentioned configurable options.
The explanation for Lindera can be found here (you probably know this already).


As a Japanese speaker, let me explain why these modes exist.

In Japanese text, words are not separated by spaces (the same is true for Chinese and Thai).
It is common to combine multiple words into compounds, so a long word can sometimes be a combination of shorter words.

"成田国際空港" (Narita International Airport) is a proper noun representing a single facility (mode=default),
but it can be split into "成田/国際/空港" (Narita/International/Airport) in the same semantic units as English words (mode=decompose).

Which mode to use depends on the search needs.
Splitting words into smaller parts (decompose mode) lets us match parts of words, but in exchange it increases the need for better result ranking and makes the index bigger.
When expanding queries to simultaneously search for synonyms, it is often better to keep proper nouns intact.

Interestingly, some libraries even provide three tokenization modes.
https://github.com/WorksApplications/Sudachi#the-modes-of-splitting

@cjrh (Collaborator) commented Oct 21, 2024

Welcome @pokutuna 😄

Thanks for the information; it is helpful for me to know why the modes exist.

For a status update, I am currently working through options to make these tokenizers installable at runtime, rather than making custom builds that include them.
