Supporting tokenizer register #25
Comments
I am trying to add LinderaTokenizer to https://github.com/tantivy-search/tantivy-py/blob/master/src/schemabuilder.rs#L85, but I couldn't figure out where it should go. Do you have any idea?
Anywhere, as long as it happens before you index your documents. Also make sure you have declared in the schema that you want to use the tokenizer named "lang_ja" for your Japanese fields.
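For context, a minimal sketch of that ordering in plain tantivy (Rust) might look like the following. The analyzer here is just a stand-in built from tantivy's own `SimpleTokenizer`; a real Japanese setup would construct it from a Lindera tokenizer instead, and exact builder APIs differ between tantivy versions.

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, TextAnalyzer};
use tantivy::Index;

fn main() -> Result<(), tantivy::TantivyError> {
    // 1. Declare in the schema that this field uses the tokenizer named "lang_ja".
    let mut schema_builder = Schema::builder();
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("lang_ja")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let title = schema_builder.add_text_field(
        "title",
        TextOptions::default().set_indexing_options(indexing).set_stored(),
    );
    let schema = schema_builder.build();

    // 2. Register a tokenizer under that name -- anywhere, as long as it is
    //    before any documents are indexed. (Placeholder analyzer; swap in a
    //    Lindera-backed one for Japanese.)
    let index = Index::create_in_ram(schema);
    let analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .build();
    index.tokenizers().register("lang_ja", analyzer);

    // 3. Only then create a writer and add documents.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(tantivy::doc!(title => "成田国際空港"))?;
    writer.commit()?;
    Ok(())
}
```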
Note that these tokenizers typically require shipping a dictionary that is several MB in size, so they will not be shipped by default.
What's the progress on adding support for configurable tokenizers like tantivy-jieba? This is badly needed for non-ASCII text indexing.
I don't have time to work on this, but any help is welcome.
Could you provide some directions or suggestions we could try? I am willing to work on this.
I think a useful approach might be to add optional features to this crate, which could be enabled when building it from source with Maturin to include additional tokenizers. I'm not sure how best to integrate this with pip's optional dependency support, though.
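As a hypothetical sketch of that feature-flag idea: the extra tokenizer would only be compiled in (and registered) when the crate is built with the corresponding Cargo feature, e.g. `maturin build --features lindera`. The function and feature names below are illustrative, not part of tantivy-py.

```rust
#[cfg(feature = "lindera")]
fn register_japanese_tokenizer(index: &tantivy::Index) {
    // Build a Lindera-backed TextAnalyzer here (constructor details depend on
    // the lindera-tantivy version, so they are omitted) and register it under
    // the name the schema refers to:
    // index.tokenizers().register("lang_ja", analyzer);
    let _ = index;
}

#[cfg(not(feature = "lindera"))]
fn register_japanese_tokenizer(_index: &tantivy::Index) {
    // No-op in default builds, so the multi-megabyte dictionary never gets
    // compiled into the standard wheel.
}
```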
I will look at this within the next two weeks or so.
For (my own) future reference, the upstream tantivy docs for custom tokenizers are here.
I've started working on this in a branch here (currently incomplete): https://github.com/cjrh/tantivy-py/tree/custom-tokenizer-support. I think it will be possible to add support via features as suggested. We could also consider making builds that include support, just to make it a bit easier for users who might not have, or want, a Rust toolchain. But we'll have to be careful about a combinatorial explosion of builds; perhaps we'll limit the platforms for the "big" build, for example.
I've done a bit more work and put up my PR in draft mode: #200. I will try to add tantivy-jieba in a similar way, under a feature flag, in the next batch of work I get to. The user will have to build the tantivy-py wheel with the additional feature enabled. I've added a small Python test that shows the "user API" for enabling Lindera. We could decide that if the build is a Lindera build, then it should not be necessary to manually register the Lindera tokenizer, as below:

```python
from tantivy import Document, Index, SchemaBuilder


def test_basic():
    sb = SchemaBuilder()
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()
```

What is the user expectation of whether or not something like `register_lindera_tokenizer()` should be required? Also, there are things that seem like settings in the configuration of the tokenizer itself (what's "mode"?).
I really appreciate your ongoing work on this issue.
You may have already resolved this question in #200, where you mentioned configurable options. As a Japanese speaker, let me explain why these modes exist. In Japanese text, words are not separated by spaces (the same is true for Chinese and Thai). "成田国際空港" (Narita International Airport) is a proper noun representing a single facility, which the default mode keeps as one token; a decompose-style mode would instead split it into its component words 成田 / 国際 / 空港 (Narita / international / airport). Which mode to use depends on the search needs. Interestingly, some libraries even provide three tokenization modes.
Welcome @pokutuna 😄 Thanks for the information; it is helpful for me to know why the modes exist. As a status update, I am currently working through options to make these tokenizers installable at runtime, rather than making custom builds that include them.
Currently, the tokenizer is hard-coded to default; it would be better to include some configurable tokenizers for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder):
https://github.com/tantivy-search/tantivy-py/blob/4ecf7119ea2fc5b3660f38d91a37dfb9e71ece7d/src/schemabuilder.rs#L85
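To illustrate the kind of change being asked for, here is a hedged sketch (not the actual tantivy-py source) of how the hard-coded tokenizer at that spot could instead be driven by a name passed in from Python; the helper name is hypothetical, but the tantivy schema calls are standard.

```rust
use tantivy::schema::{IndexRecordOption, TextFieldIndexing, TextOptions};

// Hypothetical helper: build TextOptions from a tokenizer name supplied by the
// Python caller instead of a hard-coded "default".
fn text_field_options(stored: bool, tokenizer_name: &str) -> TextOptions {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(tokenizer_name) // e.g. "default", "lang_ja", "jieba"
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    if stored {
        options.set_stored()
    } else {
        options
    }
}
```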