-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optional Lindera tokenizer support (was: Custom tokenizer support) #200
base: master
Are you sure you want to change the base?
Conversation
session.install( | ||
"--no-build-isolation", | ||
'--config-settings', | ||
'build-args="--features=lindera"', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took a surprisingly long time to discover the correct way to spell how to get this value sent down to the build backend!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, even with the unified pyproject.toml
approach, wiring Python builds is still a mess IMHO. Do you remember where in the Maturin documentation you would have expected this? Maybe an improvement is just a PR away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maturin wasn't too bad, it is pretty easy to find "feature selection" and that shows up under the build-wheels
section in the maturin docs. Much more difficult to find was how to propagate the feature parameter through the pip interface, as above.
I had a quick look, I think this section is probably the best place to mention the build-args
parameter: https://www.maturin.rs/index.html?highlight=pip%20install#python-packaging-basics
I've added all the lindera features, and also an interface to specify configurable options on registration. The code is still rough, for example I need to rename quite a few things. I'd love to also figure out how to spell the function signatures correctly, e.g. for the |
It would be much nicer to provide this as an additional package that can be installed. The lack of a stable ABI prevents this, unless we wrap up the Lindera stuff inside a C wrapper and then call that. I'm in two minds about proceeding with this PR as-is, that's why I haven't moved forward with it. Still thinking about it. |
Just thinking out loud, we could consider adding some kind of new |
Must be something in the water, because this got published just yesterday. Many projects dealing with the plugin problem: https://www.arroyo.dev/blog/rust-plugin-systems |
Ok so the reason I haven't moved forward with this is because I want to explore different solutions that allow a user to pip install additional tokenizers and register them at runtime. It might be possible to create a C abi interface for tokenizers that will allow this to work. |
See #25
This is an early draft for discussion purposes. It's not ready to be used. The primary change is that we incorporate the Lindera tantivy tokenizer support under a feature flag. This means that tantivy-py must be built with that feature flag specified. The included tests have an example of this. It is possible to run the tests both with and without the Lindera support. We need to decide how we want this to be set up.