Optional Lindera tokenizer support (was: Custom tokenizer support) #200

Draft
wants to merge 10 commits into
base: master
Changes from 2 commits
543 changes: 519 additions & 24 deletions Cargo.lock

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions Cargo.toml
@@ -22,7 +22,15 @@ futures = "0.3.26"
pythonize = "0.20.0"
serde = "1.0"
serde_json = "1.0.91"
# Lindera
lindera-core = { version = "0.27.2", optional = true }
lindera-dictionary = { version = "0.27.2", optional = true }
lindera-tantivy = { version = "0.27.1", optional = true, features = ["ipadic"] }

[dependencies.pyo3]
version = "0.20.0"
features = ["chrono", "extension-module"]

[features]
lindera = ["lindera-core", "lindera-dictionary", "lindera-tantivy"]

13 changes: 13 additions & 0 deletions noxfile.py
@@ -5,4 +5,17 @@
def test(session):
    session.install("-rrequirements-dev.txt")
    session.install("-e", ".", "--no-build-isolation")
    session.run("pytest", "-m", "not lindera", *session.posargs)


@nox.session(python=["3.8", "3.9", "3.10", "3.11", "3.12"])
def test_lindera(session):
    session.install("-rrequirements-dev.txt")
    session.install(
        "--no-build-isolation",
        '--config-settings',
        'build-args="--features=lindera"',
Collaborator Author:

It took a surprisingly long time to discover the correct way to spell how to get this value sent down to the build backend!

Collaborator:

Yes, even with the unified pyproject.toml approach, wiring Python builds is still a mess IMHO. Do you remember where in the Maturin documentation you would have expected this? Maybe an improvement is just a PR away?

Collaborator Author:

Maturin wasn't too bad: it is pretty easy to find "feature selection", which shows up under the build-wheels section in the maturin docs. Much more difficult to find was how to propagate the feature parameter through the pip interface, as above.

I had a quick look, and I think this section is probably the best place to mention the build-args parameter: https://www.maturin.rs/index.html?highlight=pip%20install#python-packaging-basics

        "-e",
        ".",
    )
    session.run("pytest", *session.posargs)
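
The review thread above comes down to one detail: the Cargo feature reaches the build backend through pip's `--config-settings` option, as a `build-args` value. A minimal sketch of that argument list follows, assuming a hypothetical helper named `pip_install_args` (not part of nox, pip, or maturin) and reusing the exact spelling from the `test_lindera` session:

```python
from typing import List, Sequence


def pip_install_args(features: Sequence[str], editable_path: str = ".") -> List[str]:
    """Assemble `pip install` arguments that forward Cargo features.

    Hypothetical helper for illustration only: it just builds strings and
    mirrors the spelling used in the noxfile in this PR.
    """
    args = ["--no-build-isolation"]
    if features:
        # Cargo accepts a comma-separated feature list after --features=.
        feature_list = ",".join(features)
        args += ["--config-settings", f'build-args="--features={feature_list}"']
    args += ["-e", editable_path]
    return args


if __name__ == "__main__":
    print(pip_install_args(["lindera"]))
```

Inside a nox session this could be passed along as `session.install(*pip_install_args(["lindera"]))`; with an empty feature list it falls back to a plain editable install.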
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -15,6 +15,9 @@ dev = [
bindings = "pyo3"

[tool.pytest.ini_options]
markers = [
"lindera: mark a test as requiring lindera",
]
# Set the durations option and doctest modules
# See https://docs.pytest.org/en/latest/usage.html#durations
addopts = "--doctest-modules --durations=10"
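
The `lindera` marker registered here is what lets the default `test` session in the noxfile deselect those tests with `-m "not lindera"`. As a rough illustration only, here is a toy model of that deselection in plain Python. It does not use pytest itself, and the test inventory (including `tests/test_other.py`) and the `deselect` function are hypothetical:

```python
from typing import Dict, List, Set

# Hypothetical test inventory: test id -> set of markers applied to it.
TESTS: Dict[str, Set[str]] = {
    "tests/test_lindera.py::test_basic": {"lindera"},
    "tests/test_other.py::test_something": set(),
}


def deselect(tests: Dict[str, Set[str]], marker: str) -> List[str]:
    """Return the test ids that remain when `marker` is excluded.

    This models the effect of a `not <marker>` expression; it is not
    how pytest implements marker expressions internally.
    """
    return sorted(name for name, markers in tests.items() if marker not in markers)


if __name__ == "__main__":
    print(deselect(TESTS, "lindera"))
```

In this model the default session would be left with only the unmarked test, while the `test_lindera` session, which passes no `-m` filter, would run both.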
12 changes: 12 additions & 0 deletions src/index.rs
@@ -244,6 +244,18 @@ impl Index {
        Ok(Index { index, reader })
    }

    /// Register the lindera tokenizer
    ///
    /// This will only be available if tantivy-py was built with the "lindera"
    /// feature.
    #[cfg(feature = "lindera")]
    fn register_lindera_tokenizer(
        &self,
    ) {
        let tokenizer = crate::lindera_tokenizer::create_tokenizer(lindera_core::mode::Mode::Normal);
        self.index.tokenizers().register("lang_ja", tokenizer);
    }

    /// Create a `IndexWriter` for the index.
    ///
    /// The writer will be multithreaded and the provided heap size will be
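
Since `register_lindera_tokenizer` is compiled only when the `lindera` feature is enabled, a build without the feature would presumably expose no such method on `Index`. A minimal sketch of how calling code might probe for it is below. It uses stand-in classes rather than the real `tantivy.Index` so that it runs anywhere, and `has_lindera` and `register_if_available` are hypothetical names, not part of tantivy-py:

```python
class IndexWithLindera:
    """Stand-in for tantivy.Index built with the lindera feature."""

    def __init__(self) -> None:
        self.registered = []

    def register_lindera_tokenizer(self) -> None:
        # The real method registers the tokenizer under the name "lang_ja".
        self.registered.append("lang_ja")


class IndexWithoutLindera:
    """Stand-in for tantivy.Index built without the feature."""


def has_lindera(index) -> bool:
    """Report whether this index object exposes the optional method."""
    return callable(getattr(index, "register_lindera_tokenizer", None))


def register_if_available(index) -> bool:
    """Register the tokenizer when possible; return whether it happened."""
    if not has_lindera(index):
        return False
    index.register_lindera_tokenizer()
    return True


if __name__ == "__main__":
    print(register_if_available(IndexWithLindera()))
    print(register_if_available(IndexWithoutLindera()))
```

A check like this lets callers degrade gracefully instead of hitting an `AttributeError` on a build without the feature.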
2 changes: 2 additions & 0 deletions src/lib.rs
@@ -11,6 +11,8 @@ mod schema;
mod schemabuilder;
mod searcher;
mod snippet;
#[cfg(feature = "lindera")]
mod lindera_tokenizer;

use document::Document;
use facet::Facet;
16 changes: 16 additions & 0 deletions src/lindera_tokenizer.rs
@@ -0,0 +1,16 @@
use lindera_core::mode::Mode;
use lindera_dictionary::{
    load_dictionary_from_config, DictionaryConfig, DictionaryKind,
};
use lindera_tantivy::tokenizer::LinderaTokenizer;

pub fn create_tokenizer(mode: Mode) -> LinderaTokenizer {
    let dictionary_config = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };
    let dictionary = load_dictionary_from_config(dictionary_config).unwrap();
    let tokenizer = LinderaTokenizer::new(dictionary, None, mode);

    tokenizer
}
18 changes: 18 additions & 0 deletions tests/test_lindera.py
@@ -0,0 +1,18 @@
import pytest
pytestmark = pytest.mark.lindera

from tantivy import SchemaBuilder, Index, Document


def test_basic():
    sb = SchemaBuilder()
    sb.add_text_field("title", stored=True, tokenizer_name="lang_ja")
    schema = sb.build()
    index = Index(schema)
    index.register_lindera_tokenizer()
    writer = index.writer(50_000_000)
    doc = Document()
    doc.add_text("title", "成田国際空港")
    writer.add_document(doc)
    writer.commit()
    index.reload()