Term Query is not tokenized (?) #296

afbarbaro · 2024-06-06T09:37:29Z

I'm testing tantivy-py, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: it seems that term queries are not tokenized when using the searcher.search(query, ..) method, so I can't really use the en_stem tokenizer (since it's not exposed for me to tokenize the query, only the indexing of documents).

I'm testing tavinty-py with the Simple Wikipedia Example Set from Cohere and here's what I see with a few sample queries:

Australia monarchy --> no good hits unless I change it to Australia monarch
Titanic sink --> no good hits unless I change it to Titan sink

Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the search method, but tokenizers are not exposed in the Python bindings.

Any suggestions?

The text was updated successfully, but these errors were encountered:

cjrh · 2024-06-06T13:36:20Z

Hi @afbarbaro 😄

This is the old problem of how stemmers don't always do what we think they do.

Here's a small program I made to test queries quickly:

from tantivy import SchemaBuilder, Index, Document
schema = SchemaBuilder().add_text_field("body", stored=True, tokenizer_name="en_stem").build()
index = Index(schema=schema, path=None)
writer = index.writer(heap_size=15_000_000, num_threads=1)
doc = Document()
doc.add_text("body", "lovely freshness older")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Australian monarchy")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Titanic sinks")
writer.add_document(doc)
writer.commit()
index.reload()

searcher = index.searcher()

def find(query_text: str):
    q = index.parse_query(query_text, ["body"])
    if hits := searcher.search(q).hits:
        for _, doc_address in hits:
            doc = searcher.doc(doc_address)
            print(f"{query_text} hit: {doc['body']}")
    else:
        print(f"{query_text} not found")

# Run with `python -i main.py`

The idea is to run it with python -i main.py, so that when the script ends it leaves you in an interpreter where you can test using the find() function.

Indeed, as you saw, monarchy does not stem to monarch:

$ python -i main.py
>>> find('monarch')
monarch not found
>>> find('monarchs')
monarchs not found
>>> find('monarchy')
monarchy hit: ['Australian monarchy']

However, you'll see that if we search for Australians (with an "s"), which was not the term indexed, we'll get a hit:

>>> find('Australians')
Australians hit: ['Australian monarchy']

This means that tantivy-py is indeed stemming the query text. I checked the code and the query parsers does make use of the tokenizers registered on the fields.

Stemming can be very surprising, for example:

>>> find('monarchies')
monarchies hit: ['Australian monarchy']

Did you expect monarchies to work? The reason this happens is because the shared stem of both monarchy and monarchies is in fact monarchi, which you can verify on this demo page for the snowball stemmer: https://snowballstem.org/demo.html

In my test I do get a hit using Titan so I'm not sure why that doesn't work for you:

>>> find('Titan')
Titan hit: ['Titanic sinks']
>>> find('Titans')
Titans hit: ['Titanic sinks']
>>> find('Titanics')
Titanics hit: ['Titanic sinks']
>>> find('titanics')
titanics hit: ['Titanic sinks']

Sometimes stemming can be very frustrating. For example, in the first document you would think older should stem to old but this doesn't happen:

>>> find('old')
old not found
>>> find('older')
older hit: ['lovely freshness older']

Even oldest doesn't stem back to older:

>>> find('oldest')
oldest not found

This is because the stem for older is just older.

afbarbaro · 2024-06-06T22:46:15Z

@cjrh thanks so much for the detailed explanation. I know what is the difference between your code and example and what I was doing: you're using index.parse_query(query_text) and I was building the query using this:

subqueries = [(Occur.Should, Query.term_query(index.schema, 'body', term)) for term in terms]
query = tantivy.Query.boolean_query(subqueries)

so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?

One of the reasons why I was constructing the query myself is because in the Python version of index.parse_query you cannot pass the additional arguments available in the Rust true implementations - particularly the map for how to deal with Fuzzy search for a given field.

So I guess my question now is:

can I get the additional arguments for parse_query?
should term_query stem? if not, should there be a way to access the tokenizer for the field and I can invoke it before giving the term to term_query.

Again, thank you for your help!

afbarbaro · 2024-06-06T23:22:47Z

Also, for this:

should term_query stem? if not, should there be a way to access the tokenizer for the field and I can invoke it before giving the term to term_query.

There is a hack that I can use at least temporarily to get the stemmed terms by running parsed = index.parse_query(query_text) then cast the parsed variable to a str and get the stemmed words for each term query inside that. Pretty nasty, but gets me unblocked.

cjrh · 2024-06-07T13:30:31Z

so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?

Ah I see. To answer your question, this is not by design. Tantivy-py has been built by a fairly large number of volunteers and drive-by contributors over the years so there is relatively little that is specifically "designed". Earlier on we wanted to avoid adding all the fine-grained query classes (and other classes) and just have the parse_query function, but in the end this proved to be too limiting and so recently we started adding those classes in.

I think more people will come across this now that we have the Query class, so we'll need some kind of answer for how to do stemming on the query here. If you were using the upstream tantivy in Rust, what would the approach be? As a general guiding principle I'd tantivy-py to be a recognisable wrapper of tantivy so that familiarity with one allows famiiliarity with the other. Python in general has had great success with this approach of wrapping C libraries over the decades.

cjrh · 2024-06-07T13:34:55Z

should term_query stem?

Maybe.

It wraps TermQuery, so I'd want to know how stemming is handled in tantivy first, before deciding on a path for us.

cjrh · 2024-06-07T13:41:03Z

We do have the fuzzy map in parse_query, at least in main:

afbarbaro · 2024-06-07T23:04:31Z

Thanks @cjrh . The reason I didn't this these additional args being there is because the Python bindings are outdated.

tantivy-py/tantivy/tantivy.pyi

Line 364 in e3de7b1

def parse_query(

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

cjrh · 2024-06-08T08:40:13Z

Yes please that would be great if you can update those type annotations.

cjrh · 2024-06-10T11:55:31Z

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

Yes created by hand. It would be a simple PR.

afbarbaro changed the title ~~Query term is not tokenized (?)~~ Term Query is not tokenized (?) Jun 6, 2024

cjrh added the documentation Improvements or additions to documentation label Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Term Query is not tokenized (?) #296

Term Query is not tokenized (?) #296

afbarbaro commented Jun 6, 2024

cjrh commented Jun 6, 2024

afbarbaro commented Jun 6, 2024

afbarbaro commented Jun 6, 2024

cjrh commented Jun 7, 2024

cjrh commented Jun 7, 2024

cjrh commented Jun 7, 2024

afbarbaro commented Jun 7, 2024

cjrh commented Jun 8, 2024

cjrh commented Jun 10, 2024

Term Query is not tokenized (?) #296

Term Query is not tokenized (?) #296

Comments

afbarbaro commented Jun 6, 2024

cjrh commented Jun 6, 2024

afbarbaro commented Jun 6, 2024

afbarbaro commented Jun 6, 2024

cjrh commented Jun 7, 2024

cjrh commented Jun 7, 2024

cjrh commented Jun 7, 2024

afbarbaro commented Jun 7, 2024

cjrh commented Jun 8, 2024

cjrh commented Jun 10, 2024