`writer.delete_documents` on tokenized fields behaves unintuitively #297

lvjg · 2024-06-07T11:52:23Z

add three document，doc_id = test-1， test-2， test-3；
use writer.delete_documents(field_name="doc_id", field_value="test-1")
writer.commit()
writer.wait_merging_threads()
index.reload()

test-1 can still be found through search...

cjrh · 2024-06-07T13:21:36Z

If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/

Fudge · 2024-06-10T09:51:45Z

I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["1", "2", "3"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

Output is:

Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)

Fudge · 2024-06-10T15:49:39Z

Refreshing the searcher gives the expected behaviour:

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
**searcher = index.searcher()**

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

gives:

SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)

Maybe just user error, then.

Fudge · 2024-06-10T19:07:40Z

Both I and the original reporter were bitten by term-queries not matching values with characters like - and _, so the documents weren't actually deleted.

If you change 1,2,3 to 1, 2_2, 3-3 in the example above, only the first document works as expected. Searching for the documents with a phrase query finds the others the attempted delete.

This illustrates it:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting everything:")
for d in docs:
    print(writer.delete_documents("doc", d))

writer.commit()
index.reload()
searcher = index.searcher()

print("After delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

which gives:

Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)

Fudge · 2024-06-17T09:09:06Z

Just a quick follow up that changing the tokenizer to raw gives the expected behavior

schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")

and everything is deleted:

SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)

cjrh · 2024-06-17T10:03:15Z

@Fudge Thanks for looking at this ❤️

The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.

I haven't looked into this yet but I've been following your investigation. I wonder whether we can:

inside delete_documents(),
if the target field is text,
and the field has a tokenizer,
apply the tokenizer to the given values,
and then continue to call the underlying delete function.

Does this sound like it would fix the issue?

I've been using delete_documents in tantivy always on int fields (like a doc_id field) and that always works. So yeah sounds like the tokenizer is the issue here.

Fudge · 2024-06-17T13:00:24Z

I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete version-1.1.0 would also delete version-1.10 and version-11.0 for example.

delete_documents() on tokenized text fields is not intuitive, and should come with a warning. :-)

tommyip · 2024-08-22T09:46:37Z

Even for integer fields by default delete_documents does not work. It turns out for all non-text field there is an indexed flag which needs to be enabled for deletion to work.

schema_builder.add_integer_field("doc_id", stored=True, indexed=True)

cjrh · 2024-08-22T12:11:32Z

there is an indexed flag which needs to be enabled for deletion to work.

Is this also how tantivy works? If so, we're not going to change the behaviour although we could certainly add documentation to warn about it.

The behaviour you show for non-indexed fields is different to what this issue is about thought, which has to do with how the field tokenizer affects match during delete. I'll edit the issue title to make that clear. The non-indexed behaviour should either be a separate issue, or if this happens also with upstream tantivy, an issue there. I suspect they will mark it as a documentation issue though.

tommyip · 2024-08-22T16:01:27Z

You are right for non-text fields this is simply a documentation issue.

cjrh added bug Something isn't working help wanted Extra attention is needed labels Jun 10, 2024

cjrh added documentation Improvements or additions to documentation and removed bug Something isn't working labels Jun 11, 2024

cjrh changed the title ~~writer.delete_documents not work~~ writer.delete_documents on text fields behaves unintuitively Aug 22, 2024

cjrh changed the title ~~writer.delete_documents on text fields behaves unintuitively~~ writer.delete_documents on tokenized fields behaves unintuitively Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`writer.delete_documents` on tokenized fields behaves unintuitively #297

`writer.delete_documents` on tokenized fields behaves unintuitively #297

lvjg commented Jun 7, 2024

cjrh commented Jun 7, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 17, 2024

cjrh commented Jun 17, 2024

Fudge commented Jun 17, 2024

tommyip commented Aug 22, 2024

cjrh commented Aug 22, 2024

tommyip commented Aug 22, 2024

writer.delete_documents on tokenized fields behaves unintuitively #297

writer.delete_documents on tokenized fields behaves unintuitively #297

Comments

lvjg commented Jun 7, 2024

cjrh commented Jun 7, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 10, 2024

Fudge commented Jun 17, 2024

cjrh commented Jun 17, 2024

Fudge commented Jun 17, 2024

tommyip commented Aug 22, 2024

cjrh commented Aug 22, 2024

tommyip commented Aug 22, 2024

`writer.delete_documents` on tokenized fields behaves unintuitively #297

`writer.delete_documents` on tokenized fields behaves unintuitively #297