Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

writer.delete_documents on tokenized fields behaves unintuitively #297

Open
lvjg opened this issue Jun 7, 2024 · 10 comments
Open

writer.delete_documents on tokenized fields behaves unintuitively #297

lvjg opened this issue Jun 7, 2024 · 10 comments
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed

Comments

@lvjg
Copy link

lvjg commented Jun 7, 2024

add three document,doc_id = test-1, test-2, test-3;
use writer.delete_documents(field_name="doc_id", field_value="test-1")
writer.commit()
writer.wait_merging_threads()
index.reload()

test-1 can still be found through search...

@cjrh
Copy link
Collaborator

cjrh commented Jun 7, 2024

If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/

@Fudge
Copy link
Contributor

Fudge commented Jun 10, 2024

I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["1", "2", "3"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

Output is:

Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)

@cjrh cjrh added bug Something isn't working help wanted Extra attention is needed labels Jun 10, 2024
@Fudge
Copy link
Contributor

Fudge commented Jun 10, 2024

Refreshing the searcher gives the expected behaviour:

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
**searcher = index.searcher()**

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

gives:

SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)

Maybe just user error, then.

@Fudge
Copy link
Contributor

Fudge commented Jun 10, 2024

Both I and the original reporter were bitten by term-queries not matching values with characters like - and _, so the documents weren't actually deleted.

If you change 1,2,3 to 1, 2_2, 3-3 in the example above, only the first document works as expected. Searching for the documents with a phrase query finds the others the attempted delete.

This illustrates it:

import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

writer = index.writer()

docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
    writer.add_document(
        tantivy.Document(
            doc=d
        )
    )

writer.commit()
index.reload()
searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting everything:")
for d in docs:
    print(writer.delete_documents("doc", d))

writer.commit()
index.reload()
searcher = index.searcher()

print("After delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

which gives:

Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)

@cjrh cjrh added documentation Improvements or additions to documentation and removed bug Something isn't working labels Jun 11, 2024
@Fudge
Copy link
Contributor

Fudge commented Jun 17, 2024

Just a quick follow up that changing the tokenizer to raw gives the expected behavior

schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")

and everything is deleted:

SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)

@cjrh
Copy link
Collaborator

cjrh commented Jun 17, 2024

@Fudge Thanks for looking at this ❤️

The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.

I haven't looked into this yet but I've been following your investigation. I wonder whether we can:

  • inside delete_documents(),
  • if the target field is text,
  • and the field has a tokenizer,
  • apply the tokenizer to the given values,
  • and then continue to call the underlying delete function.

Does this sound like it would fix the issue?

I've been using delete_documents in tantivy always on int fields (like a doc_id field) and that always works. So yeah sounds like the tokenizer is the issue here.

@Fudge
Copy link
Contributor

Fudge commented Jun 17, 2024

I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete version-1.1.0 would also delete version-1.10 and version-11.0 for example.

delete_documents() on tokenized text fields is not intuitive, and should come with a warning. :-)

@tommyip
Copy link

tommyip commented Aug 22, 2024

Even for integer fields by default delete_documents does not work. It turns out for all non-text field there is an indexed flag which needs to be enabled for deletion to work.

schema_builder.add_integer_field("doc_id", stored=True, indexed=True)

@cjrh
Copy link
Collaborator

cjrh commented Aug 22, 2024

there is an indexed flag which needs to be enabled for deletion to work.

Is this also how tantivy works? If so, we're not going to change the behaviour although we could certainly add documentation to warn about it.

The behaviour you show for non-indexed fields is different to what this issue is about thought, which has to do with how the field tokenizer affects match during delete. I'll edit the issue title to make that clear. The non-indexed behaviour should either be a separate issue, or if this happens also with upstream tantivy, an issue there. I suspect they will mark it as a documentation issue though.

@cjrh cjrh changed the title writer.delete_documents not work writer.delete_documents on text fields behaves unintuitively Aug 22, 2024
@cjrh cjrh changed the title writer.delete_documents on text fields behaves unintuitively writer.delete_documents on tokenized fields behaves unintuitively Aug 22, 2024
@tommyip
Copy link

tommyip commented Aug 22, 2024

You are right for non-text fields this is simply a documentation issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants