writer.delete_documents on tokenized fields behaves unintuitively #297
If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/
I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:
Output is:
Refreshing the searcher gives the expected behaviour:
gives:
Maybe just user error, then.
Both I and the original reporter were bitten by term queries not matching values containing characters (such as hyphens) that the default tokenizer splits on. This illustrates it:
which gives:
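For readers without the original snippet, the mismatch can be simulated in plain Python. The tokenizer below is a rough stand-in for tantivy's default analyzer (split on non-alphanumeric characters, then lowercase); it is an approximation for illustration, not tantivy's actual code.

```python
import re

def default_tokenize(text):
    # Rough stand-in for tantivy's default tokenizer:
    # split on non-alphanumeric characters and lowercase each token.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# The indexed tokens for the value "test-1":
tokens = default_tokenize("test-1")
print(tokens)  # ['test', '1']

# delete_documents builds a single exact term from the raw value,
# so it looks up the term "test-1", which was never indexed as one token:
print("test-1" in tokens)  # False
print("test" in tokens)    # True
```

So a delete (or exact term query) for "test-1" matches nothing, while one for "test" matches every document whose value contains that token.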
Just a quick follow-up: changing the tokenizer makes the term match, and everything is deleted:
@Fudge Thanks for looking at this ❤️ The first thing I am most interested to know is whether the deletion behaviour in tantivy-py differs from that of the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust. I haven't looked into this yet, but I've been following your investigation. I wonder whether we can:
Does this sound like it would fix the issue?
I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete
Even for integer fields by default
Is this also how tantivy works? If so, we're not going to change the behaviour, although we could certainly add documentation to warn about it. The behaviour you show for non-indexed fields is different from what this issue is about, though, which has to do with how the field tokenizer affects matching during delete. I'll edit the issue title to make that clear. The non-indexed behaviour should either be a separate issue or, if this also happens with upstream tantivy, an issue there. I suspect they will mark it as a documentation issue, though.
(The issue title was changed from "writer.delete_documents on text fields behaves unintuitively" to "writer.delete_documents on tokenized fields behaves unintuitively".)
You are right that for non-text fields this is simply a documentation issue.
Add three documents, doc_id = test-1, test-2, test-3; then:

writer.delete_documents(field_name="doc_id", field_value="test-1")
writer.commit()
writer.wait_merging_threads()
index.reload()

test-1 can still be found through search...