Improved compatibility with Robust04 #59
base: main
- Skip of empty documents without aborting the process
Thanks for this, Andrea. Perhaps we can rename the parameter to … Also, could you add a test case?
Hi @cmacdonald, I assume that the Vaswani collection has no empty documents, so a test on it would be trivial. Can I use the collection 'irds:disks45/nocr/trec-robust-2004' in the test instead? Kind regards,
Agreed, it won't.
No, it is not available on GitHub, as it needs a license. Try something like this, changing
indexer.index([next(iter) for i in range(200)])
to:
docs = [next(iter) for i in range(200)]
docs.insert(100, {'docno': 'empty', 'text' : ''}) # truly empty
docs.insert(105, {'docno': 'empty', 'text' : ' '}) # whitespace only
factory = indexer.index(docs)
self.assertEqual(200, len(factory)) # check that empty docs are indeed ignored
Done.
I fixed various things to make the test cases not give Python errors. It's now throwing …
Hi,
I am working with ColBERT and have run into problems indexing Robust04.
The code crashes when the collection contains empty documents, so I propose letting users decide the behaviour through the optional parameter allow_empty_doc.
Additionally, the Robust04 collection stores the document content under "body" rather than "text".
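To illustrate the two points above, here is a minimal, self-contained sketch of the idea. It is not the code from this pull request and does not use the indexer's API: the function `iter_nonempty_docs` and its parameters `text_attr` and `skip_empty` are hypothetical names chosen for the example, and the exact semantics of allow_empty_doc in the actual change may differ.

```python
from typing import Dict, Iterable, Iterator


def iter_nonempty_docs(
    docs: Iterable[Dict[str, str]],
    text_attr: str = "text",   # hypothetical: name of the field holding the content
    skip_empty: bool = True,   # hypothetical: drop empty docs instead of failing
) -> Iterator[Dict[str, str]]:
    """Yield only documents whose text field has non-whitespace content."""
    for doc in docs:
        text = doc.get(text_attr, "")
        if text.strip():
            yield doc
        elif not skip_empty:
            # Fail loudly when the caller does not want empty docs dropped.
            raise ValueError(f"empty document: {doc.get('docno')!r}")
        # Otherwise the empty document is silently skipped.


# Toy documents using "body" as the content field, as in Robust04.
docs = [
    {"docno": "d1", "body": "information retrieval"},
    {"docno": "empty1", "body": ""},    # truly empty
    {"docno": "empty2", "body": " "},   # whitespace only
    {"docno": "d2", "body": "dense indexing"},
]
kept = list(iter_nonempty_docs(docs, text_attr="body"))
```

Filtering in a generator keeps the sketch compatible with streaming collections, since no document is held in memory beyond the one being checked.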
I hope my pull request helps. I remain available for further changes and clarifications.
Greetings,
Andrea