Improved compatibility with Robust04 #59

Open
wants to merge 7 commits into
base: main

@andreabac3 andreabac3 commented Feb 10, 2023

Hi,
I am working with ColBERT and have run into some issues when indexing Robust04.
I noticed that the code crashes if the collection contains empty documents, so I propose letting users decide the behaviour through an optional parameter, allow_empty_doc.
Additionally, the Robust04 collection stores document content in a "body" field rather than "text".
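
To illustrate the two points above, here is a minimal sketch of a pre-processing generator that drops empty documents and reads content from a configurable field. It is not the code in this pull request: the function name `prepare_docs` and the parameters `text_field` and `skip_empty` are hypothetical and chosen only for the example.

```python
from typing import Dict, Iterable, Iterator


def prepare_docs(
    docs: Iterable[Dict[str, str]],
    text_field: str = "text",   # hypothetical parameter: which field holds the content
    skip_empty: bool = True,    # hypothetical parameter: drop empty docs instead of failing
) -> Iterator[Dict[str, str]]:
    """Yield documents with their content under 'text', optionally dropping empty ones."""
    for doc in docs:
        content = doc.get(text_field, "")
        if not content.strip():
            if skip_empty:
                continue  # silently drop empty or whitespace-only documents
            raise ValueError(f"Document {doc.get('docno')!r} has no text")
        yield {"docno": doc["docno"], "text": content}


# Toy documents that mimic a collection storing content under "body".
robust_like = [
    {"docno": "d1", "body": "some text"},
    {"docno": "d2", "body": "   "},
    {"docno": "d3", "body": "more text"},
]
kept = list(prepare_docs(robust_like, text_field="body"))
print([d["docno"] for d in kept])  # prints ['d1', 'd3']
```

With `skip_empty=False` the same input raises a `ValueError` at the first empty document, which mirrors the "let users decide" behaviour described above.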

I hope my pull request helps. I remain available for further changes and clarifications.

Greetings,
Andrea

- Skip of empty documents without aborting the process
@cmacdonald
Collaborator

Thanks for this Andrea. Perhaps we can rename the parameter to skip_empty_docs?

Also, could you add a test case?

@andreabac3
Author

Hi @cmacdonald,
Done.

I assume that the Vaswani collection has no empty documents, and therefore the test is trivial.

Can I put the following collection 'irds:disks45/nocr/trec-robust-2004' into the test?

Kind regards,
Andrea

@cmacdonald
Collaborator

cmacdonald commented Feb 10, 2023

> does vaswani have any empty documents?

Agreed, it won't.

> Can I put the following collection 'irds:disks45/nocr/trec-robust-2004' into the test?

No, that collection is not available on GitHub, as it needs a license.

Try something like this:

indexer.index([next(iter) for i in range(200)])

->

docs = [next(iter) for i in range(200)]
# insert two documents that should be skipped; distinct docnos avoid duplicate ids
docs.insert(100, {'docno': 'empty1', 'text': ''})   # truly empty
docs.insert(105, {'docno': 'empty2', 'text': ' '})  # whitespace only
factory = indexer.index(docs)
self.assertEqual(200, len(factory))  # check that the two empty docs are indeed ignored
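
The shape of this test can be tried without the real indexer or the Vaswani collection. The sketch below uses a hypothetical stand-in, `FakeIndexer`, which simply returns the list of kept documents; the real indexer returns a factory object instead, so only the structure of the test carries over.

```python
import unittest


class FakeIndexer:
    """Hypothetical stand-in for the real indexer; it only filters documents."""

    def __init__(self, skip_empty_docs=True):
        self.skip_empty_docs = skip_empty_docs

    def index(self, docs):
        kept = [d for d in docs if d["text"].strip()]
        if not self.skip_empty_docs and len(kept) != len(docs):
            raise ValueError("empty document encountered")
        return kept  # assumption: len() of the result is the number of indexed docs


class TestSkipEmptyDocs(unittest.TestCase):
    def setUp(self):
        # 200 synthetic non-empty documents plus two that should be skipped
        self.docs = [{"docno": f"d{i}", "text": f"document {i}"} for i in range(200)]
        self.docs.insert(100, {"docno": "empty1", "text": ""})   # truly empty
        self.docs.insert(105, {"docno": "empty2", "text": " "})  # whitespace only

    def test_empty_docs_are_skipped(self):
        self.assertEqual(200, len(FakeIndexer().index(self.docs)))

    def test_empty_docs_raise_when_not_skipped(self):
        with self.assertRaises(ValueError):
            FakeIndexer(skip_empty_docs=False).index(self.docs)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSkipEmptyDocs)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Testing both settings of the flag keeps the old failing behaviour covered as well as the new skipping behaviour.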

@andreabac3
Author

Done,
thank you for your support! :)

@cmacdonald
Collaborator

I fixed various things so that the test cases no longer raise Python errors. It's now failing with "Error: Process completed with exit code 143." I'll try to look into this later.
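
As a side note on reading that exit code: on POSIX systems a shell conventionally reports 128 plus the signal number when a process is killed by a signal, so 143 corresponds to signal 15. The short sketch below decodes it; it says nothing about why the CI job was terminated, which the thread does not state.

```python
import signal

exit_code = 143

# Shell convention: exit status = 128 + signal number for signal-terminated processes.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # prints SIGTERM
```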
