Original IDs of the retrieved documents #13

Open
WerLaj opened this issue Aug 8, 2022 · 5 comments

Comments

@WerLaj

WerLaj commented Aug 8, 2022

Hi,

I am trying to use ANCEIndexer to index several datasets at once. I created a single iterator of dicts for three collections using itertools.chain:

corpus_iter = itertools.chain(
    wapo_generator,
    pt.get_dataset("irds:kilt").get_corpus_iter(verbose=False),
    pt.get_dataset("irds:msmarco-passage").get_corpus_iter(verbose=False),
)

where wapo_generator is my own iterable of dicts with the keys "docno", "docid" (the original ID of the document in the collection), and "text". The index is created and I am able to perform a search. Now I would like to get the original IDs of the retrieved documents (the ones from the original collections, e.g. "MARCO_D820886"). Is there any way to do that?

@cmacdonald @seanmacavaney @tonellotto @Xiao0728

@seanmacavaney
Contributor

The docno field will be returned -- you just need to set that value to the original ID. (docid is ignored by the indexer, only docno and text are used.)
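As a minimal sketch of that suggestion, a small wrapper generator can overwrite docno with the original ID before the documents reach the indexer. The helper name with_original_docno, its id_field parameter, and the WAPO_example_* IDs are made up for illustration; the lists stand in for the real generators.

```python
import itertools


def with_original_docno(docs, id_field="docid"):
    """Yield new dicts whose "docno" is taken from `id_field`.

    Hypothetical helper: `id_field` names whichever key holds the
    original collection ID in the incoming dicts.
    """
    for doc in docs:
        yield {"docno": str(doc[id_field]), "text": doc["text"]}


# Stand-in for wapo_generator; the IDs here are invented placeholders.
wapo_docs = [
    {"docno": "0", "docid": "WAPO_example_1", "text": "first article"},
    {"docno": "1", "docid": "WAPO_example_2", "text": "second article"},
]
# Stand-in for a collection whose docno already is the original ID.
other_docs = [{"docno": "MARCO_D820886", "text": "a passage"}]

corpus_iter = itertools.chain(
    with_original_docno(wapo_docs),
    with_original_docno(other_docs, id_field="docno"),
)

# These are the docno values the indexer would see and later return.
indexed_docnos = [doc["docno"] for doc in corpus_iter]
```

Because only docno and text are used by the indexer, the sketch drops every other key.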

@WerLaj
Author

WerLaj commented Aug 8, 2022

Both the docno and docid fields are returned by search, but the IDs don't correspond to the original IDs in the collections used to create the index. I can set the value of docno to the original IDs in my own generator for the WaPo collection. But what about the datasets for which I'm using the IR Datasets API?

@cmacdonald
Contributor

docid is internal information - it's 0..N-1 for an index of N documents.

Why not add a post-retrieval transformer that gets the additional metadata you need from IRDS again? pt.text.get_text() can retrieve arbitrary metadata from an IRDS dataset very quickly. You can see us doing a similar thing in Listing 1 of https://trec.nist.gov/pubs/trec30/papers/uogTr-DL.pdf

@WerLaj
Author

WerLaj commented Aug 8, 2022

I understand that you can use pt.text.get_text() for an IRDS dataset. But what about a dataset that is not available via ir_datasets?

@cmacdonald
Contributor

Keep the mapping you want in a dataframe and join it for each query?
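A rough sketch of that idea with plain pandas, assuming the search results arrive as a dataframe with a docno column. The id_mapping frame, the original_id column, the add_original_ids helper, and the WAPO_example_* IDs are all hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Hypothetical mapping from the indexed docno to the original collection ID.
id_mapping = pd.DataFrame(
    {
        "docno": ["0", "1", "2"],
        "original_id": ["WAPO_example_1", "WAPO_example_2", "MARCO_D820886"],
    }
)


def add_original_ids(results, mapping=id_mapping):
    """Left-join the mapping onto a results frame on "docno".

    A left join keeps every retrieved row, in its original order, even
    when a docno has no entry in the mapping (original_id is then NaN).
    """
    return results.merge(mapping, on="docno", how="left")


# Stand-in for the results of one query.
results = pd.DataFrame(
    {
        "qid": ["q1", "q1"],
        "docno": ["2", "0"],
        "score": [12.5, 9.1],
        "rank": [0, 1],
    }
)

enriched = add_original_ids(results)
```

In a PyTerrier pipeline, a function like this could probably be applied after retrieval via pt.apply.generic, though that wiring is an assumption and is not shown here.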


3 participants