Original IDs of the retrieved documents #13

WerLaj · 2022-08-08T09:35:24Z

Hi,

I am trying to use ANCEIndexer to index several datasets at once. I created a single iterator of dicts for three collections using itertools.chain:

corpus_iter = itertools.chain(
    wapo_generator,
    pt.get_dataset("irds:kilt").get_corpus_iter(verbose=False),
    pt.get_dataset("irds:msmarco-passage").get_corpus_iter(verbose=False),
)

where wapo_generator is my own iterable of dicts that have the keys "docno", "docid" (which is the original ID of the document in the collection), and "text". The index is created and I'm able to perform a search. Now, I would like to get the original IDs of the retrieved documents (the ones from the original collections, e.g. "MARCO_D820886"). Is there any way to do that?

@cmacdonald @seanmacavaney @tonellotto @Xiao0728

seanmacavaney · 2022-08-08T11:58:03Z

The docno field will be returned -- you just need to set that value to the original ID. (docid is ignored by the indexer, only docno and text are used.)
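
A minimal sketch of that idea, assuming each document is a plain dict as described in the question. The helper name `tag_docnos`, the `prefix` argument, and the sample documents are hypothetical and only for illustration; the prefix is an extra assumption, added in case IDs from different collections could collide once chained together.

```python
import itertools

def tag_docnos(docs, prefix, id_field="docno"):
    """Yield new dicts whose docno is '<prefix>:<original id>'.

    id_field names the key that holds the original ID in the input dicts.
    Only docno and text are kept, since those are the fields the indexer uses.
    """
    for doc in docs:
        yield {"docno": f"{prefix}:{doc[id_field]}", "text": doc["text"]}

# Tiny stand-ins for the real corpus iterators (hypothetical data).
wapo_docs = [{"docno": "0", "docid": "WAPO_abc", "text": "first article"}]
marco_docs = [{"docno": "0", "text": "first passage"}]

corpus_iter = itertools.chain(
    tag_docnos(wapo_docs, "wapo", id_field="docid"),  # original ID lives in docid
    tag_docnos(marco_docs, "msmarco-passage"),        # original ID is already docno
)
tagged = list(corpus_iter)
```

With this wrapping, the docno returned at search time carries both the collection name and the original ID, so the two parts can be split apart again on the first ":".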

WerLaj · 2022-08-08T12:03:50Z

Both the docno and docid fields are returned by search, but the IDs don't correspond to the original IDs in the collections used for creating the index. I can set the value of docno using the original IDs in my own generator for the WaPo collection. But what about the datasets for which I'm using the IR Datasets API?

cmacdonald · 2022-08-08T12:06:59Z

docid is internal information: it's 0..N-1 for an index of N documents.

Why not add a post-retrieval transformer that gets the additional metadata you need from IRDS again? pt.text.get_text() can retrieve arbitrary metadata from an IRDS dataset very quickly. You can see us doing a similar thing in Listing 1 of https://trec.nist.gov/pubs/trec30/papers/uogTr-DL.pdf

WerLaj · 2022-08-08T12:41:19Z

I understand that you can use pt.text.get_text() for an IRDS dataset. But what about a dataset that is not available via ir_datasets?

cmacdonald · 2022-08-08T12:42:27Z

Keep the mapping you want in a dataframe and join it for each query?
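
A rough sketch of the join, assuming the search results arrive as a pandas DataFrame with a docno column. The `id_map` frame, the `original_id` column name, the helper `add_original_ids`, and all sample values are hypothetical.

```python
import pandas as pd

# Hypothetical mapping from the docno used at indexing time to the original ID.
id_map = pd.DataFrame({
    "docno": ["w0", "w1"],
    "original_id": ["WAPO_a", "WAPO_b"],
})

# Stand-in for a results frame returned by a retrieval step.
results = pd.DataFrame({
    "qid": ["q1", "q1"],
    "docno": ["w1", "w0"],
    "score": [2.5, 1.0],
})

def add_original_ids(res, mapping):
    """Left-join the mapping onto the results, keeping every result row."""
    return res.merge(mapping, on="docno", how="left")

joined = add_original_ids(results, id_map)
```

A left join keeps results whose docno has no entry in the mapping (their `original_id` is left as NaN), which makes missing mappings easy to spot. Inside a PyTerrier pipeline, a function like this could likely be wrapped as a post-retrieval step, for example with pt.apply.generic.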
