Fix: data preprocessing in inference produces passages list that doesn't match queries list #2

matthmeyer · 2024-02-25T19:53:59Z

The function data_preprocess(file) in CRAG_Inference should produce a passages array of the same length as the queries array. However, while testing with the Popqa dataset, I found that the passages array is much longer than the queries array.

The cause is incorrect indentation: tmp_psgs is appended to passages after every line of the preprocessed file. It should be appended only when the query differs from the previous line's query, or once after the loop has finished. Correcting the indentation restores the intended behavior.
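To illustrate the difference, here is a minimal sketch. It is not the repository's code: the function names group_buggy and group_fixed are hypothetical, and the line format (one query and one passage per line, joined by the SEP marker) is an assumption made only for this example.

```python
SEP = " [SEP] "  # assumed separator; the real file format may differ


def group_buggy(lines):
    """Reproduces the reported symptom: one passages entry per input line."""
    queries, passages, tmp_psgs = [], [], []
    for line in lines:
        q, p = line.strip().split(SEP, 1)
        if not queries or q != queries[-1]:
            queries.append(q)
            tmp_psgs = []
        tmp_psgs.append(p)
        passages.append(tmp_psgs)  # wrong level: executed for every line
    return queries, passages


def group_fixed(lines):
    """Appends tmp_psgs only on a query change and once after the loop."""
    queries, passages, tmp_psgs = [], [], []
    for line in lines:
        q, p = line.strip().split(SEP, 1)
        if not queries:
            queries.append(q)
        elif q != queries[-1]:
            queries.append(q)
            passages.append(tmp_psgs)  # flush passages of the previous query
            tmp_psgs = []
        tmp_psgs.append(p)
    if tmp_psgs:
        passages.append(tmp_psgs)  # flush passages of the last query
    return queries, passages
```

With three input lines covering two distinct queries, group_buggy returns three passages entries for two queries, while group_fixed returns two of each.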

HuskyInSalt · 2024-02-28T15:19:06Z

Both passages and queries in data_preprocess(file) append a new item at the same time, namely when the input query differs from the previous line's query. Thus they should have the same length.

The role of tmp_psgs is to collect all passages retrieved for the same query; it is appended to passages only when the current query changes (q != queries[-1]).
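The grouping rule described here (start a new group whenever the query differs from the previous one) can also be sketched with itertools.groupby from the standard library, which groups consecutive items that share a key. This is an alternative formulation, not the code in data_preprocess; the helper name group_consecutive and the (query, passage) pair input are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def group_consecutive(pairs):
    """Group (query, passage) pairs by runs of consecutive equal queries."""
    queries, passages = [], []
    # groupby starts a new group each time the key (the query) changes,
    # which mirrors the q != queries[-1] check.
    for q, group in groupby(pairs, key=itemgetter(0)):
        queries.append(q)
        passages.append([p for _, p in group])
    return queries, passages
```

Because every group contributes exactly one query and one passage list, the two returned lists have the same length by construction. Note that a query that reappears after a different query starts a new group, just as it would with the q != queries[-1] comparison.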

Fix indent to only append passages at the end of loop

6993dcc
