Fix: data preprocessing in inference produces passages list that doesn't match queries list #2

Open
wants to merge 1 commit into
base: main
from


matthmeyer

The function data_preprocess(file) in CRAG_Inference should produce a passages array of the same length as the queries array. However, while testing with the PopQA dataset, I found that the passages array is much longer than the queries array.

The cause is incorrect indentation: tmp_psgs is appended to passages after every line of the preprocessed file. It should only be appended when the query differs from the previous line's query, or once after the loop over the lines has finished. Correcting the indentation restores the intended behavior.
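To illustrate the difference, here is a minimal sketch and not the repository's actual code. It assumes the preprocessed file has already been parsed into (query, passage) pairs with rows for the same query adjacent to each other; the function names group_per_line and group_per_query are hypothetical and exist only for this comparison.

```python
def group_per_line(rows):
    """Hypothetical variant: tmp_psgs is appended to passages on every line."""
    queries, passages, tmp_psgs = [], [], []
    for q, p in rows:
        if not queries or q != queries[-1]:
            queries.append(q)
            tmp_psgs = []
        tmp_psgs.append(p)
        passages.append(tmp_psgs)  # runs once per line, not once per query
    return queries, passages


def group_per_query(rows):
    """Hypothetical variant: tmp_psgs is appended only when the query changes."""
    queries, passages, tmp_psgs = [], [], []
    for q, p in rows:
        if not queries or q != queries[-1]:
            if queries:
                # The previous query is complete, so store its passages.
                passages.append(tmp_psgs)
            queries.append(q)
            tmp_psgs = []
        tmp_psgs.append(p)
    if tmp_psgs:
        # Store the passages of the last query after the loop ends.
        passages.append(tmp_psgs)
    return queries, passages


# Toy input (made up for illustration): two queries, three lines.
rows = [("q1", "a"), ("q1", "b"), ("q2", "c")]
```

With this toy input, the per-line variant yields one passages entry per line while the per-query variant yields one entry per distinct consecutive query, so only the latter stays aligned with queries.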

@HuskyInSalt
Owner

Both passages and queries in data_preprocess(file) append a new item at the same time, namely when the input query differs from the previous line's query. They should therefore have the same length.

The role of tmp_psgs is to collect all passages retrieved for the same query; it is only appended when the current query changes (q != queries[-1]).
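One way to check this invariant independently is to regroup the same data with itertools.groupby, which also groups consecutive rows by key, and compare the result with the output of data_preprocess. This is only a sketch using a different technique from the loop described above: the helper name group_by_query is hypothetical, and it assumes the input is available as (query, passage) pairs.

```python
from itertools import groupby


def group_by_query(rows):
    """Group consecutive (query, passage) rows; one passages entry per query."""
    queries, passages = [], []
    # groupby starts a new group whenever the key differs from the previous
    # row's key, which mirrors the q != queries[-1] comparison.
    for q, group in groupby(rows, key=lambda row: row[0]):
        queries.append(q)
        passages.append([p for _, p in group])
    return queries, passages


# Toy input (made up for illustration): two queries, three lines.
example_rows = [("q1", "a"), ("q1", "b"), ("q2", "c")]
example_queries, example_passages = group_by_query(example_rows)
```

Because each group contributes exactly one entry to both lists, their lengths match by construction, which makes this a convenient reference when testing on a dataset.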

2 participants