Fix: data preprocessing in inference produces passages list that doesn't match queries list #2
The function `data_preprocess(file)` in CRAG_Inference should produce a passages array of the same length as the queries array. However, while testing with the PopQA dataset, I found that the passages array is much longer than the queries array. The cause is incorrect indentation: `tmp_psgs` is appended to `passages` after every line in the preprocessed file. Instead, `tmp_psgs` should only be appended when the query differs from the previous line's query, or once at the end of the loop over the lines. Correcting the indentation restores the intended behavior.
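
For illustration, the intended grouping behavior can be sketched roughly as below. This is not the actual `data_preprocess` code: the function name `group_passages`, the tab separator, and the one-query-and-passage-per-line input format are assumptions made for the example. Only the names `tmp_psgs`, `passages`, and `queries` come from the description above.

```python
def group_passages(lines, sep="\t"):
    """Group consecutive lines that share a query (hypothetical helper).

    Assumes each line looks like "<query><sep><passage>"; the real
    preprocessed file format may differ.
    """
    queries = []
    passages = []
    tmp_psgs = []
    last_query = None

    for line in lines:
        query, passage = line.rstrip("\n").split(sep, 1)

        if query != last_query:
            # Flush the previous group only when the query changes,
            # not after every line (appending per line was the bug).
            if last_query is not None:
                passages.append(tmp_psgs)
                tmp_psgs = []
            queries.append(query)
            last_query = query

        tmp_psgs.append(passage)

    # Flush the final group once the loop has finished.
    if last_query is not None:
        passages.append(tmp_psgs)

    return queries, passages


sample = ["q1\tp1", "q1\tp2", "q2\tp3"]
queries, passages = group_passages(sample)
assert len(queries) == len(passages)
```

With the append placed inside the query-change branch and once after the loop, each query ends up with exactly one list of passages, so the two arrays stay aligned.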