Issue with MVBench Evaluation #227
I apologize for not noticing your issue sooner, and I appreciate you bringing it to my attention. I agree that the current evaluation method needs improvement: relying solely on the first segment for correctness verification can lead to inaccuracies, especially when a standalone closing parenthesis is accepted as correct. Adding a step that verifies the letter of the answer option seems like a crucial enhancement and would help ensure that predictions are valid.

I also appreciate your observation about the extra space in the answer prompt and its impact on performance. It highlights the importance of refining our model and evaluation criteria to avoid such pitfalls. I have noticed the same issue in the MVBench of lmms-eval and have made corresponding modifications, which you can check here: MVBench in lmms-eval.

Thank you for your insights! Here's a refined version of the function:

```python
def check_ans(pred, gt):
    flag = False
    # Split prediction and ground truth into an option token and the remaining content
    pred_list = pred.lower().split(' ')
    pred_option, pred_content = pred_list[0], ' '.join(pred_list[1:])
    gt_list = gt.lower().split(' ')
    gt_option, gt_content = gt_list[0], ' '.join(gt_list[1:])
    # Remove trailing period from ground truth content if present
    if gt_content.endswith('.'):
        gt_content = gt_content[:-1]
    # Clean options by removing punctuation characters
    pred_option = pred_option.replace('.', '').replace('(', '').replace(')', '')
    gt_option = gt_option.replace('.', '').replace('(', '').replace(')', '')
    # Additional check: if pred_option does not contain any answer letter a-e, return False
    if not any(char in pred_option for char in 'abcde'):
        return False
    # Check for equality or inclusion
    if pred_option == gt_option:
        flag = True
    elif gt_option in pred_option:
        flag = True
    return flag
```
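The key addition is the option-letter guard. As a minimal standalone sketch of that idea (the helper name `has_option_letter` is mine, not part of the repository):

```python
def has_option_letter(pred_option):
    # Strip the punctuation that typically wraps an option token, e.g. "(A)." -> "a"
    cleaned = pred_option.lower().replace('.', '').replace('(', '').replace(')', '')
    # Accept the prediction only if an answer letter a-e is actually present
    return any(char in cleaned for char in 'abcde')

print(has_option_letter('(A)'))  # True: a real option letter
print(has_option_letter(')'))    # False: a bare parenthesis no longer passes
```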
It seems that there is an issue with the evaluation method in MVBench. Currently, correctness is verified by splitting the prediction and comparing only its first segment (word) against the correct answer. As a result, any prediction that consists of just a closing parenthesis “)” is treated as correct.
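To illustrate the failure mode, here is a hypothetical simplification of a substring-based check like the one described (a sketch, not the repository's exact code):

```python
def loose_match(pred, gt):
    # Compare only the first whitespace-separated segment of each string
    pred_option = pred.lower().split(' ')[0]
    gt_option = gt.lower().split(' ')[0]  # e.g. "(a)"
    # A two-way substring test: any fragment of the option token matches
    return pred_option in gt_option or gt_option in pred_option

print(loose_match(')', '(a) a red car'))  # True: a bare ")" is scored as correct
```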
I believe it is essential to add a step that verifies whether the letter of the answer option is correct.
While running your code, I mistakenly added an extra space in the answer prompt, making it “best option: ( ”, and noticed a significant increase in performance.
It would be great if the evaluation method could be made more robust!