Hi authors,
Thanks for the great work. I am a little confused about eval.py. In the paper, accuracy is reported as the evaluation metric for arc_challenge, but the code actually uses match as the metric. Are these two the same? Also, when evaluating accuracy, why is there an output key in the data?
Thanks.
Hi @wtc9806, arc_challenge is a multiple-choice question-answering dataset. The current evaluation matches the predicted option against the gold label, and averaging those matches over the dataset is exactly the accuracy of the predictions. So the term accuracy used in the paper and the metric functions in the evaluation code are consistent, and both follow Self-RAG.
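To make the equivalence concrete, here is a minimal sketch (not the repository's actual eval.py) of how a match-style metric on a multiple-choice dataset reduces to accuracy. The field names "output" and "answerKey" are assumptions for illustration only.

```python
# Minimal sketch: a per-example "match" score averaged over the dataset
# is the same quantity as accuracy. Field names are illustrative.

def match(prediction: str, gold: str) -> int:
    """Return 1 if the gold option appears in the prediction, else 0."""
    return int(gold.strip().lower() in prediction.strip().lower())

def accuracy(examples: list[dict]) -> float:
    """Average the per-example match scores over the dataset."""
    scores = [match(ex["output"], ex["answerKey"]) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0

# Example: two ARC-Challenge-style items, one answered correctly.
data = [
    {"output": "The answer is B", "answerKey": "B"},
    {"output": "A", "answerKey": "C"},
]
print(accuracy(data))  # 0.5
```

Whether match checks exact equality or substring containment differs between implementations; the averaging step is what makes the reported number an accuracy either way.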