The upper bound of Inform and Success rate? #20
Comments
Hi, can you explain in more detail which models you evaluated?
Sorry, I didn't explain clearly. I evaluated the ground-truth dialogues in data/test_dials with the script evaluate.py and got Matches(inform): 90.40, Success 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD, under the data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me.
I think there is a likely upper bound on Inform Rate of 91.6% on the MultiWOZ 2.0 test set, due to a combination of how Inform Rate is implemented and errors in the test-set belief states. The metric internally uses the test set to provide the "oracle" belief state when sampling the venues that the policy presents. In practice, evaluating the test-set dialogues themselves (as per @leeyunhao), I got min 90.3%, max 90.9%, mean 90.54% +/- 0.46% (+/- 2 * STD) using 5 samples. For more details see comment: #2 (comment)
Hello, I am confused by this too. Have you solved this problem? In my opinion, the DAMD evaluation script differs from this one: DAMD counts 'match' as 1 if the set of returned venues overlaps with the set of true venues. In this script, as you can see, a single venue is selected at random, and that venue must be in the set of true venues.
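To make the difference concrete, here is a minimal sketch of the two criteria as I understand them from the description above. It is not the code of either evaluation script, and the names `overlap_match`, `sampled_match` and `match_rate` are hypothetical, chosen only for illustration.

```python
import random


def overlap_match(returned_venues, true_venues):
    """Lenient criterion: match if ANY returned venue is a true venue."""
    return len(set(returned_venues) & set(true_venues)) > 0


def sampled_match(returned_venues, true_venues, rng):
    """Strict criterion: sample ONE returned venue; it must be a true venue."""
    if not returned_venues:
        return False
    # Sort first so the draw depends only on the rng state, not on set order.
    chosen = rng.choice(sorted(returned_venues))
    return chosen in set(true_venues)


def match_rate(returned_venues, true_venues, trials=10000, seed=0):
    """Fraction of trials in which the sampled venue is a true venue."""
    rng = random.Random(seed)
    hits = sum(sampled_match(returned_venues, true_venues, rng) for _ in range(trials))
    return hits / trials


# Hypothetical example: four venues returned, only one of them is correct.
returned = ["venue_a", "venue_b", "venue_c", "venue_d"]
true = ["venue_a"]

lenient = overlap_match(returned, true)  # always a match under the overlap rule
strict_rate = match_rate(returned, true)  # roughly 1 in 4 under the sampling rule
```

Under these assumptions the overlap rule always scores this dialogue as a match, while the sampling rule scores it only when the correct venue happens to be drawn. Numbers from the two scripts are therefore not directly comparable.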
It's still unsolved. But at the very least, I think models should be compared using the same evaluation script; otherwise, the comparison is meaningless.
I ran evaluate.py and got Matches(inform): 90.40, Success 82.3. Are these the upper bounds of the Inform and Success metrics? In some papers, the Inform and Success rates exceed 90.40 and 82.3. DAMD, under the data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me.