The upper bound of Inform and Success rate? #20

yunhaoli1995 · 2020-04-28T07:28:04Z

I ran evaluate.py and got Matches(inform): 90.40, Success 82.3. Are these the upper bounds of the Inform and Success metrics? In some papers, the Inform and Success rates exceed 90.40 and 82.3. For example, DAMD, under the data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me.

budzianowski · 2020-05-05T20:45:11Z

Hi, can you explain in more detail which models you evaluated?

yunhaoli1995 · 2020-05-06T04:49:42Z

Sorry, I wasn't clear. I evaluated data/test_dials, the ground truth, using the script evaluate.py and got Matches(inform): 90.40, Success 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD, under the data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me.

skiingpacman · 2020-09-08T19:05:06Z

I think there is a likely upper bound of 91.6% on the Inform rate on the MultiWOZ 2.0 test set, due to a combination of how the Inform rate is implemented and errors in the belief states in the test set. This is because the metric internally uses the test set to provide the "oracle" belief state when sampling the venues that the policy presents.

In practice, evaluating the test set dialogues themselves (as per @leeyunhao), I got min 90.3%, max 90.9%, mean 90.54% +/- 0.46% (+/- 2 * STD) using 5 samples.

For more details see comment: #2 (comment)
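Because the venue sampling makes the score vary between runs, a summary over repeated runs like the one above can be produced with a small helper. This is only a sketch: `summarize_runs` is a hypothetical name, it is not part of evaluate.py, and it assumes the sample standard deviation (the comment does not say which variant was used).

```python
import statistics


def summarize_runs(scores):
    """Summarize repeated evaluation runs of a stochastic metric.

    `scores` is a list of percentages, one per run. Returns the min, max,
    mean and a 2 * STD spread (sample standard deviation; an assumption).
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "two_std": 2 * statistics.stdev(scores),
    }
```

Calling it on the per-run Inform rates from five evaluations gives the same kind of min / max / mean +/- 2 * STD report.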

comprehensiveMap · 2021-04-30T13:34:23Z

Hello, I am confused by this too. Have you solved this problem? In my opinion, the DAMD evaluation script differs from this one: DAMD counts a 'match' as 1 if the set of returned venues overlaps with the set of true venues. In this script, as you can see, the single randomly selected venue must itself be in the set of true venues.
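The two criteria described above can be contrasted with a rough sketch. The function names and venue lists are hypothetical and this is not the code from either repository; it only illustrates why the sampled criterion can never score higher than the overlap criterion on the same dialogue.

```python
import random


def match_overlap(offered_venues, true_venues):
    """Overlap-style criterion: 1 if any offered venue is a true venue."""
    return int(bool(set(offered_venues) & set(true_venues)))


def match_sampled(offered_venues, true_venues, rng):
    """Sampling-style criterion: draw one offered venue at random and
    count a match only if that particular venue is a true venue."""
    if not offered_venues:
        return 0
    choice = rng.choice(sorted(offered_venues))
    return int(choice in set(true_venues))


def sampled_match_rate(offered_venues, true_venues, trials=1000, seed=0):
    """Estimate how often the sampled criterion succeeds over many draws."""
    rng = random.Random(seed)
    hits = sum(match_sampled(offered_venues, true_venues, rng) for _ in range(trials))
    return hits / trials
```

For example, with four offered venues of which only one is a true venue, `match_overlap` always returns 1, while `match_sampled` succeeds on roughly a quarter of the draws. Scores from scripts that differ in this way are not directly comparable.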

yunhaoli1995 · 2021-05-10T14:36:56Z

It's still unsolved. But at the very least, I think models should be compared using the same evaluation script; otherwise, the comparison is meaningless.

yunhaoli1995 closed this as completed May 10, 2021

yunhaoli1995 reopened this May 10, 2021
