Clarification needed in evaluation numbers #5

saurabhkumar8112 · 2023-12-12T20:22:55Z

Hello,
Thanks for the repo and awesome work. I am requesting clarification on the evaluation results shown in the repo.

For humanEval Zero shot, GPT-4's score is reported here as 87.4 but in the Gemini Report and GPT-4 paper(and everywhere else), humanEval score for GPT-4 Zero shot is 67.

Is the "Zero-shot" prompt technique mentioned in the repo followed by Medprompt methodology? If yes, please clarify.
For MMLU is explicitly clear but not for others.

Apologies if I missed anything.

dzunglt24 · 2023-12-14T22:36:34Z

@saurabhkumar8112 look into their code, i guess it is standard zero-shot result using newest GPT-4 checkpoint.

Harsha-Nori · 2023-12-15T20:00:21Z

Yes @dzunglt24 is right -- we do have all the code we used to run on HumanEval here, and it is zero-shot with the latest GPT-4 checkpoint. The numbers reported in the OpenAI report are from many months ago, and it's likely that there have been both model improvements, and subtlety in prompting differences (even in the zero shot setting) that leads to our improved performance number here.

I believe others have found that the GPT-4 numbers were underreported in the Technical Report as well, e.g. see: https://twitter.com/OwariDa/status/1732423557802782854

Our HumanEval scripts/prompt are:

promptbase/src/promptbase/humaneval/humaneval.py

Line 16 in f43cf97

for row in ds:

saurabhkumar8112 · 2023-12-16T06:06:43Z

I see. That’s good to know.
Then that means the Gemini Report had under-reported numbers for GPT4(as the numbers were from old checkpoint)?

Divkovicalex75 · 2023-12-16T07:11:07Z

UYQA~~KAQV~~8743~~RXXR~~MUUB~~2GTT~~RFF8

Harsha-Nori · 2023-12-18T05:29:39Z

I see. That’s good to know. Then that means the Gemini Report had under-reported numbers for GPT4(as the numbers were from old checkpoint)?

I believe the Gemini Report cited + pulled the Humaneval numbers directly from OpenAI's initial GPT-4 technical report (which was released in March alongside the first version of the model). We just happened to run our own zero-shot prompts against a more recent checkpoint so we have updated numbers here.

saurabhkumar8112 changed the title ~~Clarification need in evaluation numbers~~ Clarification needed in evaluation numbers Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarification needed in evaluation numbers #5

Clarification needed in evaluation numbers #5

saurabhkumar8112 commented Dec 12, 2023

dzunglt24 commented Dec 14, 2023

Harsha-Nori commented Dec 15, 2023

saurabhkumar8112 commented Dec 16, 2023

Divkovicalex75 commented Dec 16, 2023

Harsha-Nori commented Dec 18, 2023

Clarification needed in evaluation numbers #5

Clarification needed in evaluation numbers #5

Comments

saurabhkumar8112 commented Dec 12, 2023

dzunglt24 commented Dec 14, 2023

Harsha-Nori commented Dec 15, 2023

saurabhkumar8112 commented Dec 16, 2023

Divkovicalex75 commented Dec 16, 2023

Harsha-Nori commented Dec 18, 2023