We use the EleutherAI lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) to evaluate text-davinci-003.

We install it with `pip install lm-eval`. It pulls in a lot of packages (and it needs a bunch of dependencies that aren't installed automatically, like pytest and a few others). It also takes quite a while (about an hour?) to download the test data, even on a fast connection.
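For reference, a minimal install sketch (the exact set of missing dependencies may vary by harness version; pytest is the one called out above):

```sh
pip install lm-eval
# Some dependencies aren't installed automatically; pytest is one of them.
pip install pytest
```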

We run it in a shell script like so:

```sh
python main.py \
    --model gpt3 \
    --model_args engine=text-davinci-003 \
    --tasks lambada_openai,lambada_standard,hellaswag,winogrande,piqa,coqa \
    --check_integrity
```
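
The harness calls the OpenAI API directly, so the API key has to be in the environment first. A minimal sketch, assuming the harness's gpt3 model reads the key from `OPENAI_API_SECRET_KEY` (the variable name may differ across harness versions; check yours):

```sh
# Assumption: the gpt3 model reads the key from this variable;
# newer harness versions may use a different name.
export OPENAI_API_SECRET_KEY="sk-..."   # your OpenAI API key
```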

While gpt-3.5-turbo is supposed to perform on par with Instruct Davinci, it is only usable through the Chat API. Its cost is $0.002 / 1K tokens, while Davinci is $0.0200 / 1K tokens. At these prices, running the above eval cost ~$90.77 (~4.5M tokens) and took about an hour.
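A quick sanity check on that number, using the prices above (the token count is approximate):

```sh
# ~4.5M tokens at Davinci's $0.0200 / 1K tokens; at gpt-3.5-turbo's
# $0.002 / 1K, the same token count would come to about $9.
awk 'BEGIN { printf "$%.2f\n", 4.5e6 / 1000 * 0.02 }'   # ≈ $90.00
```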

Results:

`gpt3 (engine=text-davinci-003), limit: None, provide_description: False, num_fewshot: 0, batch_size: None`

| Task             | Version | Metric   | Value  | Stderr   |
|------------------|---------|----------|--------|----------|
| winogrande       | 0       | acc      | 0.7553 | ± 0.0121 |
| lambada_openai   | 0       | ppl      | 2.6089 | ± 0.0576 |
|                  |         | acc      | 0.7444 | ± 0.0061 |
| piqa             | 0       | acc      | 0.8319 | ± 0.0087 |
|                  |         | acc_norm | 0.8384 | ± 0.0086 |
| hellaswag        | 0       | acc      | 0.6780 | ± 0.0047 |
|                  |         | acc_norm | 0.8333 | ± 0.0037 |
| lambada_standard | 0       | ppl      | 3.0489 | ± 0.0754 |
|                  |         | acc      | 0.7075 | ± 0.0063 |
| coqa             | 1       | f1       | 0.7589 | ± 0.0139 |
|                  |         | em       | 0.5843 | ± 0.0196 |

- Full JSON at the bottom
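If the JSON at the bottom is saved to a file (`results.json` here is a hypothetical name; the harness can also write it out via `--output_path`), the headline metrics can be pulled back out with jq:

```sh
# Print each task and its headline metric (acc where present, else f1).
# "results.json" is a hypothetical filename for the JSON shown below.
jq -r '.results | to_entries[] | "\(.key)\t\(.value.acc // .value.f1)"' results.json
```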

Fabrice Bellard has run this set of evals on many of the models (and quantizations!) that we'd want to compare with: https://bellard.org/ts_server/ Here's where text-davinci-003 sits:

| Model | RAM (GB) | lambada (ppl) | lambada (acc) | hellaswag (acc_norm) | winogrande (acc) | piqa (acc) | coqa (f1) | average |
|---|---|---|---|---|---|---|---|---|
| llama_65B_q4 | 39 | 2.76 | 78.50% | 83.90% | 76.60% | 81.40% | 83.20% | 80.70% |
| llama_30B_q8 | 36 | 2.853 | 77.70% | 82.70% | 76.30% | 80.30% | 80.40% | 79.50% |
| llama_30B_q4 | 20 | 2.877 | 77.50% | 82.40% | 75.70% | 80.20% | 80.20% | 79.20% |
| **text-davinci-003** | n/a | 3.0489 | 70.75% | 83.33% | 75.53% | 83.19% | 75.89% | 77.74% |
| llama_13B_q8 | 15 | 3.178 | 76.50% | 79.10% | 73.20% | 79.10% | 77.10% | 77.00% |
| llama_13B_q4 | 8 | 3.13 | 77.10% | 78.60% | 72.20% | 78.30% | 77.80% | 76.80% |
| flan_t5_xxl_q8 | 13 | 3.049 | 77.80% | 72.10% | 75.10% | 77.80% | 73.10% | 75.20% |
| llama_7B | 14 | 3.463 | 73.60% | 76.20% | 70.40% | 78.10% | 75.40% | 74.70% |
| llama_7B_q8 | 8 | 3.453 | 73.70% | 76.10% | 70.20% | 78.00% | 75.50% | 74.70% |
| llama_7B_q4 | 5 | 3.549 | 73.20% | 75.50% | 70.40% | 78.00% | 74.70% | 74.40% |
| flan_t5_xxl_q4 | 7 | 3.01 | 77.70% | 71.50% | 73.40% | 77.60% | 71.80% | 74.40% |
| opt_66B_q4 | 40 | 3.308 | 73.40% | 74.40% | 68.40% | 78.50% | 75.00% | 73.90% |
| opt_30B_q8 | 34 | 3.628 | 71.60% | 72.30% | 68.20% | 77.70% | 71.40% | 72.30% |
| gptneox_20B_q8 | 23 | 3.659 | 72.60% | 71.30% | 65.80% | 77.30% | 72.90% | 72.00% |
| gptneox_20B | 43 | 3.657 | 72.60% | 71.40% | 65.50% | 77.50% | 73.30% | 72.00% |
| fairseq_gpt_13B_bf4 | 9 | 3.646 | 71.20% | 72.50% | 67.60% | 77.40% | 70.60% | 71.90% |
| fairseq_gpt_13B | 27 | 3.567 | 71.90% | 72.70% | 67.50% | 77.60% | 70.10% | 71.90% |
| fairseq_gpt_13B_bf8 | 15 | 3.565 | 71.80% | 72.70% | 67.20% | 77.70% | 70.00% | 71.90% |
| opt_30B_q4 | 19 | 3.656 | 71.50% | 72.10% | 68.00% | 77.40% | 69.90% | 71.80% |
| gptneox_20B_q4 | 13 | 3.711 | 72.00% | 69.30% | 64.80% | 76.70% | 70.80% | 70.70% |
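The average column looks like the plain mean of the five accuracy-style columns, with perplexity excluded; checking that against the text-davinci-003 row:

```sh
# lambada (acc) + hellaswag + winogrande + piqa + coqa for text-davinci-003.
awk 'BEGIN { printf "%.2f%%\n", (70.75 + 83.33 + 75.53 + 83.19 + 75.89) / 5 }'   # 77.74%
```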

We can add in other sources of lm-eval results. For example, claude-instant-v1 should perform at about the same level as gpt-3.5-turbo.


Full JSON results:

```json
{
  "results": {
    "winogrande": {
      "acc": 0.755327545382794,
      "acc_stderr": 0.012082125654159738
    },
    "lambada_openai": {
      "ppl": 2.6088786186020605,
      "ppl_stderr": 0.05758481395456482,
      "acc": 0.7444207257908014,
      "acc_stderr": 0.006076928367674913
    },
    "piqa": {
      "acc": 0.8318824809575626,
      "acc_stderr": 0.008725350811241683,
      "acc_norm": 0.838411316648531,
      "acc_norm_stderr": 0.008587751299447739
    },
    "hellaswag": {
      "acc": 0.6779525990838479,
      "acc_stderr": 0.004663060828376781,
      "acc_norm": 0.8333001394144592,
      "acc_norm_stderr": 0.003719459738399207
    },
    "lambada_standard": {
      "ppl": 3.0489055102298397,
      "ppl_stderr": 0.07538504687766555,
      "acc": 0.7075490005821852,
      "acc_stderr": 0.006337484186544313
    },
    "coqa": {
      "f1": 0.7588553465756964,
      "f1_stderr": 0.013933087384831054,
      "em": 0.5843333333333334,
      "em_stderr": 0.01957429935602932
    }
  },
  "versions": {
    "winogrande": 0,
    "lambada_openai": 0,
    "piqa": 0,
    "coqa": 1,
    "hellaswag": 0,
    "lambada_standard": 0
  },
  "config": {
    "model": "gpt3",
    "model_args": "engine=text-davinci-003",
    "num_fewshot": 0,
    "batch_size": null,
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
```