
Performance of WizardCoder on HumanEvalFixDocs #18

Closed
awasthiabhijeet opened this issue Sep 4, 2023 · 8 comments

Comments

@awasthiabhijeet

Hi @Muennighoff,

Thank you for releasing many useful resources.

QQ: Do you know the accuracy of WizardCoder-15.5B on HumanEvalFixDocs? (i.e., where would WizardCoder stand in Table 12 of your paper?)

@Muennighoff
Collaborator

We didn't run that. I think it would be somewhere between OctoCoder & GPT-4.
You can run it easily like below, though:

accelerate launch main.py \
--model WizardCoder-15.5B \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt wizardcoder \
--save_generations_path generations_humanevalfixdocspython_wizardcoder.json \
--metric_output_path evaluation_humanevalfixdocspython_wizardcoder.json \
--max_length_generation 2048 \
--precision bf16
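
For context, the --prompt wizardcoder option formats each task with WizardCoder's Alpaca-style instruction template. A minimal sketch of what that wrapping presumably looks like (the exact template lives in the harness; buggy_function here is a hypothetical placeholder):

# Sketch of the Alpaca-style instruction template WizardCoder was tuned
# on; --prompt wizardcoder presumably wraps each HumanEvalFixDocs task
# roughly like this.
WIZARDCODER_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

buggy_function = "def has_close_elements(numbers, threshold): ..."
prompt = WIZARDCODER_TEMPLATE.format(
    instruction="Fix bugs in has_close_elements.\n" + buggy_function
)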

@awasthiabhijeet
Author

Thanks!

@awasthiabhijeet
Author

I observe an accuracy of 51.2 with WizardCoder (thus lower than OctoCoder, and also lower than WizardCoder's performance on HumanEval).

PS: I'm reporting the pass@1 score with greedy decoding.

As a sanity check, would it be possible for you to confirm if you are observing the same performance?
CC: @Muennighoff

@awasthiabhijeet
Author

With greedy decoding, StarCoder gives a pass@1 of 61.6.

@Muennighoff
Collaborator

> I observe an accuracy of 51.2 with WizardCoder (thus lower than OctoCoder, and also lower than WizardCoder's performance on HumanEval).
>
> PS: I'm reporting the pass@1 score with greedy decoding.
>
> As a sanity check, would it be possible for you to confirm if you are observing the same performance? CC: @Muennighoff

Using --temperature 0.2 --n_samples 20 would likely increase the score a bit.

For StarCoder, which prompt are you using? Surprised it is that high. Would be curious to know what you get with --temperature 0.2 --n_samples 20.
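
For reference on why sampling can shift the number: pass@k here is presumably the standard unbiased estimator from the Codex paper (Chen et al., 2021), so with 20 samples pass@1 averages per-problem success rates rather than scoring a single greedy completion. A minimal sketch:

import math

# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), for n samples with c correct,
# computed as a numerically stable running product.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With n_samples 20, pass@1 is the mean per-problem success rate:
print(pass_at_k(n=20, c=12, k=1))   # 0.6
print(pass_at_k(n=20, c=5, k=10))   # ~0.98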

@awasthiabhijeet
Author

awasthiabhijeet commented Sep 6, 2023

I am using the starcodercommit prompt when I get 61.6:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16

Will try to run with 20 samples and temperature 0.2.

CC: @Muennighoff
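
For context, the starcodercommit prompt presumably leans on the git-commit special tokens in StarCoder's vocabulary, roughly along these lines (a sketch under that assumption, not the harness's exact code):

# Sketch of the commit-style prompt behind --prompt starcodercommit:
# StarCoder's vocabulary includes git-commit special tokens, and the
# OctoPack setup formats fixing tasks with them; details may differ
# from the harness's exact implementation.
def starcodercommit_prompt(buggy_code: str, instruction: str) -> str:
    return f"<commit_before>{buggy_code}<commit_msg>{instruction}<commit_after>"

print(starcodercommit_prompt(
    "def add(a, b): return a - b",
    "Fix bugs in add.",
))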

@awasthiabhijeet
Author

With 20 samples and T=0.2, I observe the following result (a pass@1 of 58.9, compared to the 43.5 reported in the paper):

{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

Would it be possible for you to re-compute these numbers just to be sure?
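
For anyone comparing runs, the metric file written via --metric_output_path can be inspected directly; a minimal sketch, assuming the file layout shown above:

import json

# Load the metric report to compare pass@k across runs; the path
# matches the config posted above.
path = "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json"
with open(path) as f:
    report = json.load(f)

scores = report["humanevalfixdocs-python"]
print(f"pass@1  = {scores['pass@1']:.3f}")   # 0.589
print(f"pass@10 = {scores['pass@10']:.3f}")  # 0.699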

@Muennighoff
Collaborator

Discussion moved to #21
