
Performance of WizardCoder on HumanEvalFixDocs #18

Closed
awasthiabhijeet opened this issue Sep 4, 2023 · 8 comments

Comments

@awasthiabhijeet

Hi @Muennighoff,

Thank you for releasing many useful resources.

QQ: Do you know the accuracy of WizardCoder-15.5B on HumanEvalFixDocs? (i.e., where would WizardCoder stand in Table 12 of your paper?)

@Muennighoff
Collaborator

We didn't run that. I think it would be somewhere between OctoCoder & GPT-4.
You can run it easily like below, though:

accelerate launch main.py \
--model WizardCoder-15.5B \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt wizardcoder \
--save_generations_path generations_humanevalfixdocspython_wizardcoder.json \
--metric_output_path evaluation_humanevalfixdocspython_wizardcoder.json \
--max_length_generation 2048 \
--precision bf16
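
For context, the --prompt wizardcoder option formats each task with WizardCoder's Alpaca-style instruction template. A minimal sketch of what that wrapping presumably looks like (the exact template lives in the harness; buggy_function here is a hypothetical placeholder):

# Sketch of the Alpaca-style instruction template WizardCoder was tuned
# on; --prompt wizardcoder presumably wraps each HumanEvalFixDocs task
# roughly like this.
WIZARDCODER_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

buggy_function = "def has_close_elements(numbers, threshold): ..."
prompt = WIZARDCODER_TEMPLATE.format(
    instruction="Fix bugs in has_close_elements.\n" + buggy_function
)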

@awasthiabhijeet
Author

Thanks!

@awasthiabhijeet
Author

I observe an accuracy of 51.2 with WizardCoder (thus lower than OctoCoder, and also lower than WizardCoder's performance on HumanEval).

PS: I'm reporting the pass@1 score with greedy decoding.

As a sanity check, would it be possible for you to confirm if you are observing the same performance?
CC: @Muennighoff

@awasthiabhijeet
Author

With greedy decoding, StarCoder gives a pass@1 of 61.6.

@Muennighoff
Collaborator

> I observe an accuracy of 51.2 with WizardCoder (thus lower than OctoCoder, and also lower than WizardCoder's performance on HumanEval).
>
> PS: I'm reporting the pass@1 score with greedy decoding.
>
> As a sanity check, would it be possible for you to confirm if you are observing the same performance? CC: @Muennighoff

Using --temperature 0.2 --n_samples 20 would likely increase the score a bit.

For StarCoder, which prompt are you using? Surprised it is that high. Would be curious to know what you get with --temperature 0.2 --n_samples 20.
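
For reference on why sampling can shift the number: pass@k here is presumably the standard unbiased estimator from the Codex paper (Chen et al., 2021), so with 20 samples pass@1 averages per-problem success rates rather than scoring a single greedy completion. A minimal sketch:

import math

# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), for n samples with c correct,
# computed as a numerically stable running product.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With n_samples 20, pass@1 is the mean per-problem success rate:
print(pass_at_k(n=20, c=12, k=1))   # 0.6
print(pass_at_k(n=20, c=5, k=10))   # ~0.98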

@awasthiabhijeet
Author

awasthiabhijeet commented Sep 6, 2023

I am using the starcodercommit prompt when I get 61.6:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16

Will try to run with 20 samples and temperature 0.2.

CC: @Muennighoff
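
For context, the starcodercommit prompt presumably leans on the git-commit special tokens in StarCoder's vocabulary, roughly along these lines (a sketch under that assumption, not the harness's exact code):

# Sketch of the commit-style prompt behind --prompt starcodercommit:
# StarCoder's vocabulary includes git-commit special tokens, and the
# OctoPack setup formats fixing tasks with them; details may differ
# from the harness's exact implementation.
def starcodercommit_prompt(buggy_code: str, instruction: str) -> str:
    return f"<commit_before>{buggy_code}<commit_msg>{instruction}<commit_after>"

print(starcodercommit_prompt(
    "def add(a, b): return a - b",
    "Fix bugs in add.",
))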

@awasthiabhijeet
Author

With 20 samples and T=0.2, I observe the following result (a pass@1 of 58.9, compared to the 43.5 reported in the paper):

{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

Would it be possible for you to re-compute these numbers just to be sure?
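
For anyone comparing runs, the metric file written via --metric_output_path can be inspected directly; a minimal sketch, assuming the file layout shown above:

import json

# Load the metric report to compare pass@k across runs; the path
# matches the config posted above.
path = "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json"
with open(path) as f:
    report = json.load(f)

scores = report["humanevalfixdocs-python"]
print(f"pass@1  = {scores['pass@1']:.3f}")   # 0.589
print(f"pass@10 = {scores['pass@10']:.3f}")  # 0.699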

@Muennighoff
Collaborator

Discussion moved to #21
