This repository contains the code implementation for the paper "Functional Overlap Reranking for Neural Code Generation", accepted as a long paper at the Findings of ACL 2024.
Authors: Hung Q. To, Minh H. Nguyen, Nghi D. Q. Bui
We introduce SRank, a novel reranking strategy for selecting the best solutions from code generation models, focusing on modeling the relationships between clusters of solutions. By quantifying the functional overlap between solution clusters, our approach provides a superior ranking strategy for code solutions. Empirical results demonstrate that our method achieves remarkable improvements in the pass@1 score. For instance, on the HumanEval benchmark, we achieve 69.66% with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, significantly surpassing state-of-the-art code generation reranking methods such as CodeT and Coder-Reviewer by an average margin of ≈6.1%. Compared to random sampling, we observe an average improvement of ≈23.07% on HumanEval and 17.64% on MBPP, showcasing the robustness and superiority of our approach even in scenarios with limited test inputs.
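As a rough illustration of the core idea (a simplified sketch only, not the exact scoring used in this repository; the function and variable names below are hypothetical), solutions are first clustered by their execution outputs on a set of generated test inputs, and each cluster is then scored by its size-weighted functional overlap with the other clusters:

```python
# Simplified sketch of functional-overlap reranking (illustrative, not the
# repository's exact implementation). `outputs[i][k]` is assumed to hold the
# execution output of solution i on test input k (with a sentinel for errors).
from collections import defaultdict

def rerank_by_functional_overlap(outputs):
    # 1. Cluster solutions that produce identical outputs on every test input.
    clusters = defaultdict(list)          # output signature -> solution indices
    for i, outs in enumerate(outputs):
        clusters[tuple(outs)].append(i)
    signatures = list(clusters)

    # 2. Functional overlap of two clusters = fraction of test inputs on which
    #    their solutions produce the same output.
    def overlap(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    # 3. Score each cluster by its overlap with every cluster, weighted by that
    #    cluster's size (the self term simply adds the cluster's own size).
    def score(sig):
        return sum(len(clusters[other]) * overlap(sig, other) for other in signatures)

    best = max(signatures, key=score)
    return clusters[best]                 # indices of the top-ranked solutions

# Toy example: 4 solutions evaluated on 3 test inputs.
toy_outputs = [[1, 2, 3], [1, 2, 3], [1, 0, 3], [9, 9, 9]]
print(rerank_by_functional_overlap(toy_outputs))   # -> [0, 1]
```

In this toy setup, the two solutions that agree on all three test inputs (and partially agree with a third solution) form the top-ranked cluster; any solution from that cluster can be returned as the final answer.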
The tables below show the pass@1 results of SRank on various benchmarks in the zero-shot setting, compared to baselines and state-of-the-art methods.
| Method | WizardCoder34B | WizardCoder15B | CodeGen2.5-Instruct | StarCoder | Codex002 | CodeGen16B |
| --- | --- | --- | --- | --- | --- | --- |
| Greedy | 68.90 | 50.61 | 28.05 | 39.63 | 47.00 | 29.70 |
| CodeT | 72.36 | 58.64 | 56.81 | 50.51 | 65.80 | 36.70 |
| Coder-Reviewer | - | 49.37 | 45.63 | 38.71 | 66.90 | 42.60 |
| Random | 59.88 | 45.20 | 26.68 | 32.55 | 37.06 | 22.78 |
| SRank | 75.31 | 59.99 | 60.55 | 53.99 | 69.66 | 43.07 |

Table 1: Results of pass@1 on HumanEval.
| Method | WizardCoder34B | WizardCoder15B | CodeGen2.5-Instruct | StarCoder | Codex002 | CodeGen16B |
| --- | --- | --- | --- | --- | --- | --- |
| Greedy | 60.42 | 51.29 | 42.86 | 45.90 | 58.10 | 42.40 |
| CodeT | 63.39 | 58.18 | 55.02 | 58.05 | 67.70 | 49.50 |
| Coder-Reviewer | - | 52.52 | 52.74 | 49.48 | 64.70 | 50.30 |
| Random | 54.37 | 45.72 | 34.60 | 39.26 | 47.50 | 31.54 |
| SRank | 64.14 | 59.01 | 57.02 | 58.38 | 69.25 | 51.03 |

Table 2: Results of pass@1 on MBPP-S.
| Method | Introduction | Interview | Competition |
| --- | --- | --- | --- |
| Random | 20.35 | 3.11 | 0.74 |
| Greedy | 27.20 | 5.10 | 1.80 |
| CodeT | 34.60 | 8.10 | 2.20 |
| SRank | 37.79 | 9.53 | 3.29 |

Table 3: Results of pass@1 on the APPS benchmark using Codex002.
Please refer to our paper for detailed explanations of these results and additional findings, including ablation studies.
To set up the environment and dependencies, follow these steps:
- Ensure you have Python 3.9.17 installed.
- Install pyminifier from source. Note that you may need to revert setuptools to an older version (`pip install setuptools==57.5.0`); refer to the pyminifier issues for potential fixes.
- Install human-eval from source.
- Install additional dependencies: `pip install -r requirements.txt`
This repository supports experiments with the following models and datasets from our paper.

Models:

- wizardcoder34B
- wizardcoder15B
- codegen25
- starcoder
- davinci002
- codegen16B

Datasets:

- humaneval
- mbpp
- apps
Our CodeLLM-based code generation process involves three main steps:
- CodeLLM-based Generation
  - Code solution generation
  - Test case generation
  - Post-processing code solutions and test cases
- Code Execution
- Reranking
The scripts below take the following arguments:

- `device_ids`: GPU device IDs
- `model`: one of the available models listed above
- `dataset`: one of the available datasets listed above
- `max_sequence_length`: maximum sequence length for the LLM
- `number_of_sequences`: number of samples drawn from the LLM
- `running_script`: Python script for the corresponding model
- `reranking_method`: reranking method applied to code solution clusters (options: `random`, `srank`)
Default hyperparameters: `temperature=0.8`, `top_p=0.95`.
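As a rough illustration of how these defaults are used when sampling candidate solutions (a minimal sketch only, not the repository's actual running script; the checkpoint name, prompt, and argument values below are assumptions):

```python
# Illustrative only: drawing several samples from a causal LM with the default
# temperature/top_p above. Checkpoint, prompt, and counts are assumed values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "WizardLM/WizardCoder-15B-V1.0"   # assumed Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'  # task prompt (format depends on the model)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
samples = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,            # default hyperparameter
    top_p=0.95,                 # default hyperparameter
    num_return_sequences=8,     # ${number_of_sequences}
    max_length=2048,            # ${max_sequence_length}
    pad_token_id=tokenizer.eos_token_id,
)
solutions = tokenizer.batch_decode(samples, skip_special_tokens=True)
```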
To generate code solutions, navigate to the appropriate directory and run the script:
cd generation/gen_code/sh
./run.sh ${device_ids} ${model} ${dataset} ${max_sequence_length} ${number_of_sequences} ${running_script}
Example:
./run.sh 0,1,2,3 wizardcoder humaneval 2048 8 wizardcoder.py
Post-process the raw data:
./postprocess.sh ${model} ${dataset}
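Example (using the same model and dataset names as in the generation example above):
./postprocess.sh wizardcoder humaneval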
Results are saved to `preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl`.
Navigate to the test case generation directory and run the script:
cd generation/gen_test/sh
./run.sh ${device_ids} ${model} ${dataset} ${max_sequence_length} ${number_of_sequences}
Example:
./run.sh 0,1,2,3 wizardcoder humaneval 2048 8
Post-process the test cases:
./postprocess.sh ${model} ${dataset}
Results are saved to `preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl`.
Navigate to the execution directory and run the command:
cd execution/sh
./run.sh ${model} ${dataset}
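Example (using the same model and dataset names as above):
./run.sh wizardcoder humaneval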
Execution results are saved to `results/${dataset}/${model}/T${temperature}_N${num_samples}/`.
Navigate to the reranking directory and run the script:
cd reranking/sh
./run.sh ${model} ${dataset} ${temperature} ${num_samples} ${reranking_method}
Example:
./run.sh wizardcoder humaneval 0.8 100 srank
This code base is adapted from:
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or collaborations, please contact:
- Hung Quoc To
- Email: [email protected]