
Leveraging static analysis for evaluating code-generation models

In recent times, the use of Large Language Models (LLMs) for code generation has gained substantial traction. Tools such as ChatGPT, GitHub Copilot, Bard, and Code Llama (Rozière et al.) aim to streamline developer workflows and expedite development cycles. Despite their promise, the code produced by these tools often contains bugs, which limits their overall utility. Existing methodologies primarily rely on resource-intensive runtime analysis to address these issues, while research exploring static analysis remains scarce and covers only a limited range of programming languages.

Our study aims to enrich the baseline code generation model by incorporating insights from static error analysis, potentially refining code generation quality. To achieve this objective, we introduce a pipeline that assimilates feedback gleaned from static analysis into the baseline model. Furthermore, we enhance the baseline model by fine-tuning it using samples previously rejected due to static errors. Our empirical observations underscore the efficacy of both strategies in mitigating the occurrence of observed static errors.

Relevant links:

About

This repository contains the code base for the project "Leveraging static analysis for evaluating code-generation models", developed during the CSCI 544 Applied Natural Language Processing course (Fall 2023) at the University of Southern California (USC).

Pipeline

Code Generation and Static Evaluation Pipeline with feedback

The pipeline uses automated feedback from linters (static code analyzers) to detect errors in generated code and to improve the underlying code-generation model through a multi-stage feedback loop.

Pipeline Overview:

  1. Context Generation: The pre-processing stage generates the context as part of a prompt, incorporating text from the dataset.
  2. Code Generation: The model utilizes the provided context along with text from the dataset to generate code.
  3. Linters Integration: Linters are executed on the generated code to identify errors.
  4. Feedback Loop: Detected errors are then fed back to the model, enhancing subsequent code generation.

This systematic approach identifies and reduces errors in the code generation process and gives insight into where the model falls short. Because the linters report error types and their locations, it also enables targeted corrections.

The pipeline is partly automated for evaluation and report generation as a proof of concept; a minimal sketch of the feedback loop is shown below.
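
The following is a minimal sketch of this feedback loop for Python code, assuming a hypothetical generate_code callable that wraps the baseline model and an illustrative prompt format; the actual notebooks in feedback_pipeline differ in detail.

    import subprocess
    import tempfile

    def run_flake8(code: str) -> str:
        """Write generated Python code to a temporary file and return flake8's diagnostics."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["flake8", path], capture_output=True, text=True)
        return result.stdout  # one "file:line:col: CODE message" entry per line

    def feedback_loop(problem_text: str, generate_code, max_rounds: int = 3) -> str:
        """Regenerate code until flake8 reports no errors or the round budget is exhausted."""
        prompt = f"Write a Python program for the following task:\n{problem_text}"
        code = generate_code(prompt)
        for _ in range(max_rounds):
            errors = run_flake8(code)
            if not errors:
                break  # statically clean, stop iterating
            # Feed the detected errors back to the model and regenerate.
            prompt = (
                f"{prompt}\n\nA previous attempt:\n{code}\n"
                f"produced these static-analysis errors:\n{errors}\n"
                "Fix them and return the corrected program."
            )
            code = generate_code(prompt)
        return code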

Fine-tuning:

Fine-tuning in this context aims to improve the baseline model's first-pass accuracy without requiring subsequent feedback rounds. The baseline code-generation model is refined using Direct Preference Optimization (DPO), with prompts constructed by the same procedure as the first stage of the feedback pipeline.

In this phase, we use quantized models to streamline loading and to make fine-tuning feasible on the available GPUs.
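
As a rough, version-dependent sketch (not the exact notebook code), 4-bit loading and DPO training can be combined with the Hugging Face transformers and trl libraries roughly as follows. Here preference_dataset is a placeholder for prompt/chosen/rejected triples built from accepted and linter-rejected samples, and trainer argument names may differ across trl releases.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import DPOTrainer

    model_id = "codellama/CodeLlama-7b-Instruct-hf"

    # Load the baseline model in 4-bit so it fits on a single P100/T4-class GPU.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # DPO trains on preference pairs: a prompt, a "chosen" (statically clean) completion,
    # and a "rejected" (linter-flagged) completion. preference_dataset is a placeholder.
    trainer = DPOTrainer(
        model=model,
        ref_model=None,  # the frozen reference copy is handled by the trainer / a PEFT adapter setup
        train_dataset=preference_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()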

Important

We used a P100 GPU on Kaggle and a T4 GPU on Google Colaboratory for our experiments.

We utilized the XLCoST dataset for the code completion task. This parallel dataset comprises solutions for data structures and algorithms problems in six programming languages: C++, Java, Python, PHP, C, and C#. Our experiment primarily focuses on program-level samples in C++ and Python. Our baseline model, CodeLlama-7b-Instruct-hf, was trained and evaluated using this dataset.
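
For reference, below is a minimal example of prompting the baseline model with transformers; the instruction text is illustrative and does not reproduce the exact context construction used in the notebooks.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "codellama/CodeLlama-7b-Instruct-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Illustrative program-level prompt in the CodeLlama-Instruct [INST] format.
    prompt = "[INST] Write a Python program that checks whether a string is a palindrome. [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(completion)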

Directory Structure

Directory                  Description
data                       Sampled raw and processed XLCoST data for training and evaluating the CodeLlama model
feedback_pipeline          Notebooks for running static analysis on code generated after multiple feedback loops
fine_tuning                Notebooks for fine-tuning CodeLlama models to improve code generation using enriched prompts
linter_setup_scripts       Bash scripts for installing, setting up, and supporting linters
preprocessing              Code snippets for pre-processing and parsing in notebooks
reports                    Project-related documentation and reports
results                    Results produced at different stages of the pipelines
static_analysis_pipeline   Scripts for the static-analysis pipeline that evaluates source scripts before and after feedback loops

Setup and Usage

Creating a Python Virtual Environment

  1. Navigate to the Project Directory:

    cd static_analysis_codegen_llms
  2. Create a Virtual Environment:

    python -m venv codegenllm
  3. Activate the Virtual Environment:

    • On Windows:
      codegenllm\Scripts\activate
    • On macOS and Linux:
      source codegenllm/bin/activate
  4. Install Project Dependencies:

    pip install -r requirements.txt

    This installs the libraries required for code generation, along with the linters used for Python code evaluation.

Install Linters for Static Evaluation

If the linters are not already installed, follow the instructions below.

  1. Flake8 for Python
    cd linter_setup_scripts/flake8_utils
    chmod +x install_flake8.sh
    bash install_flake8.sh
  2. Cppcheck for C++
    cd linter_setup_scripts/cppcheck_utils
    chmod +x install_cppcheck.sh
    bash install_cppcheck.sh

Authors

  1. Sai Anuroop Kesanapalli | MS in Computer Science | USC
  2. Abhishek Anand | MS in Computer Science | USC
  3. Kayvan Shah | MS in Applied Data Science | USC
  4. Indrani Panchangam | MS in Computer Science | USC
  5. Vishesh Mittal | MS in Computer Science | USC

LICENSE

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.

Disclaimer

The content and code provided in this repository are for educational and demonstrative purposes only. The project may contain experimental features, and the code might not be optimized for production environments. The authors and contributors are not liable for any misuse, damages, or risks associated with the use of this code. Users are advised to review, test, and modify the code to suit their specific use cases and requirements. By using any part of this project, you agree to these terms.
