ℹ️ About | 📖 Watermarked Program Transformation | 🚀 Quick Start
## ℹ️ About

Official implementation of "Is Watermarking LLM-Generated Code Robust?".
In the paper, we present the first study of the robustness of existing watermarking techniques on Python code generated by large language models. Although prior work has shown that watermarking can be robust for natural language, we show that these watermarks are easily removed from code by semantic-preserving transformations. We propose an algorithm that walks the Abstract Syntax Tree (AST) of the watermarked code and randomly applies semantic-preserving program modifications. We observe a significantly lower true-positive rate (TPR) of detection even under simple modifications, underscoring the need for robust LLM watermarks tailored specifically to code.
This repository contains code for:
- 🌊 Watermarking LLM-generated code
- 💯 Evaluating the functional correctness of watermarked code on the HumanEval dataset
- ✏️ Applying realistic semantic-preserving transformations to the watermarked code
If you find this repository useful, please cite our paper:
```bibtex
@misc{suresh2024watermarking,
      title={Is Watermarking LLM-Generated Code Robust?},
      author={Tarun Suresh and Shubham Ugare and Gagandeep Singh and Sasa Misailovic},
      year={2024},
      eprint={2403.17983},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```
## 📖 Watermarked Program Transformation

In practice, a user may modify watermarked LLM-generated code to better integrate it within a larger program or to evade detection. We assume that the user has only black-box input-output access to the model and no knowledge of the watermarking algorithm. The user can apply a series of semantic-preserving transformations, e.g., inserting print statements or renaming variables, to modify the code. We replicate these program modifications in an algorithm that:
- Takes the watermarked code and the number of transformations to apply as input.
- Parses the watermarked code to obtain the AST representation of the code.
- Randomly selects a transformation from a set of transformations to apply.
- Traverses the AST to determine the set of all possible insertion, deletion, or substitution locations for that transformation.
- Transforms the AST at a randomly selected subtree by replacing its sequence of terminals with a "hole" and then completing the hole with a random, syntactically valid sequence (see the sketch below).
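The following is a minimal sketch of this loop, not the repository's implementation. It assumes a hypothetical list of transformation objects, each exposing `sites(tree)` (candidate locations) and `apply(tree, site)` (rewrite one subtree); those names are illustrative only.

```python
import ast
import random

def perturb(watermarked_code: str, num_transformations: int, transformations) -> str:
    """Apply `num_transformations` randomly chosen semantic-preserving edits."""
    tree = ast.parse(watermarked_code)                 # AST representation of the code
    for _ in range(num_transformations):
        transform = random.choice(transformations)     # randomly select a transformation
        sites = transform.sites(tree)                  # all valid insertion/deletion/substitution locations
        if not sites:
            continue                                   # not applicable here; draw again next iteration
        tree = transform.apply(tree, random.choice(sites))  # rewrite one randomly chosen subtree
    return ast.unparse(tree)                           # back to source text (Python 3.9+)
```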
We implement the following semantic-preserving transformations:
| Transformation | Implemented |
|---|---|
| Replace True False | ✅ |
| Rename Variables | ✅ |
| Insert Print Statements | ✅ |
| Wrap With Try Catch | ✅ |
| Remove Comments | ✅ |
| Unroll While Loops | ✅ |
| Add Dead Code | ✅ |
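As an example of what such a transformation looks like, here is a minimal sketch (not the repository's implementation) of Rename Variables using `ast.NodeTransformer`; the mapping `{"total": "var_0"}` is arbitrary.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Rename occurrences of local variables according to a fixed mapping."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

code = "def add(a, b):\n    total = a + b\n    return total"
tree = RenameVariables({"total": "var_0"}).visit(ast.parse(code))
print(ast.unparse(tree))  # `total` becomes `var_0`; the function's behaviour is unchanged
```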
## 🚀 Quick Start

Install human-eval: https://github.com/openai/human-eval

Install llama: https://github.com/facebookresearch/llama

The model checkpoint location is currently hardcoded to `/share/models/llama_model/llama/`.
The original LM Watermarking implementation is enabled by the huggingface/transformers 🤗 library. To convert the Llama model weights to the Hugging Face Transformers format, run the following script:
```bash
python lpw/convert_llama_weights_to_hf.py \
    --input_dir /share/models/llama_model/llama/ --model_size 13B --output_dir /share/models/llama_model/hf/13B/
```
Thereafter, models can be loaded via:
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("/share/models/llama_model/hf/13B/")
tokenizer = LlamaTokenizer.from_pretrained("/share/models/llama_model/hf/13B/")
```
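As a quick sanity check that the converted checkpoint loads, one can generate an unwatermarked completion with the standard `generate` API (the prompt below is arbitrary):

```python
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
completion = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(completion[0], skip_special_tokens=True))
```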
Watermarking can be run with any language model supported by Hugging Face. Currently, the supported watermarking algorithms are UMD, SWEET, Unigram, and RobDist. The following command runs watermarking with the UMD algorithm on Llama-7B to generate Python completions for problems in the HumanEval dataset:
```bash
python lpw/run_watermark.py \
    --model_name_or_path /share/models/llama_model/hf/Llama-7b --language python --dataset multi-humaneval
```
Watermarking results are written to `results/watermarking/`.
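For background, UMD-style watermarking (Kirchenbauer et al.) softly biases each sampling step toward a pseudo-random "green list" of tokens; detection then counts how many generated tokens fall in the green list. The snippet below is only an illustrative sketch of that idea as a Hugging Face `LogitsProcessor`, not the implementation behind `lpw/run_watermark.py`, and the `gamma`/`delta` values are arbitrary.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class GreenListWatermark(LogitsProcessor):
    """Bias sampling toward a pseudo-random 'green list' of tokens (illustrative sketch)."""
    def __init__(self, vocab_size: int, gamma: float = 0.25, delta: float = 2.0):
        self.vocab_size, self.gamma, self.delta = vocab_size, gamma, delta

    def __call__(self, input_ids, scores):
        for i in range(input_ids.shape[0]):
            # Seed the green list with the previous token so a detector can re-derive it.
            gen = torch.Generator().manual_seed(int(input_ids[i, -1]))
            green = torch.randperm(self.vocab_size, generator=gen)[: int(self.gamma * self.vocab_size)]
            scores[i, green] = scores[i, green] + self.delta  # nudge generation toward green tokens
        return scores

# Illustrative usage with the model/tokenizer loaded above:
# model.generate(**inputs, logits_processor=LogitsProcessorList([GreenListWatermark(len(tokenizer))]))
```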
Thereafter, one can apply any of the aforementioned semantic-preserving transformations, or a combination of them, to the watermarked code. The following command inserts 5 print statements into the watermarked code:
```bash
python lpw/perturb_watermark.py \
    --model_name_or_path /share/models/llama_model/hf/Llama-7b --language python --dataset multi-humaneval --perturbation_ids 5 --depths 5
```
A full list of arguments can be found here.
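For intuition, the hand-written snippet below (illustrative only, not actual tool output) shows the kind of edit this perturbation makes: the inserted print statement changes the token sequence a watermark detector sees, but every input still returns the same value, so HumanEval functional-correctness checks are unaffected.

```python
# Before: watermarked completion
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# After: same semantics, different surface form
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            print("comparing elements", i, j)
            if i != j and abs(a - b) < threshold:
                return True
    return False
```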