Releases · evalplus/repoqa
RepoQA v0.1.2
Notable updates
- Fixed wget dependency
- Propagated trust_remote_code for tokenizers (see the sketch below)
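As a rough illustration of what this propagation means, here is a minimal sketch, assuming the HF-based backends load tokenizers through Hugging Face transformers; the helper below is illustrative, not RepoQA's actual code:

```python
from transformers import AutoTokenizer

def load_tokenizer(model_id: str, trust_remote_code: bool = False):
    # Hypothetical helper: the --trust-remote-code CLI flag should reach the
    # tokenizer as well as the model, otherwise checkpoints that ship custom
    # tokenizer code fail to load.
    return AutoTokenizer.from_pretrained(
        model_id,
        trust_remote_code=trust_remote_code,
    )

tokenizer = load_tokenizer("Qwen/CodeQwen1.5-7B-Chat", trust_remote_code=True)
```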
Resources
- PyPI: https://pypi.org/project/repoqa/0.1.2/
- Homepage: https://evalplus.github.io/repoqa.html
- Dataset release: https://github.com/evalplus/repoqa_release
RepoQA v0.1.1
Notable updates
- Trimming output before post-processing notably improved results in certain cases @ganler
- Fixed HF backend @zyzzzz-123 @ganler
- HF backend supports attn-implementation to enable flash-attn 2 (see the sketch after this list) @ganler
- Optimized the computation of trained context size @JialeTomTian #38
- End-of-string optimization substantially improved inference speed @ganler
- Optimized post-processing accuracy with a better regular expression @ganler
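For context on the attn-implementation option, here is a minimal sketch of how such a flag typically reaches the model loader, assuming the HF backend builds models with transformers' AutoModelForCausalLM; everything besides the transformers API itself is illustrative, not RepoQA's actual internals:

```python
import torch
from transformers import AutoModelForCausalLM

def load_hf_model(model_id: str,
                  attn_implementation: str = "eager",
                  trust_remote_code: bool = False):
    # Hypothetical loader: passing attn_implementation="flash_attention_2"
    # asks transformers to use FlashAttention-2 kernels (the flash-attn
    # package must be installed and the GPU/dtype must support it).
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_implementation,
        trust_remote_code=trust_remote_code,
        device_map="auto",
    )

model = load_hf_model("Qwen/CodeQwen1.5-7B-Chat",
                      attn_implementation="flash_attention_2",
                      trust_remote_code=True)
```

This mirrors the `--attn-implementation "flash_attention_2"` CLI example shown under Quick examples below.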
Only finished features and fixes are listed as notable updates; work-in-progress changes will be listed in subsequent releases once they are fully done.
Full changelog: v0.1.0...v0.1.1
Quick examples
pip install repoqa
repoqa.search_needle_function --model "gpt4-turbo" --backend openai
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code --attn-implementation "flash_attention_2"
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
Resources
- PyPI: https://pypi.org/project/repoqa/0.1.1/
- Homepage: https://evalplus.github.io/repoqa.html
- Dataset release: https://github.com/evalplus/repoqa_release
RepoQA v0.1.0
RepoQA for Long-Context Code Understanding
Introduction
RepoQA is a benchmark that aims to exercise LLMs' long-context code understanding ability.
- Multi-Lingual: RepoQA now supports repositories from 5 programming languages:
- Python
- C++
- TypeScript
- Rust
- Java
- Application-driven: RepoQA aims to evaluate LLMs on long-context tasks that reflect real-life use. Before RepoQA, long-context evaluations mainly focused on synthetic tasks that probe the weak spots of an LLM's long context, such as "Needle in the Code" by CodeQwen and "Needle in a Haystack".
- The first RepoQA task we propose is 🔍 Searching Needle Function:
- 500 sub-tasks = 5 PLs x 10 repos x 10 needles
- Asks the model to search for the function (which we call the needle function) that matches a precise natural language description (illustrated below)
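To make the task concrete, here is a minimal sketch of what one sub-task conceptually consists of; the field names and values are hypothetical and do not reflect the actual dataset schema:

```python
# Hypothetical sub-task layout (illustrative only): the model is given a long
# slice of repository code plus a natural language description, and must
# return the matching ("needle") function.
subtask = {
    "language": "python",                    # one of the 5 supported languages
    "repo": "example-org/example-repo",      # one of 10 repos per language
    "description": (
        "Returns the longest common prefix shared by all strings in the "
        "list, or an empty string if the list is empty."
    ),
    "code_context": "<many thousands of tokens of repository code>",
}

# The model's answer is compared against the ground-truth needle function.
```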
RepoQA is easy to use
- Supports the following backends:
- OpenAI
- Anthropic
- vLLM
- HuggingFace transformers
- Google Generative AI API (Gemini)
- 🚀 Evaluation can be done in one command
- 🏆 A leaderboard: https://evalplus.github.io/repoqa.html
Quick examples
pip install repoqa
repoqa.search_needle_function --model "gpt4-turbo" --backend openai
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
Resources
- PyPI: https://pypi.org/project/repoqa/0.1.0/
- Homepage: https://evalplus.github.io/repoqa.html
- Dataset release: https://github.com/evalplus/repoqa_release
RepoQA v0.1.0 Release Candidate 1
v0.1.0rc1 refactor: clean files for release
RepoQA Search-Needle-Function Dataset 2024-04-20
dev-dataset refactor: optimize dataset name
Evaluated Results
See attachment; some results might be incomplete.
Release of dependency and base dataset
We use this release to upload the dependency files for the different languages, produced by https://github.com/evalplus/repoqa/tree/main/scripts/curate/dep_analysis