Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

openhands-agent · 2024-11-23T13:15:59Z

This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.

Changes

Created evaluation/benchmarks/ directory to house all ML literature benchmarks
Kept utility directories (utils, integration_tests, regression, static) directly under evaluation/
Updated paths in documentation and GitHub workflows to reflect the new structure
Added missing benchmarks to evaluation/README.md:
- Commit0 and DiscoveryBench under Software Engineering
- Browsing Delegation under Web Browsing
- ScienceAgentBench under Misc. Assistance
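
The layout rule described above can be sketched as a small path-mapping helper. This is only an illustration, not code from the PR: the function name `relocate` and the benchmark name `some_benchmark` are hypothetical, and the set of directories kept in place is taken from the description above.

```python
from pathlib import PurePosixPath

# Directories the PR description says stay directly under evaluation/.
KEPT_AT_TOP_LEVEL = {"utils", "integration_tests", "regression", "static"}


def relocate(path: str) -> str:
    """Map a pre-refactor path to its assumed post-refactor location.

    Hypothetical helper for illustration; only paths that point at
    something inside a subdirectory of evaluation/ are rewritten.
    """
    parts = PurePosixPath(path).parts
    # Leave paths outside evaluation/ and files directly inside it untouched.
    if len(parts) < 3 or parts[0] != "evaluation":
        return path
    # Utility directories keep their place; already-moved paths are unchanged.
    if parts[1] in KEPT_AT_TOP_LEVEL or parts[1] == "benchmarks":
        return path
    # Everything else is treated as a benchmark and moves under benchmarks/.
    return str(PurePosixPath("evaluation", "benchmarks", *parts[1:]))
```

For example, `relocate("evaluation/some_benchmark/run_infer.py")` yields a path under `evaluation/benchmarks/`, while `relocate("evaluation/utils/shared.py")` is returned unchanged. Applying the helper twice gives the same result as applying it once, which makes it safe to run over paths that were already updated.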

Testing

All pre-commit hooks pass (ruff, mypy, etc.)
All unit tests pass (377 tests)

Review Notes

Key files to review:

.github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
evaluation/README.md - Added missing benchmarks and updated paths
Documentation files - Updated references to benchmark paths

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:a759dd8-nikolaik \
  --name openhands-app-a759dd8 \
  docker.all-hands.dev/all-hands-ai/openhands:a759dd8

neubig · 2024-11-23T13:51:21Z

Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.

Fix issue #5222: [Refactor]: Refactor the evaluation directory

2a3460c

github-actions bot mentioned this pull request Nov 23, 2024

[Refactor]: Refactor the evaluation directory #5222

Open

openhands-agent added 2 commits November 23, 2024 13:31

Update paths to reference evaluation/benchmarks/ directory for benchmarks while keeping other directories directly under evaluation/

ecb1d81

Add missing benchmarks to evaluation/README.md

4c62196

neubig requested review from xingyaoww and mamoodi November 23, 2024 13:49

neubig marked this pull request as ready for review November 23, 2024 13:50

neubig mentioned this pull request Nov 23, 2024

docs: improve evaluation README with proper links and formatting #5221

Open

1 task

enyst self-requested a review November 23, 2024 16:25

enyst reviewed Nov 23, 2024

evaluation/README.md (resolved)

Fix imports to match new directory structure under evaluation/benchmarks/

a759dd8

neubig requested a review from enyst November 23, 2024 18:26
