Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

Open
wants to merge 4 commits into
base: main
from

openhands-agent
Contributor

@openhands-agent openhands-agent commented Nov 23, 2024

This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.

Changes

  • Created evaluation/benchmarks/ directory to house all ML literature benchmarks
  • Kept utility directories (utils, integration_tests, regression, static) directly under evaluation/
  • Updated paths in documentation and GitHub workflows to reflect the new structure
  • Added missing benchmarks to evaluation/README.md:
    • Commit0 and DiscoveryBench under Software Engineering
    • Browsing Delegation under Web Browsing
    • ScienceAgentBench under Misc. Assistance
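For illustration only, the move described in the first two bullets could be sketched as below. This is not the script used in this PR. The function name `move_benchmarks` and its `dry_run` flag are hypothetical, and the rule "every directory that is not a kept utility directory is a benchmark" is an assumption drawn from the description above.

```python
import shutil
from pathlib import Path

# Directories the PR description says stay directly under evaluation/.
KEEP_AT_TOP_LEVEL = {"utils", "integration_tests", "regression", "static"}


def move_benchmarks(eval_root: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Plan (and optionally perform) moving benchmark dirs into eval_root/benchmarks/.

    Hypothetical helper: assumes every other subdirectory is a benchmark.
    """
    target = eval_root / "benchmarks"
    moves = []
    for child in sorted(eval_root.iterdir()):
        # Skip plain files such as README.md.
        if not child.is_dir():
            continue
        # Skip the kept utility directories and the target directory itself.
        if child.name in KEEP_AT_TOP_LEVEL or child.name == "benchmarks":
            continue
        # Skip hidden and cache directories (e.g. .git, __pycache__).
        if child.name.startswith((".", "__")):
            continue
        moves.append((child, target / child.name))
    if not dry_run:
        target.mkdir(exist_ok=True)
        for src, dst in moves:
            shutil.move(str(src), str(dst))
    return moves
```

With `dry_run=True` (the default) the function only returns the planned `(source, destination)` pairs, so the plan can be inspected before anything is touched. In a Git checkout, `git mv` would be the more natural way to perform the actual move, since it stages the rename directly.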

Testing

  • All pre-commit hooks pass (ruff, mypy, etc.)
  • All unit tests pass (377 tests)

Review Notes

Key files to review:

  • .github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
  • evaluation/README.md - Added missing benchmarks and updated paths
  • Documentation files - Updated references to benchmark paths
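When reviewing the path updates in these files, the intended rewrite can be thought of as the following sketch. It is an assumption about the shape of the change, not the tooling used in this PR. The function `rewrite_benchmark_paths` is hypothetical, and the benchmark directory names must be supplied by the caller because the description above does not list them.

```python
import re


def rewrite_benchmark_paths(text: str, benchmark_names: list[str]) -> str:
    """Rewrite evaluation/<name> to evaluation/benchmarks/<name> for the given names.

    Hypothetical helper; benchmark_names are directory names chosen by the caller.
    """
    if not benchmark_names:
        # An empty alternation would match every "evaluation/" prefix, so bail out.
        return text
    alternatives = "|".join(re.escape(name) for name in benchmark_names)
    # (?<![\w-]) keeps longer words such as "my_evaluation/" from matching;
    # (?![\w-]) requires the benchmark name to end at a path boundary.
    pattern = re.compile(rf"(?<![\w-])evaluation/({alternatives})(?![\w-])")
    return pattern.sub(r"evaluation/benchmarks/\1", text)
```

Because an already rewritten path continues with `benchmarks/` rather than a benchmark name, applying the function twice leaves the text unchanged. References to the utility directories (for example `evaluation/utils`) are untouched as long as they are not passed in `benchmark_names`.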

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:a759dd8-nikolaik \
  --name openhands-app-a759dd8 \
  docker.all-hands.dev/all-hands-ai/openhands:a759dd8

@neubig neubig marked this pull request as ready for review November 23, 2024 13:50
Contributor

neubig commented Nov 23, 2024

Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.

@neubig neubig requested a review from enyst November 23, 2024 18:26
Development

Successfully merging this pull request may close these issues.

[Refactor]: Refactor the evaluation directory
3 participants