Replies: 2 comments 1 reply
-
Fully agree
I think we already have a clear work package in this:
After this we could for example:
1 reply
-
AWESOME! I'm excited about these ideas, though I admit some of my points might be less developed at this stage. I believe our community will appreciate the direction we're heading.
-
With APPS and MBPP integrated, we want to move towards a benchmark-driven development paradigm. Roughly, this means that we want to make it really easy for community members to benchmark their craziest ideas using our infrastructure, and when something works really well, we will integrate it into the CLI tool.
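As a rough illustration of that loop, here is a minimal sketch of what running an idea against a benchmark could boil down to. The `Agent` protocol, the `Task` fields, and the `run_benchmark` helper are all hypothetical names for illustration, not existing APIs:

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Agent(Protocol):
    """Anything that turns a task prompt into candidate code."""

    def solve(self, prompt: str) -> str: ...


@dataclass
class Task:
    """One benchmark item, e.g. a single APPS or MBPP problem."""

    prompt: str
    check: Callable[[str], bool]  # True if the generated code passes the task's tests


def run_benchmark(agent: Agent, tasks: list[Task]) -> float:
    """Run an agent over all tasks and return the fraction solved."""
    solved = sum(1 for task in tasks if task.check(agent.solve(task.prompt)))
    return solved / len(tasks)
```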
What this will look like in greater detail and which initial work packages we need are to be discussed in this thread. I'll shoot out some ideas:
Public benchmark scoreboard: a publicly hosted website (potentially on GitHub) that keeps track of the performance of different agent implementations. Getting a score up should be at least semi-automatic. It would make sense to log which GitHub repo at which commit was run on which benchmarks, and potentially further statistics such as runtime (as a proxy for token usage, which would be harder to log). A hypothetical record format is sketched after this list.
(Tricky one) Explicit modularization of improvement areas. A potential problem is that, over time, users will contribute high-scoring agents with different strengths and weaknesses, and we will probably want to cherry-pick features. It makes sense to proactively segment "improvement areas" such as self-heal, RAG, and diffs, and make it easy for users to focus on improving one such area at a time, streamlining the integration process into the CLI; a rough interface sketch also follows below.
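For the scoreboard, the record we log per run could be as simple as the sketch below. All field names (and the example values) are assumptions about what we would want to track, not an existing schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScoreboardEntry:
    """One row on the public scoreboard (hypothetical schema)."""

    repo: str         # GitHub repo of the agent implementation
    commit: str       # exact commit that was benchmarked
    benchmark: str    # e.g. "APPS" or "MBPP"
    score: float      # fraction of tasks solved
    runtime_s: float  # wall-clock runtime, a rough proxy for token usage


entry = ScoreboardEntry(
    repo="github.com/someuser/my-agent",  # hypothetical repo
    commit="abc1234",
    benchmark="MBPP",
    score=0.61,
    runtime_s=540.0,
)

# Appending one JSON line per run keeps the scoreboard diff-friendly,
# which matters if the board itself lives in a GitHub repo.
print(json.dumps(asdict(entry)))
```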
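And one way to make the improvement areas explicit is to model each one (self-heal, RAG, diffs, ...) as a swappable step in a pipeline. Again, this is only a sketch of the idea and every name in it is invented:

```python
from typing import Callable, Protocol


class ImprovementArea(Protocol):
    """One swappable capability, e.g. self-heal, RAG, or diff application."""

    name: str

    def apply(self, code: str, context: dict) -> str: ...


class SelfHeal:
    """Hypothetical self-heal step: patch the code when the last run failed."""

    name = "self-heal"

    def __init__(self, fix: Callable[[str, str], str]):
        self.fix = fix  # takes (code, error) and returns patched code

    def apply(self, code: str, context: dict) -> str:
        error = context.get("last_error")
        return self.fix(code, error) if error else code


def run_pipeline(code: str, context: dict, areas: list[ImprovementArea]) -> str:
    """Apply each improvement area in turn; contributors swap in their own."""
    for area in areas:
        code = area.apply(code, context)
    return code
```

A contributor focused on, say, RAG would then only replace that one step, and the benchmark harness plus scoreboard above would tell us whether the swap is worth integrating into the CLI.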
Let's discuss and turn this into concrete work packages @ErikBjare @captivus @TheoMcCabe @AntonOsika @similato87 @azrv