Replies: 2 comments 1 reply
-
Fully agree
I think we already have a clear work package in this:
After this we could for example:
1 reply
-
AWESOME! I'm excited about these ideas, though I admit some of my points might be less developed at this stage. I believe our community will appreciate the direction we're heading.
-
With APPS and MBPP integrated, we want to move towards a benchmark-driven development paradigm. Roughly, this means that we want to make it really easy for community members to benchmark their craziest ideas using our infrastructure, and when something works really well, we will integrate it into the CLI tool.
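As a rough illustration of that loop, here is a minimal sketch of what running an idea against a benchmark could boil down to. The `Agent` protocol, the `Task` fields, and the `run_benchmark` helper are all hypothetical names for illustration, not existing APIs:

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Agent(Protocol):
    """Anything that turns a task prompt into candidate code."""

    def solve(self, prompt: str) -> str: ...


@dataclass
class Task:
    """One benchmark item, e.g. a single APPS or MBPP problem."""

    prompt: str
    check: Callable[[str], bool]  # True if the generated code passes the task's tests


def run_benchmark(agent: Agent, tasks: list[Task]) -> float:
    """Run an agent over all tasks and return the fraction solved."""
    solved = sum(1 for task in tasks if task.check(agent.solve(task.prompt)))
    return solved / len(tasks)
```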
What this will look like in greater detail and which initial work packages we need are to be discussed in this thread. I'll shoot out some ideas:
Public benchmark scoreboard: a publicly hosted website (potentially on GitHub) that keeps track of the performance of different agent implementations. Getting a score up should be at least semi-automatic. It would make sense to log which GitHub repo at which commit was run on which benchmarks, and potentially further statistics such as runtime (as a proxy for token usage, which would be harder to log). A hypothetical record format is sketched after this list.
(Tricky one) Explicit modularization of improvement areas. A potential problem is that, over time, users will contribute high-scoring agents with different strengths and weaknesses, and we will probably want to cherry-pick features. It makes sense to proactively segment "improvement areas" such as self-heal, RAG, and diffs, and make it easy for users to focus on improving one such area at a time, streamlining the integration process into the CLI; a rough interface sketch also follows below.
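For the scoreboard, the record we log per run could be as simple as the sketch below. All field names (and the example values) are assumptions about what we would want to track, not an existing schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScoreboardEntry:
    """One row on the public scoreboard (hypothetical schema)."""

    repo: str         # GitHub repo of the agent implementation
    commit: str       # exact commit that was benchmarked
    benchmark: str    # e.g. "APPS" or "MBPP"
    score: float      # fraction of tasks solved
    runtime_s: float  # wall-clock runtime, a rough proxy for token usage


entry = ScoreboardEntry(
    repo="github.com/someuser/my-agent",  # hypothetical repo
    commit="abc1234",
    benchmark="MBPP",
    score=0.61,
    runtime_s=540.0,
)

# Appending one JSON line per run keeps the scoreboard diff-friendly,
# which matters if the board itself lives in a GitHub repo.
print(json.dumps(asdict(entry)))
```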
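And one way to make the improvement areas explicit is to model each one (self-heal, RAG, diffs, ...) as a swappable step in a pipeline. Again, this is only a sketch of the idea and every name in it is invented:

```python
from typing import Callable, Protocol


class ImprovementArea(Protocol):
    """One swappable capability, e.g. self-heal, RAG, or diff application."""

    name: str

    def apply(self, code: str, context: dict) -> str: ...


class SelfHeal:
    """Hypothetical self-heal step: patch the code when the last run failed."""

    name = "self-heal"

    def __init__(self, fix: Callable[[str, str], str]):
        self.fix = fix  # takes (code, error) and returns patched code

    def apply(self, code: str, context: dict) -> str:
        error = context.get("last_error")
        return self.fix(code, error) if error else code


def run_pipeline(code: str, context: dict, areas: list[ImprovementArea]) -> str:
    """Apply each improvement area in turn; contributors swap in their own."""
    for area in areas:
        code = area.apply(code, context)
    return code
```

A contributor focused on, say, RAG would then only replace that one step, and the benchmark harness plus scoreboard above would tell us whether the swap is worth integrating into the CLI.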
Let's discuss and turn this into concrete work packages @ErikBjare @captivus @TheoMcCabe @AntonOsika @similato87 @azrv