Evaluation of instruction-tuned models is difficult for many of the properties we actually care about.
Language modelling and multiple-choice benchmarks may capture some aspects of knowledge and reasoning, but they miss many of the properties we care about in instruction-tuned dialogue agents, like long-term coherence, multi-task generalisation, the ability to use tools, harmlessness, etc.
To address this, we can try to use LLMs to evaluate LLMs.
Ways to do this (in order of increasing complexity):

1. Generate an LM or forced-choice QA dataset and evaluate the instruct model offline.
2. Use reward functions on generations from some (possibly generated) prompt dataset (e.g. learned RMs, zero-shot LLM reward functions, etc.); see the sketch after this list.
3. Online exploration and evaluation using another LLM.
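
For option 2, a minimal sketch of a zero-shot LLM reward function could look like the following. `query_llm` is a hypothetical helper standing in for whichever completion API we end up wrapping (OpenAI, a local model, etc.), and the rating prompt wording is only illustrative:

```python
# Sketch of a zero-shot LLM reward function (option 2 above).
# `query_llm` is a hypothetical helper for whatever completion API we wrap.
from typing import Callable

RATING_PROMPT = """You are evaluating an assistant's reply.

Prompt: {prompt}
Reply: {reply}

Rate the reply from 1 (very poor) to 10 (excellent) for helpfulness and
harmlessness. Answer with a single integer."""


def zero_shot_reward(
    prompt: str,
    reply: str,
    query_llm: Callable[[str], str],
) -> float:
    """Score a (prompt, reply) pair with an evaluator LLM, returning a scalar reward."""
    raw = query_llm(RATING_PROMPT.format(prompt=prompt, reply=reply))
    try:
        # Take the first token of the evaluator's answer as the rating.
        score = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparsable ratings get a zero reward
    return float(max(1, min(score, 10)))
```

Averaging this reward over a fixed prompt dataset would give a single scalar for comparing checkpoints or models.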
We should implement these in our repository.
Basic implementations would be:

1. A script that uses langchain and some seed prompts to generate a multiple-choice dataset (see the sketch after this list).
2. A script that prompts LLMs to rate outputs or generate critiques.
3. A script that has an LLM attempt to use the LLM under test to complete some task, plus a check for whether that task was successfully completed.
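
A minimal sketch of the first script, assuming langchain's classic `PromptTemplate`/`LLMChain` interfaces and an `OPENAI_API_KEY` in the environment; the seed topics and the JSON output format are illustrative choices of ours, not anything langchain provides:

```python
# Sketch: seed prompts -> multiple-choice eval dataset via langchain.
# The JSON schema below is our own convention, so parsing is best-effort.
import json

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

TEMPLATE = """Write a multiple-choice question about {topic}.
Return JSON with keys "question", "choices" (a list of 4 strings) and
"answer" (the index of the correct choice). Return only the JSON."""

SEED_TOPICS = ["basic arithmetic", "world capitals", "python syntax"]


def generate_dataset(seed_topics, questions_per_topic=5):
    chain = LLMChain(
        llm=OpenAI(temperature=0.7),
        prompt=PromptTemplate(input_variables=["topic"], template=TEMPLATE),
    )
    dataset = []
    for topic in seed_topics:
        for _ in range(questions_per_topic):
            raw = chain.run(topic=topic)
            try:
                item = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip generations that don't follow the format
            if {"question", "choices", "answer"} <= item.keys():
                dataset.append(item)
    return dataset


if __name__ == "__main__":
    with open("mc_eval_dataset.json", "w") as f:
        json.dump(generate_dataset(SEED_TOPICS), f, indent=2)
```

The resulting file could then be scored offline against the instruct model with standard forced-choice accuracy.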