Evaluation of instruction-tuned models is difficult for many of the properties we actually care about.
Language modelling and multiple-choice benchmarks may capture some aspects of knowledge and reasoning, but they miss many of the properties we care about in instruction-tuned dialogue agents, like long-term coherence, multi-task generalisation, the ability to use tools, harmlessness, etc.
To address this, we can try to use LLMs to evaluate LLMs.
Ways to do this (in order of increasing complexity):

1. Generate an LM or forced-choice QA dataset and evaluate the instruct model offline.
2. Use reward functions on generations from some (possibly generated) prompt dataset (e.g. learned RMs, zero-shot LLM reward functions, etc.); see the sketch after this list.
3. Online exploration and evaluation using another LLM.
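
For option 2, a minimal sketch of a zero-shot LLM reward function could look like the following. `query_llm` is a hypothetical helper standing in for whichever completion API we end up wrapping (OpenAI, a local model, etc.), and the rating prompt wording is only illustrative:

```python
# Sketch of a zero-shot LLM reward function (option 2 above).
# `query_llm` is a hypothetical helper for whatever completion API we wrap.
from typing import Callable

RATING_PROMPT = """You are evaluating an assistant's reply.

Prompt: {prompt}
Reply: {reply}

Rate the reply from 1 (very poor) to 10 (excellent) for helpfulness and
harmlessness. Answer with a single integer."""


def zero_shot_reward(
    prompt: str,
    reply: str,
    query_llm: Callable[[str], str],
) -> float:
    """Score a (prompt, reply) pair with an evaluator LLM, returning a scalar reward."""
    raw = query_llm(RATING_PROMPT.format(prompt=prompt, reply=reply))
    try:
        # Take the first token of the evaluator's answer as the rating.
        score = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparsable ratings get a zero reward
    return float(max(1, min(score, 10)))
```

Averaging this reward over a fixed prompt dataset would give a single scalar for comparing checkpoints or models.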
We should implement these in our repository.
Basic implementations would be:

1. A script that uses langchain and some seed prompts to generate a multiple-choice dataset (see the sketch after this list).
2. A script that prompts LLMs to rate outputs or generate critiques.
3. A script that has an LLM attempt to use the LLM under test to complete some task, plus a check for whether that task was successfully completed.
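
A minimal sketch of the first script, assuming langchain's classic `PromptTemplate`/`LLMChain` interfaces and an `OPENAI_API_KEY` in the environment; the seed topics and the JSON output format are illustrative choices of ours, not anything langchain provides:

```python
# Sketch: seed prompts -> multiple-choice eval dataset via langchain.
# The JSON schema below is our own convention, so parsing is best-effort.
import json

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

TEMPLATE = """Write a multiple-choice question about {topic}.
Return JSON with keys "question", "choices" (a list of 4 strings) and
"answer" (the index of the correct choice). Return only the JSON."""

SEED_TOPICS = ["basic arithmetic", "world capitals", "python syntax"]


def generate_dataset(seed_topics, questions_per_topic=5):
    chain = LLMChain(
        llm=OpenAI(temperature=0.7),
        prompt=PromptTemplate(input_variables=["topic"], template=TEMPLATE),
    )
    dataset = []
    for topic in seed_topics:
        for _ in range(questions_per_topic):
            raw = chain.run(topic=topic)
            try:
                item = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip generations that don't follow the format
            if {"question", "choices", "answer"} <= item.keys():
                dataset.append(item)
    return dataset


if __name__ == "__main__":
    with open("mc_eval_dataset.json", "w") as f:
        json.dump(generate_dataset(SEED_TOPICS), f, indent=2)
```

The resulting file could then be scored offline against the instruct model with standard forced-choice accuracy.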