Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eval Instruct #3

Open
cat-state opened this issue Apr 13, 2023 · 0 comments
Open

Eval Instruct #3

cat-state opened this issue Apr 13, 2023 · 0 comments

Comments

@cat-state
Copy link
Collaborator

Evaluation of instruction tuned models is difficult for many of the properties we actually care about.
Language modelling and multiple choice benchmarks may capture some aspects of knowledge and reasoning but don't capture many of the properties we care about in instruction tuned dialog agents, like long term coherence, multi task generalisation, ability to use tools, harmlessness, etc.
To address this, we can try to use LLMs to evaluate LLMs.

Ways to do this (in order of increasing complexity):

  1. Generate a LM or forced choice QA dataset and evaluate the instruct model offline
  2. Use reward functions on generations from some (possibly generated) prompt dataset (e.g learn RMs, zero-shot LLM reward functions, etc)
  3. Online exploration and evaluation using another LLM

We should implement these into our repository.
Basic implementations would be:

  1. A script that uses langchain and some seed prompts to generate a multiple choice dataset.
  2. A script that prompts LLMs to rate outputs or generate critiques
  3. A script that has an LLM attempt to use the LLM under test to complete some task, and a check for if that task was successfully completed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant