title

abstract

layout

series

publisher

issn

id

month

tex_title

firstpage

lastpage

page

order

cycles

bibtex_author

author

date

address

container-title

volume

genre

issued

pdf

extras

Self-Evaluation Improves Selective Generation in Large Language Models

Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage LLMs’ superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a “None of the above” option to express the model’s uncertainty explicitly. We benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using TruthfulQA and TL;DR. Through extensive experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation-based scores not only improve accuracy, but also correlate better with the overall quality of generated content.

inproceedings

Proceedings of Machine Learning Research

PMLR

2640-3498

ren23a

0

Self-Evaluation Improves Selective Generation in Large Language Models

49

64

49-64

49

false

Ren, Jie and Zhao, Yao and Vu, Tu and Liu, Peter J. and Lakshminarayanan, Balaji

given	family
Jie	Ren

given	family
Yao	Zhao

given	family
Tu	Vu

given	family
Peter J.	Liu

given	family
Balaji	Lakshminarayanan

2023-04-24

Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops

239

inproceedings

date-parts

2023

4

24

https://proceedings.mlr.press/v239/ren23a/ren23a.pdf

2023-04-24-ren23a.md