RLHF without RL - Direct Preference Optimization | ICLR Blogposts 2024
We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the model during training, thereby significantly simplifying the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/
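For readers skimming the linked post, the core mechanism it describes is the DPO loss on preference pairs: the policy is trained to prefer the human-chosen completion over the rejected one, relative to a frozen reference model, using a classification-style objective on logged data rather than reward-model rollouts. Below is a minimal sketch of that loss, assuming precomputed per-sequence log-probabilities; the function and argument names are illustrative and not taken from the post.

```python
# A minimal sketch of the DPO objective, assuming precomputed sequence
# log-probabilities under the trained policy and a frozen reference model.
# Names (policy_*_logps, ref_*_logps, beta) are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each tensor holds log pi(y | x) summed over response tokens, where
    "chosen" is the human-preferred completion and "rejected" the other.
    """
    # Implicit reward margins: how much more the policy favors each
    # completion than the reference model does.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * (chosen margin - rejected margin)), batch mean.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because the loss only needs log-probabilities of already-collected preference data, no sampling from the language model (and no separate reward model) is required during training, which is the simplification the post highlights.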