
🐾 Process-supervised RM Trainer #2127

Open
wants to merge 102 commits into main

Conversation

@gaetanlop (Contributor) commented Sep 26, 2024

What does this PR do?

Adding support for process-supervised reward training to TRL, as requested in #2110.

List of papers using PRMs: [1], [2], [3], [4]...

Fixes #2110

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

@lewtun @kashif

@gaetanlop marked this pull request as draft on September 26, 2024 03:15
@lewtun (Member) commented Sep 26, 2024

This is awesome @gaetanlop ! Would you like some early feedback on the PR or would you prefer I wait a bit until it's more polished?

@gaetanlop (Contributor Author)

Hey @lewtun, thank you for the message. Currently, the only files that are more or less ready are prm_trainer.py and prm_config.py. The rest are just placeholders that I haven’t had the opportunity to work on yet.

Implementing a PRM seems to be pretty straightforward: it is essentially a token classification task where only the prediction for the last token of each step gets assigned a label, and all other tokens are ignored during loss calculation.

If the dataset isn’t pre-tokenized, I assume it should contain the following columns:

  • prompt: Either a string or a list of past messages
  • steps: A list of strings
  • labels: A list of integers corresponding to the label associated with each step
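
To make the format concrete, here is a minimal sketch of what such a row and its token-level labels could look like. The example values, the helper name, and the use of -100 as the ignore index are my own assumptions for illustration, not the final API:

```python
# Hypothetical example row for a non-pre-tokenized PRM dataset (values are made up).
example = {
    "prompt": "Janet has 3 apples and buys 2 more. How many apples does she have?",
    "steps": ["She starts with 3 apples.", "3 + 2 = 6.", "So she has 6 apples."],
    "labels": [1, 0, 0],  # one correctness label per step; steps 2 and 3 are wrong
}

def build_token_labels(step_token_ids, step_labels, ignore_index=-100):
    """Assign a label only to the last token of each step; ignore every other token."""
    token_labels = []
    for tokens, label in zip(step_token_ids, step_labels):
        token_labels.extend([ignore_index] * (len(tokens) - 1) + [label])
    return token_labels

# e.g. with steps tokenized as [[11, 12, 13], [21, 22]] and labels [1, 0]:
# build_token_labels([[11, 12, 13], [21, 22]], [1, 0]) -> [-100, -100, 1, -100, 0]
```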

Are you aware of an HF dataset to train PRMs for the example file? Also, how can I add a new subset to the trl-internal-testing/zen dataset to support stepwise reward models for the unit test of the prm_trainer?

Thanks again for your time!

@gaetanlop marked this pull request as ready for review on September 28, 2024 18:34
@gaetanlop (Contributor Author) commented Sep 28, 2024

PR ready for review. I have changed the naming convention I used before (prm) to the one suggested in #2110 (stepwise).

Tests: I created a dummy_dataset, but we should add a subset to trl-internal-testing/zen as done in other scripts.
Example: The example script currently uses a placeholder for the dataset name since, to the best of my knowledge, TRL hasn't released a dataset for stepwise reasoning on the Hub. We should add this too.

@lewtun (Member) left a comment

Thank you for the very clean PR @gaetanlop - this looks great! I've left some minor suggestions regarding the structure. Aside from that, and a smallish dataset in the right format so we can sanity check that the accuracy goes up, the loss goes down, etc., I think this is quite close to being ready.

Files with review comments:
  • docs/source/_toctree.yml
  • docs/source/stepwise_reward_trainer.mdx
  • docs/source/dataset_formats.mdx
  • examples/scripts/stepwise_reward_modeling.py
  • trl/trainer/stepwise_reward_config.py
  • trl/trainer/stepwise_reward_trainer.py
@gaetanlop changed the title from "[DRAFT] Process-supervised RM Trainer" to "Process-supervised RM Trainer" on Oct 1, 2024
@gaetanlop (Contributor Author) commented Oct 1, 2024

Thanks for looking at this @lewtun. It seems trl-internal-testing/zen is the dataset you are using for testing. I have opened a PR against trl-lib/zen; should I also open one against trl-internal-testing/zen to add 19 samples of PRM800K for testing, or are you handling it on your side (it looks like they are both the same dataset)?

@qgallouedec changed the title from "Process-supervised RM Trainer" to "🐾 Process-supervised RM Trainer" on Nov 26, 2024
@gaetanlop (Contributor Author)

@qgallouedec yes of course, as you prefer. I was following the implementation done in the KTOTrainer. You can open a PR; otherwise I will make the changes later today.

@qgallouedec (Member) commented Nov 26, 2024

Should we add the separator token between the prompt and the first step? If you don't (as in the current code), you get something like:

```python
prompt = "This is my prompt."
completions = ["This is my first step.", "This is my second step."]
separator = "\n"

# Processing here

result == "This is my prompt.This is my first step. This is my second step."
#                           ^💀
```
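
A minimal sketch of the joining logic under discussion (my own illustration, not the PR's actual tokenize_row code), showing how inserting the separator before the first step as well avoids the issue:

```python
prompt = "This is my prompt."
completions = ["This is my first step.", "This is my second step."]
separator = "\n"

# Without a separator before the first step, the prompt and step run together:
without_sep = prompt + separator.join(completions)
# 'This is my prompt.This is my first step.\nThis is my second step.'

# Inserting the separator between the prompt and the first step as well:
with_sep = prompt + separator + separator.join(completions)
# 'This is my prompt.\nThis is my first step.\nThis is my second step.'
```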

@qgallouedec (Member) commented Nov 26, 2024

gaetanlop#1

~~I still need to add the collator and then it's ready.~~ No collator is needed, in fact.

@qgallouedec (Member)

First trained model: https://huggingface.co/qgallouedec/Qwen2-0.5B-Reward

[Screenshot, 2024-11-26 21:51]

@gaetanlop (Contributor Author) commented Nov 27, 2024

@qgallouedec Thank you for the refactoring work on the tokenize_row function. I have made some adjustments to ensure proper handling of special tokens. Also, I refined the label creation process and updated the tokenize_row function to support truncation based on both max_length and max_completion_ids. I have added some tests to confirm that the updated tokenize_row function behaves as intended.
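
For context, a rough sketch of the kind of truncation order being described. This is my own simplification, not the PR's tokenize_row implementation; the function name is a placeholder, and I use max_completion_length where the comment above says max_completion_ids:

```python
def truncate_example(prompt_ids, completion_ids, max_completion_length=None, max_length=None):
    """Truncate the completion tokens first, then the full sequence, if limits are set."""
    if max_completion_length is not None:
        completion_ids = completion_ids[:max_completion_length]
    input_ids = prompt_ids + completion_ids
    if max_length is not None:
        input_ids = input_ids[:max_length]
    return input_ids
```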

I also ran some experiments. The model gets 99.8% accuracy after just a few steps... It might just be predicting True all the time; I will need to double-check.


## Overview

Stepwise or process reward models were proposed in [Solving math word problems with processand outcome-based feedback](https://arxiv.org/pdf/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving and Irina Higgins.
@kashif (Collaborator) commented Nov 27, 2024

Suggested change:
- Stepwise or process reward models were proposed in [Solving math word problems with processand outcome-based feedback](https://arxiv.org/pdf/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving and Irina Higgins.
+ Stepwise or process reward models were proposed in [Solving math word problems with process- and outcome-based feedback](https://huggingface.co/papers/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins.

@qgallouedec (Member)

Thanks, good catch aa33e62! I don't know how the training code worked without it, maybe with the padding?

> to ensure proper handling of special tokens

Yes, that was my question with 228aa31. Do you have any example of a decoder model that has a bos_token?

@qgallouedec (Member)

With the fix it seems to overfit. Is it related to the redundancy in the data?

[Screenshot, 2024-11-27 12:55]
