feat: Self-Rewarding Algorithm with TRT Support #321

trias702 · 2024-09-26T22:14:33Z

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

Does the trainer resume and restore all model states?
Does the trainer support all parallelism techniques (PP, TP, DP)?
Does the trainer support max_steps=-1 and validation?
Does the trainer only call APIs defined in alignable_interface.py?
Does the trainer have proper logging?

Additional Information

Related to # (issue)

Signed-off-by: Gerald Shen <[email protected]>

jgerh · 2024-11-13T01:00:18Z

docs/user-guide/self_rewarding.rst

+- rejected_generated_rewards: as above but for rejected responses
+- rewards_chosen_mean: see below for a definition of what reward means in this context
+- rewards_rejected_mean: as above but for rejected responses
+- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)


Revise

bad_samples_per_GBS: The percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (this could be caused by parse errors, or all responses being judged with the same score, etc.).
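
For readers unfamiliar with this metric, here is a minimal sketch of how such a percentage could be computed. It is not the PR's implementation: the helper name bad_sample_pct, the use of None for an unparseable judge score, and the rule that a prompt needs at least two distinct parsed scores are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def bad_sample_pct(judge_scores: Sequence[Sequence[Optional[float]]]) -> float:
    """Percentage of prompts in one global batch that cannot be used for training.

    Each inner sequence holds the judge scores for one prompt's candidate
    responses; None marks a score that could not be parsed (assumption).
    """
    if not judge_scores:
        return 0.0
    bad = 0
    for scores in judge_scores:
        parsed = [s for s in scores if s is not None]
        # Assumption: a prompt is unusable when fewer than two scores parsed,
        # or when every parsed score is identical, because no chosen/rejected
        # pair can be formed in either case.
        if len(parsed) < 2 or len(set(parsed)) == 1:
            bad += 1
    return 100.0 * bad / len(judge_scores)
```

With four prompts where one has identical scores and one has a single parsed score, the sketch reports that half of the batch is excluded.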

jgerh · 2024-11-13T01:01:40Z

docs/user-guide/self_rewarding.rst

+- rewards_chosen_mean: see below for a definition of what reward means in this context
+- rewards_rejected_mean: as above but for rejected responses
+- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
+- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)


Revise

bad_ends_per_GBS: Only valid if using TRT. This tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).

jgerh · 2024-11-13T01:01:57Z

docs/user-guide/self_rewarding.rst

+- rewards_rejected_mean: as above but for rejected responses
+- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
+- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
+- preference_loss: the raw DPO variant loss


Revise

preference_loss: The raw DPO variant loss.

jgerh · 2024-11-13T01:02:24Z

docs/user-guide/self_rewarding.rst

+- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
+- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
+- preference_loss: the raw DPO variant loss
+- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here


Revise

sft_loss: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here.

jgerh · 2024-11-13T01:02:55Z

docs/user-guide/self_rewarding.rst

+- preference_loss: the raw DPO variant loss
+- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here
+
+The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.


Fix punctuation.

The reward, in this case, is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
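
The definition above can be sketched in a few lines. This is an illustration under assumptions rather than the trainer's code: the function name implicit_reward is hypothetical, and summing per-token log probs to get a sequence log prob is an assumption (an implementation could average instead).

```python
from typing import Sequence


def implicit_reward(
    policy_token_logprobs: Sequence[float],
    ref_token_logprobs: Sequence[float],
    beta: float,
) -> float:
    """Return beta * (log p_policy(response) - log p_ref(response))."""
    if len(policy_token_logprobs) != len(ref_token_logprobs):
        raise ValueError("both models must score the same response tokens")
    # Assumption: the sequence log prob is the sum of per-token log probs.
    policy_logp = sum(policy_token_logprobs)
    ref_logp = sum(ref_token_logprobs)
    # beta plays the role of the KL penalty (ref_policy_kl_penalty).
    return beta * (policy_logp - ref_logp)
```

A positive value means the trained model assigns the response a higher likelihood than the reference model does; comparing the value for chosen and rejected responses gives the rewards_chosen_mean and rewards_rejected_mean style of metric.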

jgerh · 2024-11-13T01:04:06Z

docs/user-guide/self_rewarding.rst

+All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
+You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.
+
+When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.


Fix capitalization, revise sentence.

When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data. Therefore, there is no one-size-fits-all parameter set that will work in all cases.

jgerh · 2024-11-13T01:07:00Z

docs/user-guide/self_rewarding.rst

+You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.
+
+When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
+Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.


Fix capitalization, revise.

Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.

jgerh · 2024-11-13T01:08:27Z

docs/user-guide/self_rewarding.rst

+
+When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
+Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
+Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:


Fix capitalization, revise sentence.

Below are some observations from the NVIDIA Alignment team regarding parameters that we have found to work well:

jgerh · 2024-11-13T01:09:51Z

docs/user-guide/self_rewarding.rst

+Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
+Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:
+
+* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets


Revise

global_batch_size: We recommend using 64, and increasing to 128 only for large models (70B+) that are also training with large datasets.

jgerh · 2024-11-13T01:14:13Z

docs/user-guide/self_rewarding.rst

+Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:
+
+* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
+* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases


Revise

iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.

jgerh · 2024-11-13T01:14:43Z

docs/user-guide/self_rewarding.rst

+
+* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
+* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
+* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.


Revise

learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. For a foundational model, we recommend a value between 3e-6 and 9e-7.

jgerh · 2024-11-13T01:15:18Z

docs/user-guide/self_rewarding.rst

+* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
+* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
+* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
+* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001


Revise

ref_policy_kl_penalty: We did not see large changes from perturbations to this value. We recommend 0.1 - 0.001.

jgerh · 2024-11-13T01:15:44Z

docs/user-guide/self_rewarding.rst

+* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
+* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
+* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
+* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]


Revise

length_control: This parameter depends very much on model size and data, but we found good results with [0,0,0.1].

jgerh · 2024-11-13T01:16:38Z

docs/user-guide/self_rewarding.rst

+* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
+* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
+* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
+* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results


Revise

use_meta_judge: We found stronger results when setting this parameter to true, which is in line with the paper's results.

jgerh · 2024-11-13T01:18:05Z

docs/user-guide/self_rewarding.rst

+* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
+* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
+* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
+* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)


Revise

meta_judge_pcnt: We recommend not setting this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).
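
The recommendations in the comments above can be collected into a single sanity check. The sketch below is only illustrative: the flat dictionary layout, the keys num_iterations, epochs_per_iteration, and learning_rate, and the helper check_suggestions are assumptions, and the real config file may nest these parameters differently.

```python
from typing import Any, Dict, List

# Starting values taken from the recommendations quoted in this review.
suggested: Dict[str, Any] = {
    "global_batch_size": 64,        # 128 only for 70B+ models with large datasets
    "num_iterations": 3,            # hypothetical key name
    "epochs_per_iteration": 1,      # hypothetical key name
    "learning_rate": 1e-7,          # hypothetical key name; SFT/aligned range is 1e-7 to 3e-7
    "ref_policy_kl_penalty": 0.1,   # recommended range is 0.001 to 0.1
    "length_control": [0, 0, 0.1],
    "use_meta_judge": True,
    "meta_judge_pcnt": 0.15,        # recommended upper limit
}


def check_suggestions(cfg: Dict[str, Any]) -> List[str]:
    """Return a warning for each value outside the ranges suggested above."""
    warnings: List[str] = []
    if cfg["meta_judge_pcnt"] > 0.15:
        warnings.append("meta_judge_pcnt > 0.15: judge may give every response the same score")
    if not 0.001 <= cfg["ref_policy_kl_penalty"] <= 0.1:
        warnings.append("ref_policy_kl_penalty outside 0.001 to 0.1")
    if cfg["global_batch_size"] > 128:
        warnings.append("global_batch_size above 128")
    return warnings
```

Calling check_suggestions(suggested) returns an empty list, and raising meta_judge_pcnt above 0.15 produces a single warning.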

jgerh

I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."

odelalleau

Still WIP but submitting first batch of comments

CHANGELOG.md

docs/user-guide/self_rewarding.rst

odelalleau · 2024-11-16T16:32:41Z

examples/nlp/gpt/conf/gpt_generation.yaml

Is this file needed for Self-Rewarding? If not, let's move it to a different PR.

trias702 · 2024-11-18

It's needed if you want to follow the Self-Rewarding paper exactly to generate the EFT dataset.

odelalleau · 2024-11-21

I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At a quick glance, I'm not seeing it referenced in the self-rewarding doc. Could you add it there to explain how to generate an EFT dataset?

examples/nlp/gpt/conf/gpt_self_rewarding.yaml

odelalleau

Just a couple of minor typos

examples/nlp/gpt/conf/gpt_self_rewarding.yaml

jgerh · 2024-11-26T17:13:18Z

I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."

gshennvm added 30 commits March 21, 2024 16:59

add critic logging

e40ebd6

Signed-off-by: Gerald Shen <[email protected]>

add

72ba6c6

Signed-off-by: Gerald Shen <[email protected]>

cleanup

bfb61e4

Signed-off-by: Gerald Shen <[email protected]>

update

21206c5

Signed-off-by: Gerald Shen <[email protected]>

fix

148acf4

Signed-off-by: Gerald Shen <[email protected]>

fix bug

d7c9990

Signed-off-by: Gerald Shen <[email protected]>

fix bug

537d6e5

Signed-off-by: Gerald Shen <[email protected]>

test

47400ba

Signed-off-by: Gerald Shen <[email protected]>

fix bug

d7b2b23

Signed-off-by: Gerald Shen <[email protected]>

fix

ce76226

Signed-off-by: Gerald Shen <[email protected]>

add

8edf534

Signed-off-by: Gerald Shen <[email protected]>

fix

6379a2e

Signed-off-by: Gerald Shen <[email protected]>

fix again

eadae31

Signed-off-by: Gerald Shen <[email protected]>

fix

e2b97d9

Signed-off-by: Gerald Shen <[email protected]>

fix mean

d9bdf7c

Signed-off-by: Gerald Shen <[email protected]>

fix

1c7d215

Signed-off-by: Gerald Shen <[email protected]>

add debug

3638301

Signed-off-by: Gerald Shen <[email protected]>

fix

4cca85f

Signed-off-by: Gerald Shen <[email protected]>

add data iter for VP

1b19bdd

Signed-off-by: Gerald Shen <[email protected]>

move

3f045ae

Signed-off-by: Gerald Shen <[email protected]>

fixing

3c9fe3d

Signed-off-by: Gerald Shen <[email protected]>

add

f36f394

Signed-off-by: Gerald Shen <[email protected]>

chunking needs to be moved out

5211bc2

Signed-off-by: Gerald Shen <[email protected]>

fix

0f59edf

Signed-off-by: Gerald Shen <[email protected]>

fix metrics

c3fe2f7

Signed-off-by: Gerald Shen <[email protected]>

fix dtype

5d3e07d

Signed-off-by: Gerald Shen <[email protected]>

merge

15887e5

Signed-off-by: Gerald Shen <[email protected]>

fix

2ad76ba

Signed-off-by: Gerald Shen <[email protected]>

make the global id management into a class

9d9a6b6

Signed-off-by: Gerald Shen <[email protected]>

fix

d6fb55d

Signed-off-by: Gerald Shen <[email protected]>

jgerh reviewed Nov 13, 2024

odelalleau reviewed Nov 16, 2024

trias702 and others added 6 commits November 18, 2024 14:56

Made config yaml fixes in response to initial comments

780e8ab

Signed-off-by: Daniel Egert <[email protected]>

Updated to main branch

cc487fb

Signed-off-by: Daniel Egert <[email protected]>

Removed generation_batch_size param from TRT

83e830a

Signed-off-by: Daniel Egert <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5b7aae3

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <[email protected]>

Minor fixes for new TRT api

a1f9620

Signed-off-by: Daniel Egert <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

01aced0

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <[email protected]>

odelalleau reviewed Nov 21, 2024

trias702 added 5 commits November 20, 2024 22:35

SPIN bug fixes and migrated generation to work with TRT v13

224de3d

Signed-off-by: Daniel Egert <[email protected]>

Merge branch 'degert/self-rewarding-trt' of https://github.com/NVIDIA/NeMo-Aligner into degert/self-rewarding-trt

82ff16d

Changes to self_rewarding.yaml in response to review comments

c608520

Signed-off-by: Daniel Egert <[email protected]>

Added Torch Dynamo logic to self-rewarding

e4d36b6

Signed-off-by: Daniel Egert <[email protected]>

Fixed minor issue with TRT v13 compatibility

34e4994

Signed-off-by: Daniel Egert <[email protected]>
