
feat: Self-Rewarding Algorithm with TRT Support #321

Open

wants to merge 282 commits into base: main
Conversation

trias702
Collaborator

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre-checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

Signed-off-by: Gerald Shen <[email protected]>
- rejected_generated_rewards: as above but for rejected responses
- rewards_chosen_mean: see below for a definition of what reward means in this context
- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
Collaborator

@jgerh jgerh Nov 13, 2024


Revise

bad_samples_per_GBS: The percentage of samples in a GBS that are excluded from training because of bad output from the LLM-as-a-judge (this could be caused by parse errors, or by all responses being judged with the same score, etc.).

- rewards_chosen_mean: see below for a definition of what reward means in this context
- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
Collaborator

@jgerh jgerh Nov 13, 2024


Revise

bad_ends_per_GBS: Only valid if using TRT. This tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).

- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
- preference_loss: the raw DPO variant loss
Collaborator

@jgerh jgerh Nov 13, 2024


Revise

preference_loss: The raw DPO variant loss.

- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here
Collaborator

@jgerh jgerh Nov 13, 2024


Revise

sft_loss: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here.
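
As a quick aside, here is a minimal sketch of how such an optional SFT term could be combined with the preference loss; the function and the `sft_loss_weight` knob are illustrative placeholders, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def combined_loss(preference_loss: torch.Tensor,
                  chosen_logits: torch.Tensor,
                  chosen_labels: torch.Tensor,
                  sft_loss_weight: float = 0.0) -> torch.Tensor:
    # Categorical cross-entropy over the tokens of the chosen response.
    sft_loss = F.cross_entropy(
        chosen_logits.view(-1, chosen_logits.size(-1)),
        chosen_labels.view(-1),
    )
    # `sft_loss_weight` is a hypothetical knob; 0.0 disables the SFT term.
    return preference_loss + sft_loss_weight * sft_loss
```

Both terms would then be logged separately, which is what the preference_loss and sft_loss metrics report.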

- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here

The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
Collaborator


Fix punctuation.

The reward, in this case, is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
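
A minimal sketch of that computation, assuming per-response log probabilities already summed over tokens (names are illustrative, not the repository's actual functions):

```python
import torch

def implicit_reward(policy_logprobs: torch.Tensor,
                    ref_logprobs: torch.Tensor,
                    kl_penalty: float) -> torch.Tensor:
    # policy_logprobs / ref_logprobs: log prob of each response under the
    # trained policy and the frozen reference model, summed over tokens.
    # kl_penalty corresponds to beta in the original paper.
    return kl_penalty * (policy_logprobs - ref_logprobs)

# rewards_chosen_mean / rewards_rejected_mean are then the batch means of
# this quantity for the chosen and rejected responses, respectively.
```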

All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Collaborator

@jgerh jgerh Nov 13, 2024


Fix capitalization, revise sentence.

When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data. Therefore, there is no one-size-fits-all parameter set that will work in all cases.

You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Collaborator


Fix capitalization, revise.

Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.


When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:
Collaborator


Fix capitalization, revise sentence.

Below are some observations from the NVIDIA Alignment team regarding parameters that we have found to work well:

Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:

* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
Collaborator

@jgerh jgerh Nov 13, 2024


Revise

global_batch_size: We recommend using 64, and increasing to 128 only for large models (70B+) that are also training with large datasets.

Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:

* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
Collaborator


Revise

iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.


* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
Collaborator


Revise

learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7 is recommended.

* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
Collaborator


Revise

ref_policy_kl_penalty: We did not see large changes from perturbations to this value. We recommend 0.1 - 0.001.

* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
Collaborator


Revise

length_control: This parameter depends very much on model size and data, but we found good results with [0,0,0.1].

* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
Collaborator


Revise

use_meta_judge: We found stronger results when setting this parameter to true, which is in line with the paper's results.

* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)
Collaborator


Revise

meta_judge_pcnt: We recommend not setting this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).
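
Pulling the recommendations above together, here is a hypothetical set of overrides; the key names and nesting are placeholders and may not match the actual schema of gpt_self_rewarding.yaml:

```python
from omegaconf import OmegaConf

# Illustrative starting point only; adjust to the real config structure.
recommended_overrides = OmegaConf.create({
    "model": {
        "global_batch_size": 64,          # go to 128 only for 70B+ models with large datasets
        "optim": {"lr": 3e-7},            # 3e-7 to 1e-7 for SFT/aligned models; 3e-6 to 9e-7 for base models
        "self_rewarding": {               # hypothetical section name
            "num_iterations": 3,          # 1 epoch per iteration, as in the paper
            "ref_policy_kl_penalty": 0.1, # results were similar anywhere in 0.1 - 0.001
            "length_control": [0, 0, 0.1],
            "use_meta_judge": True,
            "meta_judge_pcnt": 0.15,      # keep at or below 0.15 (15%)
        },
    },
})
```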

Collaborator

@jgerh jgerh left a comment


I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."

Collaborator

@odelalleau odelalleau left a comment


Still WIP but submitting first batch of comments

CHANGELOG.md
docs/user-guide/self_rewarding.rst
Collaborator


Is this file needed for Self-Rewarding? If not, let's move it to a different PR.

Collaborator Author


It's needed if you want to follow the self-rewarding paper exactly to generate the EFT dataset.

Collaborator


I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At a quick glance, I'm not seeing it referenced in the self-rewarding doc => could you add it to explain how to generate an EFT dataset?

examples/nlp/gpt/conf/gpt_self_rewarding.yaml
Collaborator

@odelalleau odelalleau left a comment


Just a couple of minor typos

examples/nlp/gpt/conf/gpt_self_rewarding.yaml

Labels
Algorithms, CI, documentation, Run CICD, Utils
7 participants