[CGPO] Calibrated reward #2155
Conversation
The way I understood the calibrated reward is that the scores from a reward model might not be comparable across different completions given a prompt, so each generated completion's score is compared against that of a baseline or ground-truth completion. My implementation is:

```python
def _compute_calib_rewards(self, completions, prompts, ground_truth_completions):
    context_length = prompts["input_ids"].shape[1]
    with torch.no_grad():
        _, generated_scores, _ = get_reward(
            self.reward_model, completions["input_ids"], self.tokenizer.pad_token_id, context_length
        )
        # Compute scores for ground-truth completions
        ground_truth_input_ids = torch.cat([prompts["input_ids"], ground_truth_completions["input_ids"]], dim=1)
        _, ground_truth_scores, _ = get_reward(
            self.reward_model, ground_truth_input_ids, self.tokenizer.pad_token_id, context_length
        )

    if self.args.missing_eos_penalty is not None:
        completion_contain_eos = torch.any(completions["input_ids"] == self.tokenizer.eos_token_id, dim=-1)
        generated_scores[~completion_contain_eos] -= self.args.missing_eos_penalty
        ground_truth_contain_eos = torch.any(
            ground_truth_completions["input_ids"] == self.tokenizer.eos_token_id, dim=-1
        )
        ground_truth_scores[~ground_truth_contain_eos] -= self.args.missing_eos_penalty

    return F.sigmoid(generated_scores - ground_truth_scores)
```
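For reference, if I read Eq. 5 of the paper correctly, this is exactly what the snippet above computes, i.e. the generated completion's score squashed against the ground-truth completion's score:

```math
R_{\text{calib}}(x, y) = \sigma\big(r(x, y) - r(x, y_{\text{gt}})\big)
```

where $r$ is the raw reward-model score and $y_{\text{gt}}$ is the ground-truth completion for prompt $x$.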
Thanks for looking at it @kashif. Your code and mine are exactly the same except for the […]. Also, I am computing the reward for both the generated and the ground-truth completions in a single forward pass. Example using your naming conventions, assuming both batches are padded to the same sequence length:

```python
batch_size = query_responses.shape[0]
concatenated_responses = torch.cat(
    (query_responses, baseline_responses),
    dim=0,
)
reward_logits, final_rewards, sequence_lengths = get_reward(
    model, concatenated_responses, pad_token_id, context_length
)
generated_scores, ground_truth_scores = final_rewards.split(batch_size, dim=0)
final_rewards = F.sigmoid(generated_scores - ground_truth_scores)
```

For the returns, I am also returning all the […]. My implementation lacked a sigmoid function for the `reward_logits`, though. Am I correct?
Ah right, right! You are right!
So the reason I have the stuff split is because of padding... when I join the two different completions I have to pad them together, while it's slightly easier to pad each completion separately... and then I was scared that by concatenating, the memory needs might be too much for largish reward models... but yes, makes sense.
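For what it's worth, a hypothetical sketch (not code from this PR) of padding the two completion batches to a common length before concatenating them along the batch dimension could look like the following; `pad_and_concat` and its arguments are illustrative names:

```python
import torch
import torch.nn.functional as F


def pad_and_concat(query_responses: torch.Tensor, baseline_responses: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Right-pad both batches to the same sequence length and stack them along the batch dimension."""
    max_len = max(query_responses.shape[1], baseline_responses.shape[1])
    query_responses = F.pad(query_responses, (0, max_len - query_responses.shape[1]), value=pad_token_id)
    baseline_responses = F.pad(baseline_responses, (0, max_len - baseline_responses.shape[1]), value=pad_token_id)
    # The concatenated batch is twice as large, hence the memory concern for big reward models.
    return torch.cat((query_responses, baseline_responses), dim=0)
```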
Yes @kashif, 100% agree, there are pros and cons to both methods. It also depends on the distributed training strategy you are using to train the model. In any case, I checked how it is done in `trl/trainer/online_dpo_trainer.py` (lines 416 to 427 at 78249d9), and I think we should keep it this way. Wdyt?
Closing in favor of #2190
What does this PR do?
Adds a `get_calibrated_reward` function as introduced in the CGPO paper from Meta. Please refer to Equation 5 in Section 4.1.1 for more information (https://arxiv.org/pdf/2409.20370). This PR should be part of a set of PRs to incorporate CGPO in TRL.
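As a rough sketch, pieced together from the snippets in the discussion above and assuming the `get_reward` helper from `trl.trainer.utils`, the function could look like this (the name and signature are illustrative, not the final API):

```python
import torch
from trl.trainer.utils import get_reward


def get_calibrated_reward(reward_model, query_responses, baseline_responses, pad_token_id, context_length):
    """Score generated and ground-truth sequences with the same reward model and
    return sigmoid(r_generated - r_ground_truth), per Eq. 5 of the CGPO paper."""
    with torch.no_grad():
        _, generated_scores, _ = get_reward(reward_model, query_responses, pad_token_id, context_length)
        _, baseline_scores, _ = get_reward(reward_model, baseline_responses, pad_token_id, context_length)
    return torch.sigmoid(generated_scores - baseline_scores)
```

The missing-EOS penalty and the batching strategy (two forward passes vs. one concatenated pass) are the open design points discussed in the conversation above.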
Fixes #2156
Who can review?
@kashif @lewtun