🚀 The feature, motivation, and pitch
https://github.com/ContextualAI/HALOs/tree/main
KTO (Kahneman-Tversky Optimization) is a new human alignment method. Unlike DPO and PPO, it does not depend on pairwise comparison data but works on pointwise evaluation data, i.e., each completion only needs a binary desirable/undesirable label. According to the paper, KTO outperforms DPO and PPO on a few public benchmarks.
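To make the data requirement concrete, here is an illustrative comparison of the two formats; the field names below are hypothetical and only meant to show the difference, not a fixed schema:

```python
# Pairwise preference data (what DPO trains on): each prompt is paired with a
# preferred ("chosen") and a dispreferred ("rejected") completion.
dpo_example = {
    "prompt": "Explain KTO in one sentence.",
    "chosen": "KTO aligns a model from per-example desirable/undesirable labels.",
    "rejected": "KTO is a kind of tokenizer.",
}

# Pointwise data (what KTO can train on): each completion carries only a binary
# desirability label; no paired alternative for the same prompt is required.
kto_examples = [
    {"prompt": "Explain KTO in one sentence.",
     "completion": "KTO aligns a model from per-example desirable/undesirable labels.",
     "label": True},
    {"prompt": "Summarize this issue.",
     "completion": "Irrelevant rambling.",
     "label": False},
]
```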
TRL has also incorporated the KTO loss into its DPO trainer, and there is an open PR to add a dedicated KTOTrainer for end-to-end KTO training: huggingface/trl#1181
Given its promise, I would suggest supporting it in this repo.
If nobody is working on this yet, I would like to take it on.
If there are any concerns, please let me know.
The KTO loss can be roughly sketched as:
```python
class SimpleKTOTrainer(UnpairedPreferenceTrainer):
    """A simple version of KTO meant to introduce you to the HALOs repo."""

    def loss(self,
             policy_chosen_logps: torch.FloatTensor,
             policy_rejected_logps: torch.FloatTensor,
             reference_chosen_logps: torch.FloatTensor,
             reference_rejected_logps: torch.FloatTensor) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
        """Compute the Kahneman-Tversky loss for a batch of policy and reference model log probabilities.

        For each batch of n/2 chosen examples and n/2 rejected examples (belonging to n different inputs),
        calculate the loss as follows.

        If generation y ~ p_chosen, where x' are the examples with rejected generations, we have the 'chosen' loss:
            L(x, y) := 1 - sigmoid(beta * (log p_policy(y|x) - log p_reference(y|x) - KL(p_policy(y_rejected|x') || p_reference(y_rejected|x'))))
        If generation y ~ p_rejected, where x' are the examples with chosen generations, we have the 'rejected' loss:
            L(x, y) := 1 - sigmoid(beta * (KL(p_policy(y_chosen|x') || p_reference(y_chosen|x')) - [log p_policy(y|x) - log p_reference(y|x)]))
        """
        # (loss computation elided in this sketch; see the standalone example below)
        return losses, chosen_rewards, rejected_rewards
```
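For reference, here is a minimal, self-contained sketch of what that loss body could look like, written directly from the two equations in the docstring. It assumes the KL reference terms are estimated from the batch as the mean policy/reference log-ratio clamped at zero (a simple batch estimate; the HALOs repo and the paper are the source of truth), and the `kto_loss` name and `beta` default are illustrative, not an existing API:

```python
from typing import Tuple

import torch


def kto_loss(
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
    reference_chosen_logps: torch.FloatTensor,
    reference_rejected_logps: torch.FloatTensor,
    beta: float = 0.1,
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """Sketch of the KTO loss following the docstring equations above."""
    # Per-example log-ratios of policy vs. reference.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # Batch estimates of the KL reference points (mean log-ratio, clamped at 0),
    # detached so gradients flow only through the per-example log-ratios.
    chosen_kl = chosen_logratios.mean().clamp(min=0).detach()
    rejected_kl = rejected_logratios.mean().clamp(min=0).detach()

    # 'Chosen' loss: 1 - sigmoid(beta * (chosen log-ratio - rejected KL))
    chosen_losses = 1 - torch.sigmoid(beta * (chosen_logratios - rejected_kl))
    # 'Rejected' loss: 1 - sigmoid(beta * (chosen KL - rejected log-ratio))
    rejected_losses = 1 - torch.sigmoid(beta * (chosen_kl - rejected_logratios))
    losses = torch.cat((chosen_losses, rejected_losses), dim=0)

    # Implicit rewards, analogous to DPO's beta * log-ratio.
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses, chosen_rewards, rejected_rewards
```

Called with dummy log-probability tensors of shape (n/2,), this returns the concatenated per-example losses plus the implicit chosen/rejected rewards, i.e., the same interface as the sketch above.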
Alternatives
No response
Additional context
No response