arXiv: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
RLHF vs DPO
Online DPO trainer (TRL docs): https://huggingface.co/docs/trl/main/en/online_dpo_trainer
DPO trainer (TRL docs): https://huggingface.co/docs/trl/en/dpo_trainer
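The core idea behind DPO, as given in the paper linked above, is that the RLHF objective can be optimized directly with a classification-style loss over preference pairs, with no separate reward model or PPO loop. A minimal sketch of the per-example DPO loss follows; the function name and arguments are illustrative, not TRL's API, and each argument is assumed to be the summed token log-probability of a full response under the policy or the frozen reference model:

```python
import math

def dpo_loss(pi_logp_chosen: float, pi_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    beta controls how far the policy may drift from the reference model;
    each *_logp argument is log p(response | prompt) under that model.
    """
    chosen_margin = pi_logp_chosen - ref_logp_chosen      # log-ratio for preferred response
    rejected_margin = pi_logp_rejected - ref_logp_rejected  # log-ratio for dispreferred response
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps it numerically stable
    return math.log1p(math.exp(-logits))
```

When the policy equals the reference model the margins cancel and the loss is log 2; training drives the chosen margin above the rejected one, pushing the loss toward zero. TRL's `DPOTrainer` (second link above) implements this loss in batched form over tokenized preference datasets.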