Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover, in Data-centric Machine Learning Research (DMLR) Workshop at The International Conference on Machine Learning (ICML), 2024.
Download the full text
Abstract
Human-preference alignment typically relies on pairwise comparisons of generations given a fixed prompt. The authors propose Joint Preference Optimization (JPO), which instead collects preferences over whole instruction–response pairs and optimizes the joint probability of a chosen pair over a rejected one. Training LLMs with JPO yields win-rate gains of 5.2% on summarization and 3.3% on open-ended dialogue versus the popular DPO baseline, showing that joint preferences capture richer alignment signals.
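To make the idea concrete, below is a minimal sketch of a DPO-style loss computed over joint log-probabilities of whole instruction–response pairs, as the abstract describes. The exact objective used in the paper, the function name, and the beta value are assumptions for illustration, not taken from the source.

import torch
import torch.nn.functional as F

def joint_preference_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(chosen instruction-response pair), summed over tokens
    policy_logp_rejected: torch.Tensor,  # log pi_theta(rejected instruction-response pair)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(chosen pair) under the frozen reference model
    ref_logp_rejected: torch.Tensor,     # log pi_ref(rejected pair)
    beta: float = 0.1,                   # assumed KL-regularization strength, as in DPO
) -> torch.Tensor:
    """DPO-like objective where each side of the comparison is a whole
    instruction-response pair rather than two responses to one prompt."""
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the margin of the chosen pair's implicit reward over the rejected pair's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Usage with dummy sequence log-probabilities for a batch of two comparisons.
loss = joint_preference_loss(
    torch.tensor([-12.3, -20.1]),
    torch.tensor([-15.7, -22.4]),
    torch.tensor([-13.0, -20.5]),
    torch.tensor([-14.9, -21.8]),
)
print(loss)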
Bib Entry
@inproceedings{bansal2024alignment,
  author    = {Bansal, Hritik and Suvarna, Ashima and Bhatt, Gantavya and Peng, Nanyun and Chang, Kai-Wei and Grover, Aditya},
  title     = {Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization},
  booktitle = {Data-centric Machine Learning Research (DMLR) Workshop at The International Conference on Machine Learning (ICML)},
  year      = {2024}
}