Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover, in Data-centric Machine Learning Research (DMLR) Workshop at The International Conference on Machine Learning (ICML), 2024.
Download the full text
Abstract
Human-preference alignment typically relies on pairwise comparisons of generations given a fixed prompt. The authors propose Joint Preference Optimization (JPO), which instead collects preferences over whole instruction–response pairs and optimizes the joint probability of a chosen pair over a rejected one. Training LLMs with JPO yields win-rate gains of 5.2% on summarization and 3.3% on open-ended dialogue versus the popular DPO baseline, showing that joint preferences capture richer alignment signals.
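To make the idea concrete, below is a minimal sketch of a DPO-style loss computed over joint log-probabilities of whole instruction–response pairs, as the abstract describes. The exact objective used in the paper, the function name, and the beta value are assumptions for illustration, not taken from the source.

import torch
import torch.nn.functional as F

def joint_preference_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(chosen instruction-response pair), summed over tokens
    policy_logp_rejected: torch.Tensor,  # log pi_theta(rejected instruction-response pair)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(chosen pair) under the frozen reference model
    ref_logp_rejected: torch.Tensor,     # log pi_ref(rejected pair)
    beta: float = 0.1,                   # assumed KL-regularization strength, as in DPO
) -> torch.Tensor:
    """DPO-like objective where each side of the comparison is a whole
    instruction-response pair rather than two responses to one prompt."""
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the margin of the chosen pair's implicit reward over the rejected pair's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Usage with dummy sequence log-probabilities for a batch of two comparisons.
loss = joint_preference_loss(
    torch.tensor([-12.3, -20.1]),
    torch.tensor([-15.7, -22.4]),
    torch.tensor([-13.0, -20.5]),
    torch.tensor([-14.9, -21.8]),
)
print(loss)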
Bib Entry
@inproceedings{bansal2024alignment,
  author    = {Bansal, Hritik and Suvarna, Ashima and Bhatt, Gantavya and Peng, Nanyun and Chang, Kai-Wei and Grover, Aditya},
  title     = {Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization},
  booktitle = {Data-centric Machine Learning Research (DMLR) Workshop at The International Conference on Machine Learning (ICML)},
  year      = {2024}
}