MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025.

Download the full text


Abstract

Existing multimodal retrieval benchmarks mainly test whether models can exploit textual knowledge. Yet many real-world scenarios benefit more from retrieving visual information. We introduce MRAG-Bench, a retrieval-augmented generation benchmark covering 9 scenarios where images are superior to text. It contains 16,130 images and 1,353 multiple-choice questions. We evaluate 10 open-source and 4 proprietary LVLMs and find that every model gains more from image retrieval than text retrieval, confirming MRAG-Bench's vision-centric nature. Even GPT-4o realizes only a 5.82% boost with ground-truth images, versus 33.16% for humans, underscoring ample headroom for improving visual retrieval-augmented reasoning.


Bib Entry

@inproceedings{hu2025mrag,
  author = {Hu, Wenbo and Gu, Jia{-}Chen and Dou, Zi{-}Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai{-}Wei and Peng, Nanyun},
  title = {MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  booktitle = {Proceedings of the Thirteenth International Conference on Learning Representations (ICLR)},
  year = {2025}
}

Related Publications