Improving Faithfulness of Text-to-Image Diffusion Models through Inference Intervention

Danfeng Guo, Sanchit Agarwal, Yu-Hsiang Lin, Jiun-Yu Kao, Tagyoung Chung, Nanyun Peng, and Mohit Bansal, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

Abstract

Text-to-image diffusion models excel at producing high-quality imagery yet often fail to adhere to details in the text prompt. Existing fixes either fine-tune the backbone or apply gradient-based edits during inference—both costly and usually limited to narrow error types (e.g., object count). We propose an intervention-and-correction pipeline that controls the denoising process without back-propagation. The model detects missing or incorrect objects mid-generation, constructs feedback layouts (optionally augmented via retrieval), rewinds to an earlier denoising step, and fuses corrected latents with the original ones. On VPEval and HRS-Bench, our method boosts faithfulness across object presence, count, scale, and spatial-relation metrics, outperforming the state-of-the-art GLIGEN by +6.7% average accuracy.
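The control flow described in the abstract—detect errors mid-generation, build a feedback layout, rewind to an earlier denoising step, re-denoise with layout guidance, and fuse the corrected latents with the original ones—can be sketched as follows. This is a toy illustration of the loop structure only: `denoise_step`, `detect_errors`, the latent shapes, and the fusion weight `alpha` are all hypothetical stand-ins, not the paper's actual interfaces or hyperparameters.

```python
import numpy as np

# Toy stand-ins for the diffusion backbone and the mid-generation
# detector; everything below is a hypothetical sketch of the control
# flow described in the abstract, not the paper's real method.

def denoise_step(latents, t, layout=None):
    """One toy 'denoising' update; layout guidance nudges the latents."""
    guidance = 0.1 * layout if layout is not None else 0.0
    return 0.95 * latents + guidance

def detect_errors(latents):
    """Pretend detector: always flags one missing object for illustration."""
    return ["missing object: dog"]

def generate_with_intervention(total_steps=50, check_step=25,
                               rewind_to=35, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((4, 8, 8))  # toy latent tensor
    history = {}
    for t in range(total_steps, 0, -1):
        latents = denoise_step(latents, t)
        history[t] = latents  # keep intermediates so we can rewind
        if t == check_step and detect_errors(latents):
            # Build a feedback layout (here: a fixed mask) and rewind to
            # an earlier step, re-denoising with layout guidance.
            layout = np.ones_like(latents)
            corrected = history[rewind_to]
            for s in range(rewind_to, t, -1):
                corrected = denoise_step(corrected, s, layout=layout)
            # Fuse corrected latents with the original ones; note that no
            # back-propagation is involved anywhere in this loop.
            latents = alpha * corrected + (1 - alpha) * latents
    return latents

out = generate_with_intervention()
print(out.shape)  # (4, 8, 8)
```

The key design point the abstract emphasizes is that the correction is purely a forward-pass intervention: rewinding reuses cached latents and the fusion is a simple blend, so no gradients through the backbone are required.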


Bib Entry

@inproceedings{guo2025faithfulness,
  author = {Guo, Danfeng and Agarwal, Sanchit and Lin, Yu-Hsiang and Kao, Jiun-Yu and Chung, Tagyoung and Peng, Nanyun and Bansal, Mohit},
  title = {Improving Faithfulness of Text-to-Image Diffusion Models through Inference Intervention},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year = {2025}
}

Related Publications