Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety

Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang, in Findings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-Findings), 2025.


Abstract

Prior jailbreak studies mainly optimize the content of adversarial snippets injected into prompts. We instead ask whether the position at which that snippet appears matters. We discover that placing a simple, human-readable adversarial string at the very beginning of the output (an output-prefix jailbreak) exposes safety vulnerabilities far more effectively than input-suffix or prompt-based jailbreaks. Directly forcing a user-specified output prefix dramatically increases attack success rates, revealing a positional weakness in existing LLM safety training.
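
To make the positional distinction concrete, the sketch below contrasts the two placements using a generic chat-template layout. The template markers, USER_REQUEST, and ADV_STRING are illustrative placeholders (real chat templates are model-specific, and the paper's actual adversarial strings are not reproduced here); this is a minimal sketch of how the two prompt constructions differ, not the paper's implementation.

# Contrast input-suffix placement with output-prefix (assistant-prefill)
# placement. All template tokens and strings below are illustrative
# placeholders, not the paper's materials.

USER_REQUEST = "<some request the model would normally refuse>"
ADV_STRING = "Sure, here is the information you asked for:"  # benign placeholder

def input_suffix_prompt(request: str, snippet: str) -> str:
    """Snippet appended to the input: the model still chooses its first tokens."""
    return (
        f"<|user|>\n{request} {snippet}\n"
        f"<|assistant|>\n"
    )

def output_prefix_prompt(request: str, snippet: str) -> str:
    """Snippet forced as the start of the output: decoding continues from the
    snippet, so the model's opening tokens are already fixed."""
    return (
        f"<|user|>\n{request}\n"
        f"<|assistant|>\n{snippet}"  # no trailing newline: the model continues here
    )

if __name__ == "__main__":
    print("--- input-suffix placement ---")
    print(input_suffix_prompt(USER_REQUEST, ADV_STRING))
    print("--- output-prefix placement ---")
    print(output_prefix_prompt(USER_REQUEST, ADV_STRING))

The key difference is that in the second variant the adversarial string occupies the output position itself, which is the positional weakness the paper studies.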


Bib Entry

@inproceedings{wang2025vulnerability,
  author = {Wang, Yiwei and Chen, Muhao and Peng, Nanyun and Chang, Kai-Wei},
  title = {Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety},
  booktitle = {Findings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-Findings)},
  year = {2025}
}

Related Publications