Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?

Xuan He, Da Yin, and Nanyun Peng, in Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025.


Abstract

We study how weak teachers—average human annotators or existing AI systems—can best supervise LLMs on hard reasoning tasks. Two strategies arise: (i) lower-quality supervision on tasks matching the target difficulty, and (ii) higher-quality supervision on easier subtasks. Surprisingly, even with outcome error rates as high as 90%, hard-task supervision can beat perfectly correct subtask supervision on multiple math benchmarks. A key driver is step-wise error rate: lowering step errors at equal outcome errors yields up to a 30% accuracy swing on MATH. Mixing hard-task and subtask data further boosts performance, suggesting promising data-augmentation directions.


Bib Entry

@inproceedings{he2025guiding,
  author = {He, Xuan and Yin, Da and Peng, Nanyun},
  title = {Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?},
  booktitle = {Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year = {2025}
}