Explaining Mixtures of Sources in News Articles
Alexander Spangher, James Youn, Matt DeButts, Nanyun Peng, and Jonathan May, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
Abstract
Human writers plan, then write. For large language models (LLMs) to play a role in longer-form article generation, we must understand the planning steps humans make before writing. We explore one kind of planning, source selection in news, as a case study for evaluating plans in long-form generation. We ask: why do specific stories call for specific kinds of sources? We imagine a process where sources are selected to fall into different categories. Learning the article’s plan means predicting the categorization scheme chosen by the journalist. Inspired by latent-variable modeling, we first develop metrics to select the most likely plan underlying a story. Then, working with professional journalists, we adapt five existing approaches to planning and introduce three new ones. We find that two approaches, or schemas, stance and social affiliation, best explain source plans in most documents. However, other schemas, like textual entailment, explain source plans in factually rich topics like "Science". Finally, we find we can predict the most suitable schema given just the article’s headline with reasonable accuracy. We see this as an important case study for human planning, one that provides a framework and approach for evaluating other kinds of plans, like discourse or plot-oriented plans. We release a corpus, NewsSources, with schema annotations for 4M articles, for further study.
Bib Entry
@inproceedings{spangher2024source_explaining,
  author = {Spangher, Alexander and Youn, James and DeButts, Matt and Peng, Nanyun and May, Jonathan},
  title = {Explaining Mixtures of Sources in News Articles},
  booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
  year = {2024}
}