Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

Sarik Ghazarian, Zixi Liu, Akash S. M, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

Abstract

With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck in obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulating plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintaining the naturalness of the generation. To improve the quality of incoherent stories, we further apply the adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.

In our first paper in the title "Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation", we tried to achieve a more accurate story plausibility evaluator by proposing a more comprehensive set of incoherent stories based on plot manipulations.
— Sarik (@Sarikgha) March 19, 2021

Bib Entry

@inproceedings{ghazarian2021plot,
  title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  publisher = {Association for Computational Linguistics},
  pages = {4334--4344},
  year = {2021}
}

Related Publications

Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.

Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models’ generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the _contrast_ of two probabilistic distributions – one known to be superior to the other – to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) _Generative_ CDM, which harnesses the contrast of two language models’ distributions to generate synthetic examples for training discriminator-based metrics; 2) _Discriminative_ CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate that CDM correlates better with human judgment than existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.

@inproceedings{lu2024cdm,
  title = {Open-Domain Text Evaluation via Contrastive Distribution Methods},
  author = {Lu, Sidi and Liu, Hongyi and Celikyilmaz, Asli and Wang, Tianlu and Peng, Nanyun},
  booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
  year = {2024}
}

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.

@inproceedings{wadhawan2024contextual,
  title = {ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models},
  author = {Wadhawan, Rohan and Bansal, Hritik and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
  year = {2024}
}

AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, and Nanyun Peng, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.

@inproceedings{qiu2024amrfact,
  title = {AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation},
  author = {Qiu, Haoyi and Huang, Kung-Hsiang and Qu, Jingnong and Peng, Nanyun},
  booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year = {2024}
}

ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems

Sarik Ghazarian*, Yijia Shao*, Rujun Han, Aram Galstyan, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

Commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. We take the first step by focusing on event commonsense that considers events and their relations, and is crucial in both dialogues and general commonsense reasoning. We propose ACCENT, an event commonsense evaluation metric empowered by commonsense knowledge bases (CSKBs). ACCENT first extracts event-relation tuples from a dialogue, and then evaluates the response by scoring the tuples in terms of their compatibility with the CSKB. To evaluate ACCENT, we construct the first public event commonsense evaluation dataset for open-domain dialogues. Our experiments show that ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.

@inproceedings{ghazarian2023accent,
  title = {ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems},
  author = {Ghazarian*, Sarik and Shao*, Yijia and Han, Rujun and Galstyan, Aram and Peng, Nanyun},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2023}
}

EnDex: Evaluation of Dialogue Engagingness at Scale

Guangxuan Xu, Nischal Reddy Chandra, Ruibo Liu, Fabrice Harel-Canada, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.

We propose EnDex, the first human-reaction based model to evaluate dialogue engagingness. EnDex is trained on the 80k Reddit-based Engagement Dataset (RED), curated using a novel distant-supervision framework. Engagingness is a key measure that captures high-level quality of AI dialogue systems and closely reflects actual user experience. However, data shortage, plus the abstract and extensive definition of engagingness, make it challenging to develop an automatic metric. Our work departs from mainstream approaches that use synthetic negative examples to train binary classifiers and instead proposes a solution using distant supervision from human-reaction feedback. To support the soundness of our EnDex metric, we offer a theoretical foundation for engagement, an extensive ablation study, and empirical evidence of high correlation on five engagingness-related datasets. We will release code, an off-the-shelf EnDex model, and a large-scale dataset upon paper publication to facilitate future research.

@inproceedings{xu2022endex,
  title = {EnDex: Evaluation of Dialogue Engagingness at Scale},
  author = {Xu, Guangxuan and Chandra, Nischal Reddy and Liu, Ruibo and Harel-Canada, Fabrice and Peng, Nanyun},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
  year = {2022}
}

DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

Sarik Ghazarian, Nuan Wen, Aram Galstyan, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

Automatic evaluation metrics are essential for the rapid development of open-domain dialogue systems as they facilitate hyper-parameter tuning and comparison between models. Although recently proposed trainable conversation-level metrics have shown encouraging results, the quality of the metrics is strongly dependent on the quality of training data. Prior works mainly resort to heuristic text-level manipulations (e.g., utterance shuffling) to bootstrap incoherent conversations (negative examples) from coherent dialogues (positive examples). Such approaches are insufficient to appropriately reflect the incoherence that occurs in interactions between advanced dialogue models and humans. To tackle this problem, we propose DEAM, a Dialogue coherence Evaluation metric that relies on Abstract Meaning Representation (AMR) to apply semantic-level Manipulations for incoherent (negative) data generation. AMRs naturally facilitate the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decreased engagement, at the semantic level, thus resulting in more natural incoherent samples. Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods on several dialog datasets by significant margins. We also show that DEAM can distinguish between coherent and incoherent dialogues generated by baseline manipulations, whereas those baseline models cannot detect incoherent examples generated by DEAM. Our results demonstrate the potential of AMR-based semantic manipulations for natural negative example generation.

@inproceedings{ghazarian2022deam,
  title = {DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations},
  author = {Ghazarian, Sarik and Wen, Nuan and Galstyan, Aram and Peng, Nanyun},
  booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2022}
}

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020.

User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can be incorporated into automatic evaluation metrics for open-domain dialogue systems to improve the correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.

@inproceedings{ghazarian2020predictive,
  title = {Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems},
  author = {Ghazarian, Sarik and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
  pages = {7789--7796},
  year = {2020}
}

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop, 2019.

@inproceedings{ghazarian2019better,
  title = {Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings},
  author = {Ghazarian, Sarik and Wei, Johnny Tian-Zheng and Galstyan, Aram and Peng, Nanyun},
  booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop},
  year = {2019}
}

Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples

Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan, in CCF International Conference on Natural Language Processing and Chinese Computing, 2019.

@inproceedings{li2019evaluating,
  title = {Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples},
  author = {Li, Jia and Tao, Chongyang and Peng, Nanyun and Wu, Wei and Zhao, Dongyan and Yan, Rui},
  booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
  pages = {142--154},
  year = {2019},
  organization = {Springer}
}
