My Google Scholar

Download the bibfile


Preprint

    2024

    1. Adaptable Logical Control for Large Language Models

      Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, and Nanyun Peng, in Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
      Abstract BibTeX Details
      Despite the success of Large Language Models (LLMs) in performing various tasks with provided instructions, controlling model generation during inference poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control over LLM generation. Ctrl-G can combine any production-ready LLM with a Hidden Markov Model (HMM), enabling output generation that adheres to logical constraints represented as deterministic finite automata (DFAs), including keyword control, length control, and insertion. Our study demonstrates that Ctrl-G, coupled with a TULU-2-7B model, outperforms GPT3.5 and GPT4 in human evaluations of interactive text editing by a 30% higher overall satisfaction rate, and exhibits high-quality generation with 100% constraint satisfaction. Additionally, our experiment on the Grade School Math (GSM) dataset highlights the potential of applying Ctrl-G beyond natural language generation (NLG) tasks. By guiding the reasoning process with logical constraints, we achieved a 3.4% improvement on the GSM subset, underscoring Ctrl-G’s broader applicability.
      @inproceedings{zhang2024adaptable,
        title = {Adaptable Logical Control for Large Language Models},
        author = {Zhang, Honghua and Kung, Po-Nien and Yoshida, Masahiro and Van den Broeck, Guy and Peng, Nanyun},
        year = {2024},
        booktitle = {Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)}
      }
      
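      A note on the entry above: a "keyword constraint represented as a DFA" is simply an automaton whose accepting state becomes reachable only once the keyword has appeared. The Python sketch below is purely illustrative and is not the Ctrl-G implementation (which pairs such automata with an HMM to steer generation, rather than merely checking finished text); the class name and the character-level alphabet are assumptions made for the example.

      from dataclasses import dataclass

      @dataclass(frozen=True)
      class KeywordDFA:
          """Tiny DFA that accepts a string iff it contains `keyword` (illustrative only)."""
          keyword: str

          def step(self, state: int, ch: str) -> int:
              # State i means the most recently read characters match keyword[:i].
              if state == len(self.keyword):
                  return state  # the accepting state is absorbing
              s = self.keyword[:state] + ch
              # Next state = length of the longest suffix of s that is a prefix of keyword.
              for k in range(min(len(s), len(self.keyword)), 0, -1):
                  if self.keyword[:k] == s[-k:]:
                      return k
              return 0

          def accepts(self, text: str) -> bool:
              state = 0
              for ch in text:
                  state = self.step(state, ch)
              return state == len(self.keyword)

      if __name__ == "__main__":
          dfa = KeywordDFA("moon")
          print(dfa.accepts("the moonlight was bright"))  # True
          print(dfa.accepts("the sun was bright"))        # False
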
    2. SafeWorld: Geo-Diverse Safety Alignment

      Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, and Nanyun Peng, in Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
      Abstract BibTeX Details
      In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To reveal the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark specifically designed to evaluate LLMs’ ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SafeWorld encompasses 2,775 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. On top of it, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria effectively. To enhance LLMs’ alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SafeWorldLM outperforms all competing models, including GPT-4o on all the three evaluation dimensions by a large margin. Global human evaluators also note a nearly 20% higher winning rate in helpfulness and harmfulness evaluation.
      @inproceedings{yin2024safeworld,
        title = {SafeWorld: Geo-Diverse Safety Alignment},
        author = {Yin, Da and Qiu, Haoyi and Huang, Kung-Hsiang and Chang, Kai-Wei and Peng, Nanyun},
        year = {2024},
        booktitle = {Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)}
      }
      
    3. Matryoshka Query Transformer for Large Vision-Language Models

      Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang, in Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
      Full Text Code Abstract BibTeX Details
      Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) only sacrifices performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.
      @inproceedings{hu2024mqt,
        title = {Matryoshka Query Transformer for Large Vision-Language Models},
        author = {Hu, Wenbo and Dou, Zi-Yi and Li, Liunian Harold and Kamath, Amita and Peng, Nanyun and Chang, Kai-Wei},
        year = {2024},
        booktitle = {Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
        project_website = {https://gordonhu608.github.io/mqtllava/}
      }
      
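      To make the training trick described in the entry above concrete (keep only a random prefix of the M latent query tokens at each step, so the inference-time budget m can be chosen freely), here is a minimal PyTorch sketch. It is not the authors' implementation; the module name, dimensions, and the single cross-attention layer are assumptions made for illustration.

      from typing import Optional

      import torch
      import torch.nn as nn

      class MatryoshkaQueryPooler(nn.Module):
          """Toy sketch of Matryoshka-style query compression (not the MQT-LLaVA code)."""

          def __init__(self, max_queries: int = 256, dim: int = 768, num_heads: int = 8):
              super().__init__()
              # M learnable latent query tokens that cross-attend to image features.
              self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
              self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

          def forward(self, image_feats: torch.Tensor, m: Optional[int] = None) -> torch.Tensor:
              # image_feats: (batch, num_patches, dim), e.g. 576 patch embeddings.
              if m is None:  # training: sample a prefix length m <= M
                  m = int(torch.randint(1, self.queries.size(0) + 1, (1,)).item())
              q = self.queries[:m].unsqueeze(0).expand(image_feats.size(0), -1, -1)
              visual_tokens, _ = self.cross_attn(q, image_feats, image_feats)
              return visual_tokens  # (batch, m, dim), handed to the language model

      if __name__ == "__main__":
          pooler = MatryoshkaQueryPooler()
          feats = torch.randn(2, 576, 768)
          print(pooler(feats, m=16).shape)  # torch.Size([2, 16, 768])
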
    4. DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation

      Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Tang Mohan, Kai-Wei Chang, Nanyun Peng, and Haoran Huang, in Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024.
      Full Text Code Abstract BibTeX Details
      Data analysis is a crucial analytical process essential for deriving insights from real-world databases. As shown in Figure 1, the need for data analysis typically arises from specific application scenarios, and requires diverse reasoning skills including mathematical reasoning, logical reasoning, and strategic reasoning. Existing work often focuses on simple factual retrieval or arithmetic resolution and is thus insufficient for addressing complex real-world queries. This work aims to propose new resources and benchmarks on this crucial yet challenging and under-explored task. Due to the prohibitively high cost of collecting expert annotations, we use large language models (LLMs) enhanced by code generation to automatically generate high-quality data analysis, which will later be refined by human annotators. We construct the DACO dataset, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) 2k automatically generated query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human-refined annotations that serves as our main evaluation benchmark. Experiments show that while LLMs like GPT-4 exhibit promising data analysis capabilities, they are still evaluated as less helpful than human-written analysis in 58.1% of cases. Leveraging our weak supervision data, we experiment with various fine-tuning methods, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Our trained model outperforms existing baselines for table question answering, and RLHF further boosts the helpfulness of generated analysis in 58.5% of cases.
      @inproceedings{wu2024daco,
        title = {DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation},
        author = {Wu, Xueqing and Zheng, Rui and Sha, Jingzhen and Wu, Te-Lin and Zhou, Hanyu and Mohan, Tang and Chang, Kai-Wei and Peng, Nanyun and Huang, Haoran},
        year = {2024},
        booktitle = {Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track}
      }
      
    5. Measuring Psychological Depth in Language Models

      Fabrice Y. Harel-Canada, Hanyu Zhou, Sreya Muppalla, Zeynep Senahan Yildiz, Miryung Kim, Amit Sahai, and Nanyun Peng, in Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Full Text Code Abstract BibTeX Details 🏆 Outstanding Paper Award (<0.4%)
      Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story’s subjective, psychological impact from a reader’s perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM’s ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff’s alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
      @inproceedings{harel2024measuring,
        author = {Harel-Canada, Fabrice Y and Zhou, Hanyu and Muppalla, Sreya and Yildiz, Zeynep Senahan and Kim, Miryung and Sahai, Amit and Peng, Nanyun},
        title = {Measuring Psychological Depth in Language Models},
        booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    6. Are Large Language Models Capable of Generating Human-Level Narratives?

      Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, Alexander Spangher, Muhao Chen, Jonathan May, and Nanyun Peng, in Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Full Text Code Abstract BibTeX Details 🏆 Outstanding Paper Award (<0.4%)
      This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression. We introduce a novel computational framework to analyze narratives through three discourse-level aspects: i) story arcs, ii) turning points, and iii) affective dimensions, including arousal and valence. By leveraging expert and automatic annotations, we uncover significant discrepancies between the LLM- and human-written stories. While human-written stories are suspenseful, arousing, and diverse in narrative structures, LLM stories are homogeneously positive and lack tension. Next, we measure narrative reasoning skills as a precursor to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, we show that explicit integration of the aforementioned discourse features can enhance storytelling, as is demonstrated by over 40% improvement in neural storytelling in terms of diversity, suspense, and arousal.
      @inproceedings{tian2024are,
        author = {Tian, Yufei and Huang, Tenghao and Liu, Miri and Jiang, Derek and Spangher, Alexander and Chen, Muhao and May, Jonathan and Peng, Nanyun},
        title = {Are Large Language Models Capable of Generating Human-Level Narratives?},
        booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    7. Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs

      Alexander Spangher, Nanyun Peng, Sebastian Gehrmann, and Mark Dredze, in Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Abstract BibTeX Details 🏆 Outstanding Paper Award (<0.4%)
      Journalists engage in multiple steps in the news writing process that depend on human creativity, like exploring different “angles” (i.e., story directions). These can potentially be aided by large language models (LLMs). By affecting planning decisions, such interventions can have an outsize impact on creative output. We advocate a careful approach to evaluating these interventions, to ensure alignment with human values, by comparing LLM decisions to previous human decisions. In a case study of journalistic coverage of press releases, we assemble a large dataset of 250k press releases and 650k human-written articles covering them. We develop methods to identify news articles that challenge and contextualize press releases. Finally, we evaluate suggestions made by LLMs for these articles and compare these with decisions made by human journalists.
      @inproceedings{spangher2024llm_planning,
        author = {Spangher, Alexander and Peng, Nanyun and Gehrmann, Sebastian and Dredze, Mark},
        title = {Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs},
        booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    8. Evaluating LLMs’ Capability in Satisfying Lexical Constraints

      Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Abstract BibTeX Details 🏆 Best Paper Nomination (2%)
      This paper analyzes the performance of LLMs in Lexical Constrained Generation (LCG) tasks, identifying key limitations and proposing the Divide and Conquer Generation strategy. Our approach significantly enhances LLMs’ success rate in satisfying lexical constraints across various tasks, providing insights into improving text generation applications.
      @inproceedings{li2024evaluating,
        title = {Evaluating LLMs' Capability in Satisfying Lexical Constraints},
        author = {Li, Bingxuan and Wang, Yiwei and Meng, Tao and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    9. Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue

      Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, and Nanyun Peng, in Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Full Text Code Abstract BibTeX Details 🏆 Best Paper Nomination (2%)
      Model editing is a technique that edits large language models (LLMs) with updated knowledge to alleviate hallucinations without resource-intensive retraining. This paper systematically analyzes the side effects of model editing methods and proposes a regularization method to address the overfitting. Our experiments show that it is challenging for current editing methods to improve factuality while maintaining general abilities. We propose RECT (RElative Change in weighT) to mitigate side effects, showing significant performance retention.
      @inproceedings{gu2024model,
        author = {Gu, Jia-Chen and Xu, Hao-Xiang and Ma, Jun-Yu and Lu, Pan and Ling, Zhen-Hua and Chang, Kai-Wei and Peng, Nanyun},
        title = {Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue},
        booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    10. Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM

      Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, and Tagyoung Chung, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Abstract BibTeX Details 🏆 Best Paper Nomination (2%)
      Contrastive decoding (CD) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. This paper theoretically explains why CD works well and introduces a new method, Asymptotic Probability Decoding (APD), to overcome its limitations. Experiments show that APD significantly boosts factuality in open-ended text generation and achieves new state-of-the-art results across multiple datasets.
      @inproceedings{chang2024contrastive,
        title = {Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM},
        author = {Chang, Haw-Shiuan and Peng, Nanyun and Bansal, Mohit and Ramakrishna, Anil and Chung, Tagyoung},
        booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
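      For background on the setup analyzed in the entry above: standard contrastive decoding scores the next token by the expert LM's log-probability minus the amateur LM's, restricted to tokens the expert itself finds plausible. The sketch below implements only that baseline rule, not the paper's Asymptotic Probability Decoding; the alpha plausibility cutoff follows the original CD formulation and is an assumption here.

      import torch

      def contrastive_decoding_scores(expert_logits: torch.Tensor,
                                      amateur_logits: torch.Tensor,
                                      alpha: float = 0.1) -> torch.Tensor:
          """Vanilla contrastive decoding scores for one decoding step (baseline only)."""
          expert_logp = torch.log_softmax(expert_logits, dim=-1)
          amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
          # Plausibility mask: discard tokens the expert itself considers unlikely.
          cutoff = torch.log(torch.tensor(alpha)) + expert_logp.max(dim=-1, keepdim=True).values
          scores = expert_logp - amateur_logp
          return scores.masked_fill(expert_logp < cutoff, float("-inf"))

      if __name__ == "__main__":
          vocab_size = 8
          expert = torch.randn(1, vocab_size)
          amateur = torch.randn(1, vocab_size)
          print(contrastive_decoding_scores(expert, amateur).argmax(dim=-1))
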
    11. Re-ReST: Reflection-Reinforced Self-Training for Language Agents

      Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng, in Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Full Text Code Abstract BibTeX Details
      Finetuning language agents with reasoning-action trajectories is effective, but obtaining these trajectories from human annotations or stronger models is costly and sometimes impractical. In this paper, we investigate the use of self-training in language agents, which can generate supervision from the agent itself, offering a promising alternative without relying on human or stronger model demonstrations. Self-training, however, requires high-quality model-generated samples, which are hard to obtain for challenging language agent tasks. To address this, we present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples during self-training. The reflector takes the agent’s output and feedback from an external environment to produce improved samples. We conduct extensive experiments on open-source language agents across tasks, demonstrating the effectiveness of self-training and Re-ReST in language agent tasks.
      @inproceedings{dou2024rerest,
        author = {Dou, Zi-Yi and Yang, Cheng-Fu and Wu, Xueqing and Chang, Kai-Wei and Peng, Nanyun},
        title = {Re-ReST: Reflection-Reinforced Self-Training for Language Agents},
        booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024},
        keywords = {agent}
      }
      
    12. SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness

      Tanmay Parekh, Jeffrey Kwan, Jiarui Yu, Sparsh Johri, Hyosang Ahn, Sreya Muppalla, Kai-Wei Chang, Wei Wang, and Nanyun Peng, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Code Abstract BibTeX Details
      We introduce SPEED++, the first multilingual Event Extraction framework for extracting epidemic-related information from social media. Our framework is capable of providing epidemic warnings in diverse languages and demonstrates the efficacy of zero-shot cross-lingual models trained on English data for extracting information relevant to various diseases.
      @inproceedings{parekh2024speed,
        title = {SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness},
        author = {Parekh, Tanmay and Kwan, Jeffrey and Yu, Jiarui and Johri, Sparsh and Ahn, Hyosang and Muppalla, Sreya and Chang, Kai-Wei and Wang, Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    13. Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation

      Di Wu, Jia-Chen Gu, Fan Yin, Nanyun Peng, and Kai-Wei Chang, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
      Full Text Code Abstract BibTeX Details
      This paper proposes SynCheck, a lightweight monitor that detects unfaithful sentences in retrieval-augmented language models (RALMs). By integrating fine-grained decoding dynamics, SynCheck outperforms existing baselines in faithfulness detection. We also introduce FOD, a faithfulness-oriented decoding algorithm that significantly improves the faithfulness of long-form generation outputs.
      @inproceedings{wu2024synchronous,
        title = {Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation},
        author = {Wu, Di and Gu, Jia-Chen and Yin, Fan and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2024}
      }
      
    14. QUDSELECT: Selective Decoding for Questions Under Discussion Parsing

      Ashima Suvarna, Xiao Liu, Tanmay Parekh, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2024.
      Abstract BibTeX Details
      Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences. In QUD parsing, each sentence is viewed as an answer to a question triggered by an anchor sentence in prior context. The resulting QUD structure is required to conform to several theoretical criteria, making QUD parsing a challenging task. We introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria. Our method outperforms state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation, demonstrating the effectiveness of our framework.
      @inproceedings{suvarna2024qudselect,
        title = {QUDSELECT: Selective Decoding for Questions Under Discussion Parsing},
        author = {Suvarna, Ashima and Liu, Xiao and Parekh, Tanmay and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
        year = {2024}
      }
      
    15. Detecting Machine-Generated Long-Form Content with Latent-Space Variables

      Yufei Tian, Zeyu Pan, and Nanyun Peng, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Full Text Abstract BibTeX Details
      We propose a robust method to detect machine-generated long-form text by incorporating abstract elements as key deciding factors, leading to a 31% improvement over existing baselines.
      @inproceedings{tian2024detecting,
        author = {Tian, Yufei and Pan, Zeyu and Peng, Nanyun},
        title = {Detecting Machine-Generated Long-Form Content with Latent-Space Variables},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
    16. VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

      Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, and Kai-Wei Chang, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Full Text Code Abstract BibTeX Details
      Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger’s effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger’s ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.
      @inproceedings{wu2024vdebugger,
        author = {Wu, Xueqing and Lin, Zongyu and Zhao, Songyan and Wu, Te-Lin and Lu, Pan and Peng, Nanyun and Chang, Kai-Wei},
        title = {VDebugger: Harnessing Execution Feedback for Debugging Visual Programs},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
    17. LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning

      Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Abstract BibTeX Details
      Path planning is a fundamental scientific problem in robotics and autonomous navigation. We propose LLM-A*, a novel route planning method that combines the precise pathfinding capabilities of A* with the global reasoning capability of large language models (LLMs). This hybrid approach aims to enhance pathfinding efficiency in terms of time and space complexity while maintaining the integrity of path validity, especially in large-scale scenarios.
      @inproceedings{meng2024llm,
        title = {LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning},
        author = {Meng, Silin and Wang, Yiwei and Yang, Cheng-Fu and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
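      Since LLM-A* (entry above) builds on classical A*, a compact reference implementation of that baseline may help readers unfamiliar with it. The sketch below is the textbook algorithm on a toy grid and does not reproduce how the paper injects LLM guidance into the search; the grid encoding and Manhattan heuristic are assumptions made for the example.

      import heapq
      from typing import Dict, List, Optional, Tuple

      Coord = Tuple[int, int]

      def astar(grid: List[List[int]], start: Coord, goal: Coord) -> Optional[List[Coord]]:
          """Plain A* on a 4-connected grid (0 = free cell, 1 = obstacle)."""
          def h(p: Coord) -> int:  # Manhattan-distance heuristic
              return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

          rows, cols = len(grid), len(grid[0])
          frontier: List[Tuple[int, Coord]] = [(h(start), start)]
          g: Dict[Coord, int] = {start: 0}
          parent: Dict[Coord, Coord] = {}
          while frontier:
              _, cur = heapq.heappop(frontier)
              if cur == goal:  # reconstruct the path back to the start
                  path = [cur]
                  while cur in parent:
                      cur = parent[cur]
                      path.append(cur)
                  return path[::-1]
              for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                  nxt = (cur[0] + dr, cur[1] + dc)
                  if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                      ng = g[cur] + 1
                      if ng < g.get(nxt, float("inf")):
                          g[nxt] = ng
                          parent[nxt] = cur
                          heapq.heappush(frontier, (ng + h(nxt), nxt))
          return None

      if __name__ == "__main__":
          grid = [[0, 0, 0],
                  [1, 1, 0],
                  [0, 0, 0]]
          print(astar(grid, (0, 0), (2, 0)))  # detours around the obstacles in row 1
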
    18. LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

      Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, and Nanyun Peng, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Abstract BibTeX Details
      We investigate LLMs’ capability in following multi-constrained instructions, introducing the Decompose, Critique, and Refine (DeCRIM) self-correction pipeline. This approach significantly enhances the ability of LLMs to handle complex constraints, and our experiments demonstrate substantial improvements in instruction adherence across multiple evaluation metrics.
      @inproceedings{ferraz2024llm,
        title = {LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints},
        author = {Ferraz, Thomas Palmeira and Mehta, Kartik and Lin, Yu-Hsiang and Chang, Haw-Shiuan and Oraby, Shereen and Liu, Sijia and Subramanian, Vivek and Chung, Tagyoung and Bansal, Mohit and Peng, Nanyun},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
    19. Explaining Mixtures of Sources in News Articles

      Alexander Spangher, James Youn, Matt DeButts, Nanyun Peng, and Jonathan May, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Abstract BibTeX Details
      Human writers plan, then write. For large language models (LLMs) to play a role in longer-form article generation, we must understand the planning steps humans make before writing. We explore one kind of planning, source-selection in news, as a case-study for evaluating plans in long-form generation. We ask: why do specific stories call for specific kinds of sources? We imagine a process where sources are selected to fall into different categories. Learning the article’s plan means predicting the categorization scheme chosen by the journalist. Inspired by latent-variable modeling, we first develop metrics to select the most likely plan underlying a story. Then, working with professional journalists, we adapt five existing approaches to planning and introduce three new ones. We find that two approaches, or schemas (stance and social affiliation), best explain source plans in most documents. However, other schemas like textual entailment explain source plans in factually rich topics like "Science". Finally, we find we can predict the most suitable schema given just the article’s headline with reasonable accuracy. We see this as an important case-study for human planning, one that provides a framework and approach for evaluating other kinds of plans, like discourse or plot-oriented plans. We release a corpus, NewsSources, with schema annotations for 4M articles, for further study.
      @inproceedings{spangher2024source_explaining,
        author = {Spangher, Alexander and Youn, James and DeButts, Matt and Peng, Nanyun and May, Jonathan},
        title = {Explaining Mixtures of Sources in News Articles},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
    20. Uncertainty Calibration for Tool-Using Language Agents

      Hao Liu, Zi-Yi Dou, Yixin Wang, Nanyun Peng, and Yisong Yue, in Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2024.
      Abstract BibTeX Details
      There is increasing interest in equipping language models with the ability to leverage external tools for complex, goal-oriented tasks. However, interacting with external tools introduces inherent uncertainties due to imperfections and misalignments between the tools’ outputs and the agents’ internal models, often leading to suboptimal outcomes. We thus study the problem of tool-use calibration in language agents, and identify prompt design and execution trace selection as two primary areas that suffer from miscalibration. We then propose ProbeCal, which recalibrates the internal probabilities of tool-using language agents to better reflect the actual effectiveness of the tool, and enables a more appropriate selection of prompts and execution paths. We empirically show that ProbeCal can significantly and consistently improve off-the-shelf language models in tool-using applications.
      @inproceedings{liu2024uncertainty_calibration,
        author = {Liu, Hao and Dou, Zi-Yi and Wang, Yixin and Peng, Nanyun and Yue, Yisong},
        title = {Uncertainty Calibration for Tool-Using Language Agents},
        booktitle = {Proceedings of the Findings of ACL at The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2024}
      }
      
    21. Open-Domain Text Evaluation via Contrastive Distribution Methods

      Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.
      Full Text BibTeX Details
      @inproceedings{lu2024cdm,
        title = {Open-Domain Text Evaluation via Contrastive Distribution Methods},
        author = {Lu, Sidi and Liu, Hongyi and Celikyilmaz, Asli and Wang, Tianlu and Peng, Nanyun},
        booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
        year = {2024}
      }
      
    22. On Prompt-Driven Safeguarding for Large Language Models

      Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.
      Full Text BibTeX Details
      @inproceedings{zheng2024dro,
        title = {On Prompt-Driven Safeguarding for Large Language Models},
        author = {Zheng, Chujie and Yin, Fan and Zhou, Hao and Meng, Fandong and Zhou, Jie and Chang, Kai-Wei and Huang, Minlie and Peng, Nanyun},
        booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
        year = {2024}
      }
      
    23. DiNADO: Norm-Disentangled Neurally-Decomposed Oracles for Controlling Language Models

      Sidi Lu, Wenbo Zhao, Chenyang Tao, Arpit Gupta, Shanchan Wu, Tagyoung Chung, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.
      BibTeX Details
      @inproceedings{lu2024nado2,
        title = {DiNADO: Norm-Disentangled Neurally-Decomposed Oracles for Controlling Language Models},
        author = {Lu, Sidi and Zhao, Wenbo and Tao, Chenyang and Gupta, Arpit and Wu, Shanchan and Chung, Tagyoung and Peng, Nanyun},
        booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
        year = {2024}
      }
      
    24. ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

      Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2024.
      Full Text BibTeX Details
      @inproceedings{wadhawan2024contextual,
        title = {ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models},
        author = {Wadhawan, Rohan and Bansal, Hritik and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
        year = {2024}
      }
      
    25. Improving Event Definition Following For Zero-Shot Event Detection

      Zefan Cai, Po-Nien Kung, Ashima Suvarna, Mingyu Derek Ma, Hritik Bansal, Baobao Chang, P. Jeffrey Brantingham, Wei Wang, and Nanyun Peng, in Proceedings of The 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
      BibTeX Details
      @inproceedings{cai2024improving,
        title = {Improving Event Definition Following For Zero-Shot Event Detection},
        author = {Cai, Zefan and Kung, Po-Nien and Suvarna, Ashima and Ma, Mingyu Derek and Bansal, Hritik and Chang, Baobao and Brantingham, P. Jeffrey and Wang, Wei and Peng, Nanyun},
        booktitle = {Proceedings of The 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2024}
      }
      
    26. Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

      Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, and Nanyun Peng, in Findings of the Association for Computational Linguistics: ACL (ACL-findings), 2024.
      Full Text Code BibTeX Details
      @inproceedings{Qiu2024,
        title = {Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models},
        author = {Qiu, Haoyi and Hu, Wenbo and Dou, Zi-Yi and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: ACL (ACL-findings)},
        year = {2024},
        project_website = {https://gordonhu608.github.io/VALOR-Eval/}
      }
      
    27. Argument-Aware Approach To Event Linking

      I.-Hung Hsu, Zihan Xue, Nilay Pochhi, Sahil Bansal, Prem Natarajan, Jayanth Srinivasa, and Nanyun Peng, in Findings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL-Findings), 2024.
      BibTeX Details
    28. Tracking the Newsworthiness of Public Documents

      Alexander Spangher, Serdar Tumgoren, Ben Welsh, Nanyun Peng, Emilio Ferrara, and Jonathan May, in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
      BibTeX Details
      @inproceedings{Spangher2024,
        title = {Tracking the Newsworthiness of Public Documents},
        author = {Spangher, Alexander and Tumgoren, Serdar and Welsh, Ben and Peng, Nanyun and Ferrara, Emilio and May, Jonathan},
        booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2024}
      }
      
    29. TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction

      Kuan-Hao Huang, I.-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Prem Natarajan, Kai-Wei Chang, Nanyun Peng, and Heng Ji, in Findings of the Association for Computational Linguistics: ACL (ACL-findings), 2024.
      BibTeX Details
      @inproceedings{Huang2024,
        title = {TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction},
        author = {Huang, Kuan-Hao and Hsu, I-Hung and Parekh, Tanmay and Xie, Zhiyu and Zhang, Zixuan and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun and Ji, Heng},
        booktitle = {Findings of the Association for Computational Linguistics: ACL (ACL-findings)},
        year = {2024}
      }
      
    30. CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation

      I.-Hung Hsu, Zifeng Wang, Long Le, Lesly Miculicich, Nanyun Peng, Chen-Yu Lee, and Tomas Pfister, in Findings of the Association for Computational Linguistics: ACL (ACL-findings), 2024.
      BibTeX Details
      @inproceedings{Hsu2024b,
        title = {CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation},
        author = {Hsu, I-Hung and Wang, Zifeng and Le, Long and Miculicich, Lesly and Peng, Nanyun and Lee, Chen-Yu and Pfister, Tomas},
        booktitle = {Findings of the Association for Computational Linguistics: ACL (ACL-findings)},
        year = {2024}
      }
      
    31. MacGyver: Are Large Language Models Creative Problem Solvers?

      Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas L. Griffiths, and Faeze Brahman, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
      Full Text BibTeX Details 🏆 Best Paper Nomination
      @inproceedings{tian2024macgyver,
        title = {MacGyver: Are Large Language Models Creative Problem Solvers?},
        author = {Tian, Yufei and Ravichander, Abhilasha and Qin, Lianhui and Le Bras, Ronan and Marjieh, Raja and Peng, Nanyun and Choi, Yejin and Griffiths, Thomas L. and Brahman, Faeze},
        booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2024}
      }
      
    32. AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

      Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, and Nanyun Peng, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
      Full Text Code BibTeX Details
      @inproceedings{qiu2024amrfact,
        title = {AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation},
        author = {Qiu, Haoyi and Huang, Kung-Hsiang and Qu, Jingnong and Peng, Nanyun},
        booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2024}
      }
      
    33. Contextual Label Projection for Cross-Lingual Structured Prediction

      Tanmay Parekh, I.-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
      Full Text Code BibTeX Details 🏆 Best Paper Nomination
      @inproceedings{parekh2024clap,
        title = {Contextual Label Projection for Cross-Lingual Structured Prediction},
        author = {Parekh, Tanmay and Hsu, I-Hung and Huang, Kuan-Hao and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2024}
      }
      
    34. Event Detection from Social Media for Epidemic Prediction

      Tanmay Parekh, Anh Mac, Jiarui Yu, Yuxuan Dong, Syed Shahriar, Bonnie Liu, Eric J. Yang, Kuan-Hao Huang, Wei Wang, Nanyun Peng, and Kai-Wei Chang, in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
      Full Text Code BibTeX Details
      @inproceedings{parekh2024pipp,
        title = {Event Detection from Social Media for Epidemic Prediction},
        author = {Parekh, Tanmay and Mac, Anh and Yu, Jiarui and Dong, Yuxuan and Shahriar, Syed and Liu, Bonnie and Yang, Eric J and Huang, Kuan-Hao and Wang, Wei and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2024}
      }
      
    35. Mitigating Bias for Question Answering Models by Tracking Bias Influence

      Mingyu Derek Ma, Jiun-Yu Kao, Arpit Gupta, Yu-Hsiang Lin, Wenbo Zhao, Tagyoung Chung, Wei Wang, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
      Full Text BibTeX Details
      @inproceedings{ma2024bias,
        title = {Mitigating Bias for Question Answering Models by Tracking Bias Influence},
        author = {Ma, Mingyu Derek and Kao, Jiun-Yu and Gupta, Arpit and Lin, Yu-Hsiang and Zhao, Wenbo and Chung, Tagyoung and Wang, Wei and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2024}
      }
      
    36. Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking

      Hong Jin Kang*, Fabrice Y. Harel-Canada*, Muhammad Ali Gulzar, Nanyun Peng, and Miryung Kim, in Findings of Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-Findings), 2024.
      Full Text BibTeX Details
      @inproceedings{kang2024hitl,
        title = {Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking},
        author = {Kang*, Hong Jin and Harel-Canada*, Fabrice Y and Gulzar, Muhammad Ali and Peng, Nanyun and Kim, Miryung},
        booktitle = {Findings of Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-Findings)},
        year = {2024}
      }
      
    37. RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment

      Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, and Yuandong Tian, in Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
      Full Text BibTeX Details
      @inproceedings{yang2024rlcd,
        title = {RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment},
        author = {Yang, Kevin and Klein, Dan and Celikyilmaz, Asli and Peng, Nanyun and Tian, Yuandong},
        booktitle = {Proceedings of the Twelfth International Conference on Learning Representations (ICLR)},
        year = {2024}
      }
      
    38. STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

      Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham, Nanyun Peng, and Wei Wang, in Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), 2024.
      Full Text BibTeX Details
      @inproceedings{ma2024star,
        title = {STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models},
        author = {Ma, Mingyu Derek and Wang, Xiaoxuan and Kung, Po-Nien and Brantingham, P. Jeffrey and Peng, Nanyun and Wang, Wei},
        booktitle = {Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI)},
        year = {2024}
      }
      
    39. MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways

      Mingyu Derek Ma, Alexander K. Taylor, Nuan Wen, Yanchen Lin, Po-Nien Kung, Wenna Qin, Shicheng Wen, Azure Zhou, Diyi Yang, Xuezhe Ma, Nanyun Peng, and Wei Wang, in Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), Demonstration Track, 2024.
      Full Text BibTeX Details
      @inproceedings{ma2024middag,
        title = {MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways},
        author = {Ma, Mingyu Derek and Taylor, Alexander K. and Wen, Nuan and Lin, Yanchen and Kung, Po-Nien and Qin, Wenna and Wen, Shicheng and Zhou, Azure and Yang, Diyi and Ma, Xuezhe and Peng, Nanyun and Wang, Wei},
        booktitle = {Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), Demonstration Track},
        year = {2024}
      }
      

    2023

    1. Harnessing Black-Box Control to Boost Commonsense in LMs’ Generation

      Yufei Tian, Felix Zhang, and Nanyun Peng, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text BibTeX Details
      @inproceedings{tian2023harnessing,
        title = {Harnessing Black-Box Control to Boost Commonsense in LMs’ Generation},
        author = {Tian, Yufei and Zhang, Felix and Peng, Nanyun},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    2. Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks

      Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, and Nanyun Peng, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text Poster BibTeX Details
      @inproceedings{kung2023active,
        title = {Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks},
        author = {Kung, Po-Nien and Yin, Fan and Wu, Di and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    3. Gender Biases in Automatic Evaluation Metrics for Image Captioning

      Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, and Nanyun Peng, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text Code BibTeX Details
      @inproceedings{qiu2023gender,
        title = {Gender Biases in Automatic Evaluation Metrics for Image Captioning},
        author = {Qiu, Haoyi and Dou, Zi-Yi and Wang, Tianlu and Celikyilmaz, Asli and Peng, Nanyun},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    4. Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

      Te-Lin Wu*, Yu Zhou*, and Nanyun Peng, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text Poster Video Code BibTeX Details
      @inproceedings{wu2023localizing,
        title = {Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge},
        author = {Wu*, Te-Lin and Zhou*, Yu and Peng, Nanyun},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    5. ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

      Te-Lin Wu*, Zi-Yi Dou*, Qingyuan Hu*, Yu Hou, Nischal Reddy Chandra, Marjorie Freedman, Ralph Weischedel, and Nanyun Peng, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text BibTeX Details
      @inproceedings{wu2023acquired,
        title = {ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos},
        author = {Wu*, Te-Lin and Dou*, Zi-Yi and Hu*, Qingyuan and Hou, Yu and Chandra, Nischal Reddy and Freedman, Marjorie and Weischedel, Ralph and Peng, Nanyun},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    6. Evaluating Large Language Models on Controlled Generation Tasks

      Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Frederick Wieting, Nanyun Peng, and Xuezhe Ma, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text BibTeX Details
      @inproceedings{sun2023eval,
        title = {Evaluating Large Language Models on Controlled Generation Tasks},
        author = {Sun, Jiao and Tian, Yufei and Zhou, Wangchunshu and Xu, Nan and Hu, Qian and Gupta, Rahul and Wieting, John Frederick and Peng, Nanyun and Ma, Xuezhe},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    7. Identifying Informational Sources in News Articles

      Alexander Spangher, Nanyun Peng, Emilio Ferrara, and Jonathan May, in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
      Full Text BibTeX Details
      @inproceedings{spangher2023identifying,
        title = {Identifying Informational Sources in News Articles},
        author = {Spangher, Alexander and Peng, Nanyun and Ferrara, Emilio and May, Jonathan},
        booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2023}
      }
      
    8. “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters

      Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng, in Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2023.
      Full Text BibTeX Details
      @inproceedings{wan2023kelly,
        title = {“Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters},
        author = {Wan, Yixin and Pu, George and Sun, Jiao and Garimella, Aparna and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2023}
      }
      
    9. Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems

      Yixin Wan, Jieyu Zhao, Aman Chadha, Nanyun Peng, and Kai-Wei Chang, in Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2023.
      Full Text BibTeX Details
      @inproceedings{wan2023personalized,
        title = {Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems},
        author = {Wan, Yixin and Zhao, Jieyu and Chadha, Aman and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2023}
      }
      
    10. DesCo: Learning Object Recognition with Rich Language Descriptions

      Liunian Harold Li*, Zi-Yi Dou*, Nanyun Peng, and Kai-Wei Chang, in The 2023 Conference on Neural Information Processing Systems (NeurIPS), 2023.
      Full Text BibTeX Details
      @inproceedings{li2023desco,
        title = {DesCo: Learning Object Recognition with Rich Language Descriptions},
        author = {Li*, Liunian Harold and Dou*, Zi-Yi and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {The 2023 Conference on Neural Information Processing Systems (NeurIPS)},
        year = {2023}
      }
      
    11. Masked Path Modeling for Vision-and-Language Navigation

      Zi-Yi Dou, Feng Gao, and Nanyun Peng, in Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings), 2023.
      Full Text BibTeX Details
      @inproceedings{dou2023mpm,
        title = {Masked Path Modeling for Vision-and-Language Navigation},
        author = {Dou, Zi-Yi and Gao, Feng and Peng, Nanyun},
        booktitle = {Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-Findings)},
        year = {2023}
      }
      
    12. Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning

      Mingyu Derek Ma, Jiun-Yu Kao, Shuyang Gao, Arpit Gupta, Di Jin, Tagyoung Chung, and Nanyun Peng, in Proceedings of INTERSPEECH 2023, 2023.
      Full Text BibTeX Details
      @inproceedings{ma2023parameter,
        title = {Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning},
        author = {Ma, Mingyu Derek and Kao, Jiun-Yu and Gao, Shuyang and Gupta, Arpit and Jin, Di and Chung, Tagyoung and Peng, Nanyun},
        booktitle = {Proceedings of INTERSPEECH 2023},
        year = {2023}
      }
      
    13. LEAF: Linguistically Enhanced Event Temporal Relation Framework

      Stanley Lim, Da Yin, and Nanyun Peng, in Workshop for Pattern-based Approaches to NLP in the Age of Deep Learning (PAN-DL) at EMNLP, 2023.
      BibTeX Details 🏆 Best Paper Award
      @inproceedings{lim2023leaf,
        title = {LEAF: Linguistically Enhanced Event Temporal Relation Framework},
        author = {Lim, Stanley and Yin, Da and Peng, Nanyun},
        booktitle = {Workshop for Pattern-based Approaches to NLP in the Age of Deep Learning (PAN-DL) at EMNLP},
        year = {2023}
      }
      
    14. AMPERE: AMR-Aware Prefix for Generation-Based Event Argument Extraction Model

      I.-Hung Hsu*, Zhiyu Xie*, Kuan-Hao Huang, Premkumar Natarajan, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Poster Video Code BibTeX Details
      @inproceedings{hsu2023ampere,
        title = {AMPERE: AMR-Aware Prefix for Generation-Based Event Argument Extraction Model},
        author = {Hsu*, I-Hung and Xie*, Zhiyu and Huang, Kuan-Hao and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    15. ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems

      Sarik Ghazarian*, Yijia Shao*, Rujun Han, Aram Galstyan, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text BibTeX Details
      @inproceedings{ghazarian2023accent,
        title = {ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems},
        author = {Ghazarian*, Sarik and Shao*, Yijia and Han, Rujun and Galstyan, Aram and Peng, Nanyun},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    16. Learning Action Conditions from Instructional Manuals for Instruction Understanding

      Te-Lin Wu, Caiqi Zhang, Qingyuan Hu, Alex Spangher, and Nanyun Peng, in Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Abstract BibTeX Details
      The ability to infer pre- and postconditions of an action is vital for comprehending complex instructions, and is essential for applications such as autonomous instruction-guided agents and assistive AI that supports humans in performing physical tasks. In this work, we propose a task dubbed action condition inference, which extracts mentions of preconditions and postconditions of actions in instructional manuals. We propose a weakly supervised approach utilizing automatically constructed large-scale training instances from online instructions, and curate a densely human-annotated and validated dataset to study how well current NLP models do on the proposed task. We design two types of models that differ in whether contextualized and global information is leveraged, as well as various combinations of heuristics to construct the weak supervision. Our experiments show a >20% F1-score improvement from considering the entire instruction context and a >6% F1-score benefit from the proposed heuristics. However, the best-performing model still falls well behind human performance.
      @inproceedings{wu2023action,
        title = {Learning Action Conditions from Instructional Manuals for Instruction Understanding},
        author = {Wu, Te-Lin and Zhang, Caiqi and Hu, Qingyuan and Spangher, Alex and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    17. GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles

      Tanmay Parekh, I.-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Slides Code BibTeX Details
      @inproceedings{parekh2023geneva,
        title = {GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles},
        author = {Parekh, Tanmay and Hsu, I-Hung and Huang, Kuan-Hao and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    18. Unsupervised Melody-to-Lyric Generation

      Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Tagyoung Chung, Jing Huang, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Slides BibTeX Details
      @inproceedings{tian2023lyric,
        title = {Unsupervised Melody-to-Lyric Generation},
        author = {Tian, Yufei and Narayan-Chen, Anjali and Oraby, Shereen and Cervone, Alessandra and Sigurdsson, Gunnar and Tao, Chenyang and Zhao, Wenbo and Chung, Tagyoung and Huang, Jing and Peng, Nanyun},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    19. Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning

      Po-Nien Kung and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), short, 2023.
      Full Text Poster BibTeX Details
      @inproceedings{kung2023models,
        title = {Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning},
        author = {Kung, Po-Nien and Peng, Nanyun},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), short},
        year = {2023}
      }
      
      Details
    20. DICE: Data-Efficient Clinical Event Extraction with Generative Models

      Mingyu Derek Ma, Alexander K. Taylor, Wei Wang, and Nanyun Peng, in Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Code BibTeX Details
      @inproceedings{ma2023dice,
        title = {DICE: Data-Efficient Clinical Event Extraction with Generative Models},
        author = {Ma, Mingyu Derek and Taylor, Alexander K. and Wang, Wei and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    21. TAGPRIME: A Unified Framework for Relational Structure Extraction

      I.-Hung Hsu*, Kuan-Hao Huang*, Shuning Zhang, Wenxing Cheng, Premkumar Natarajan, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Code BibTeX Details
      @inproceedings{hsu2023tagprime,
        title = {TAGPRIME: A Unified Framework for Relational Structure Extraction},
        author = {Hsu*, I-Hung and Huang*, Kuan-Hao and Zhang, Shuning and Cheng, Wenxing and Natarajan, Premkumar and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    22. DOC: Improving Long Story Coherence With Detailed Outline Control

      Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text BibTeX Details
      @inproceedings{yang2023doc,
        title = {DOC: Improving Long Story Coherence With Detailed Outline Control},
        author = {Yang, Kevin and Klein, Dan and Peng, Nanyun and Tian, Yuandong},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    23. Are Fairy Tales Fair? Analyzing Gender Bias in Temporal Narrative Event Chains of Children’s Fairy Tales

      Paulina Toro Isaza, Guangxuan Xu, Toye Oloko, Yufang Hou, Nanyun Peng, and Dakuo Wang, in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text BibTeX Details
      @inproceedings{isaza2023fairytales,
        title = {Are Fairy Tales Fair? Analyzing Gender Bias in Temporal Narrative Event Chains of Children's Fairy Tales},
        author = {Isaza, Paulina Toro and Xu, Guangxuan and Oloko, Toye and Hou, Yufang and Peng, Nanyun and Wang, Dakuo},
        booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    24. SIMMC-VR: A Task-oriented Multimodal Dialog Dataset with Situated and Immersive VR Streams

      Te-Lin Wu, Satwik Kottur, Andrea Madotto, Mahmoud Azab, Pedro Rodriguez, Nanyun Peng, Babak Damavandi, and Seungwhan Moon, in Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
      Full Text Abstract BibTeX Details
      Building an AI assistant that can seamlessly converse and instruct humans, in a user-centric situated scenario, requires several essential abilities: (1) spatial and temporal understanding of the situated and real-time user scenes, (2) capability of grounding the actively perceived visuals of users to conversation contexts, and (3) conversational reasoning over past utterances to perform just-in-time assistance. However, we currently lack a large-scale benchmark that captures user–assistant interactions with all of the aforementioned features. To this end, we propose SIMMC-VR, extending the SIMMC 2.0 dataset, which only concerns static visual scenes, to a video-grounded task-oriented dialog dataset that captures real-world AI-assisted user scenarios in VR. We propose a novel data collection paradigm that involves (1) generating object-centric multimodal dialog flows with egocentric visual streams and visually-grounded templates, and (2) manually paraphrasing the simulated dialogs for naturalness and diversity while preserving multimodal dependencies.  To measure meaningful progress in the field, we propose four tasks to address the new challenges in SIMMC-VR, which require complex spatial-temporal dialog reasoning in active egocentric scenes. We benchmark the proposed tasks with strong multimodal models, and highlight the key capabilities that current models lack for future research directions.
      @inproceedings{wu2023simmcvr,
        title = {SIMMC-VR: A Task-oriented Multimodal Dialog Dataset with Situated and Immersive VR Streams},
        author = {Wu, Te-Lin and Kottur, Satwik and Madotto, Andrea and Azab, Mahmoud and Rodriguez, Pedro and Peng, Nanyun and Damavandi, Babak and Moon, Seungwhan},
        booktitle = {Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2023}
      }
      
      Details
    25. Code-Switching Text Synthesis in Unseen Language Pairs

      I.-Hung Hsu, Avik Ray, Shubham Garg, Nanyun Peng, and Jing Huang, in Findings of the Association for Computational Linguistics: ACL (ACL-findings), 2023.
      Full Text Slides Video BibTeX Details
      @inproceedings{hsu2023codeswitch,
        title = {Code-Switching Text Synthesis in Unseen Language Pairs},
        author = {Hsu, I-Hung and Ray, Avik and Garg, Shubham and Peng, Nanyun and Huang, Jing},
        booktitle = {Findings of the Association for Computational Linguistics: ACL (ACL-findings)},
        year = {2023}
      }
      
      Details
    26. Tractable Control for Autoregressive Language Generation

      Honghua Zhang, Meihua Dang, Nanyun Peng, and Guy Van den Broeck, in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2023.
      Full Text BibTeX Details Oral Paper (<2%)
      @inproceedings{zhang2023gelato,
        title = {Tractable Control for Autoregressive Language Generation},
        author = {Zhang, Honghua and Dang, Meihua and Peng, Nanyun and Broeck, Guy Van den},
        booktitle = {Proceedings of the Fortieth International Conference on Machine Learning (ICML)},
        year = {2023}
      }
      
      Details
    27. Generalized Decoding for Pixel, Image and Language

      Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao, in The Conference on Computer Vision and Pattern Recognition (CVPR-23), 2023.
      Full Text Code BibTeX Details
      @inproceedings{xdecoder,
        title = {Generalized Decoding for Pixel, Image and Language},
        author = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Behl, Harkirat and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee, Yong Jae and Gao, Jianfeng},
        booktitle = {The Conference on Computer Vision and Pattern Recognition (CVPR-23)},
        year = {2023}
      }
      
      Details
    28. Where Does Your News Come From? Predicting Information Pathways in Social Media

      Alexander Taylor, Nuan Wen, Po-Nien Kung, Jiaao Chen, Nanyun Peng, and Wei Wang, in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023.
      Full Text BibTeX Details
      @inproceedings{taylor2023pathway,
        title = {Where Does Your News Come From? Predicting Information Pathways in Social Media},
        author = {Taylor, Alexander and Wen, Nuan and Kung, Po-Nien and Chen, Jiaao and Peng, Nanyun and Wang, Wei},
        booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
        year = {2023}
      }
      
      Details
    29. MERCY: Multiple Response Ranking Concurrently in Realistic Open-Domain Conversational Systems

      Sarik Ghazarian, Behnam Hedayatnia, Di Jin, Sijia Liu, Nanyun Peng, Yang Liu, and Dilek Hakkani-Tur, in Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2023.
      Full Text Abstract BibTeX Details
      Automatic Evaluation (AE) and Response Selection (RS) models assign quality scores to various candidate responses and rank them in conversational setups. Prior response ranking research compares various models’ performance on synthetically generated test sets. In this work, we investigate the performance of model-based reference-free AE and RS models on our constructed response ranking datasets that mirror real-case scenarios of ranking candidates during inference time. Metrics’ unsatisfying performance can be interpreted as their low generalizability over more pragmatic conversational domains such as human-chatbot dialogs. To alleviate this issue we propose a novel RS model called MERCY that simulates human behavior in selecting the best candidate by taking into account distinct candidates concurrently and learns to rank them. In addition, MERCY leverages natural language feedback as another component to help the ranking task by explaining why each candidate response is relevant/irrelevant to the dialog context. These feedbacks are generated by prompting large language models in a few-shot setup. Our experiments show the better performance of MERCY over baselines for the response ranking task in our curated realistic datasets.
      @inproceedings{ghazarian-etal-2023-mercy,
        title = {{MERCY}: Multiple Response Ranking Concurrently in Realistic Open-Domain Conversational Systems},
        author = {Ghazarian, Sarik and Hedayatnia, Behnam and Jin, Di and Liu, Sijia and Peng, Nanyun and Liu, Yang and Hakkani-Tur, Dilek},
        booktitle = {Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
        year = {2023}
      }
      
      Details
    30. Investigating the Representation of Open Domain Dialogue Context for Transformer Models

      Vishakh Padmakumar, Behnam Hedayatnia, Di Jin, Patrick Lange, Seokhwan Kim, Nanyun Peng, Yang Liu, and Dilek Hakkani-Tur, in Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2023.
      Full Text Abstract BibTeX Details
      The bulk of work adapting transformer models to open-domain dialogue represents dialogue context as the concatenated set of turns in natural language. However, it is unclear if this is the best approach. In this work, we investigate this question by means of an empirical controlled experiment varying the dialogue context format from text-only formats (all recent utterances, summaries, selected utterances) as well as variants that are more structurally different (triples, AMR). We compare these formats based on fine-tuned model performance on two downstream tasks—knowledge selection and response generation. We find that simply concatenating the utterances works as a strong baseline in most cases, but is outperformed in longer contexts by a hybrid approach of combining a summary of the context with recent utterances. Through empirical analysis, our work highlights the need to examine the format of context representation and offers recommendations on adapting general-purpose language models to dialogue tasks.
      @inproceedings{padmakumar-etal-2023-investigating,
        title = {Investigating the Representation of Open Domain Dialogue Context for Transformer Models},
        author = {Padmakumar, Vishakh and Hedayatnia, Behnam and Jin, Di and Lange, Patrick and Kim, Seokhwan and Peng, Nanyun and Liu, Yang and Hakkani-Tur, Dilek},
        booktitle = {Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
        year = {2023}
      }
      
      Details

    2022

    1. Character-Centric Story Visualization via Visual Planning and Token Alignment

      Hong Chen, Rujun Han, Te-Lin Wu, Hideki Nakayama, and Nanyun Peng, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
      Full Text BibTeX Details
      @inproceedings{hong2022Character,
        title = {Character-Centric Story Visualization via Visual Planning and Token Alignment},
        author = {Chen, Hong and Han, Rujun and Wu, Te-Lin and Nakayama, Hideki and Peng, Nanyun},
        booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2022}
      }
      
      Details
    2. ExPUNations: Augmenting Puns with Keywords and Explanations

      Jiao Sun, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Tagyoung Chung, Jing Huang, Yang Liu, and Nanyun Peng, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
      Full Text BibTeX Details
      @inproceedings{sun2022expun,
        title = {ExPUNations: Augmenting Puns with Keywords and Explanations},
        author = {Sun, Jiao and Narayan-Chen, Anjali and Oraby, Shereen and Cervone, Alessandra and Chung, Tagyoung and Huang, Jing and Liu, Yang and Peng, Nanyun},
        booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2022}
      }
      
      Details
    3. Context-Situated Pun Generation

      Jiao Sun, Anjali Narayan-Chen, Shereen Oraby, Shuyang Gao, Tagyoung Chung, Jing Huang, Yang Liu, and Nanyun Peng, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
      Full Text BibTeX Details
      @inproceedings{sun2022context,
        title = {Context-Situated Pun Generation},
        author = {Sun, Jiao and Narayan-Chen, Anjali and Oraby, Shereen and Gao, Shuyang and Chung, Tagyoung and Huang, Jing and Liu, Yang and Peng, Nanyun},
        booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2022}
      }
      
      Details
    4. Re3: Generating Longer Stories With Recursive Reprompting and Revision

      Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
      Full Text BibTeX Details
      @inproceedings{yang2022re3,
        title = {Re3: Generating Longer Stories With Recursive Reprompting and Revision},
        author = {Yang, Kevin and Tian, Yuandong and Peng, Nanyun and Klein, Dan},
        booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2022}
      }
      
      Details
    5. A Unified Framework for Pun Generation with Humor Principles

      Yufei Tian, Divyanshu Arun Sheth, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.
      Full Text BibTeX Details
      @inproceedings{tian2022unified,
        title = {A Unified Framework for Pun Generation with Humor Principles},
        author = {Tian, Yufei and Arun Sheth, Divyanshu and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
        year = {2022}
      }
      
      Details
    6. Sequentially Controlled Text Generation

      Alexander Spangher, Yao Ming, Xinyu Hua, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.
      Full Text BibTeX Details
      @inproceedings{spangher2022sequentially,
        title = {Sequentially Controlled Text Generation},
        author = {Spangher, Alexander and Ming, Yao and Hua, Xinyu and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
        year = {2022}
      }
      
      Details
    7. Towards Robust NLG Evaluation with Syntactically-diverse Prompts

      Arshiya Aggarwal, Jiao Sun, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.
      Full Text BibTeX Details
      @inproceedings{aggarwal2022towards,
        title = {Towards Robust NLG Evaluation with Syntactically-diverse Prompts},
        author = {Aggarwal, Arshiya and Sun, Jiao and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
        year = {2022}
      }
      
      Details
    8. EnDex: Evaluation of Dialogue Engagingness at Scale

      Guangxuan Xu, Nischal Reddy Chandra, Ruibo Liu, Fabrice Harel-Canada, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings), 2022.
      Full Text BibTeX Details
      @inproceedings{xu2022endex,
        title = {EnDex: Evaluation of Dialogue Engagingness at Scale},
        author = {Xu, Guangxuan and Chandra, Nischal Reddy and Liu, Ruibo and Harel-Canada, Fabrice and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP-findings)},
        year = {2022}
      }
      
      Details
    9. InsNet: An Efficient, Flexible, and Performant Insertion-based Text Generation Model

      Sidi Lu, Tao Meng, and Nanyun Peng, in Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
      Full Text BibTeX Details
      @inproceedings{lu2022InsNet,
        title = {InsNet: An Efficient, Flexible, and Performant Insertion-based Text Generation Model},
        author = {Lu, Sidi and Meng, Tao and Peng, Nanyun},
        booktitle = {Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS)},
        year = {2022}
      }
      
      Details
    10. Controllable Text Generation with Neurally-Decomposed Oracle

      Tao Meng, Sidi Lu, Nanyun Peng, and Kai-Wei Chang, in Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
      Full Text BibTeX Details Oral Paper (<2%)
      @inproceedings{meng2022nado,
        title = {Controllable Text Generation with Neurally-Decomposed Oracle},
        author = {Meng, Tao and Lu, Sidi and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS)},
        year = {2022}
      }
      
      Details
    11. Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

      Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, and Lijuan Wang, in Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
      Full Text BibTeX Details
      @inproceedings{dou2022fiber,
        title = {Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone},
        author = {Dou, Zi-Yi and Kamath, Aishwarya and Gan, Zhe and Zhang, Pengchuan and Wang, Jianfeng and Li, Linjie and Liu, Zicheng and Liu, Ce and LeCun, Yann and Peng, Nanyun and Gao, Jianfeng and Wang, Lijuan},
        booktitle = {Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS)},
        year = {2022}
      }
      
      Details
    12. Controllable Text Generation for Open-Domain Creativity and Fairness

      Nanyun Peng, in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Early Career Track, 2022.
      Full Text BibTeX Details
      @inproceedings{peng2022controllable,
        title = {Controllable Text Generation for Open-Domain Creativity and Fairness},
        author = {Peng, Nanyun},
        booktitle = {Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Early Career Track},
        year = {2022}
      }
      
      Details
    13. NewsEdits: A News Article Revision Dataset and a Novel Document-Level Reasoning Challenge

      Alexander Spangher, Xiang Ren, Jonathan May, and Nanyun Peng, in Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
      Full Text Code BibTeX Details 🏆 Outstanding Paper Award (<0.4%)
      @inproceedings{spangher2022news,
        title = {NewsEdits: A News Article Revision Dataset and a Novel Document-Level Reasoning Challenge},
        author = {Spangher, Alexander and Ren, Xiang and May, Jonathan and Peng, Nanyun},
        booktitle = {Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2022}
      }
      
      Details
    14. Zero-Shot Sonnet Generation with Discourse-Level Planning and Aesthetics Features

      Yufei Tian and Nanyun Peng, in 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
      Full Text Code BibTeX Details
      @inproceedings{tian2022sonnet,
        title = {Zero-Shot Sonnet Generation with Discourse-Level Planning and Aesthetics Features},
        author = {Tian, Yufei and Peng, Nanyun},
        booktitle = {2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2022}
      }
      
      Details
    15. Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction

      Kuan-Hao Huang*, I.-Hung Hsu*, Premkumar Natarajan, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
      Full Text Slides Poster Code Abstract BibTeX Details
      We present a study on leveraging multilingual pre-trained generative language models for zero-shot cross-lingual event argument extraction (EAE). By formulating EAE as a language generation task, our method effectively encodes event structures and captures the dependencies between arguments. We design language-agnostic templates to represent the event argument structures, which are compatible with any language, hence facilitating the cross-lingual transfer. Our proposed model finetunes multilingual pre-trained generative language models to generate sentences that fill in the language-agnostic template with arguments extracted from the input passage. The model is trained on source languages and is then directly applied to target languages for event argument extraction. Experiments demonstrate that the proposed model outperforms the current state-of-the-art models on zero-shot cross-lingual EAE. Comprehensive studies and error analyses are presented to better understand the advantages and the current limitations of using generative language models for zero-shot cross-lingual transfer EAE.
      @inproceedings{huang2022multilingual,
        title = {Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction},
        author = {Huang*, Kuan-Hao and Hsu*, I-Hung and Natarajan, Premkumar and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2022}
      }
      
      Details
    16. Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts

      Rujun Han, Hong Chen, Yufei Tian, and Nanyun Peng, in 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
      Full Text Code BibTeX Details
      @inproceedings{han2022go,
        title = {Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts},
        author = {Han, Rujun and Chen, Hong and Tian, Yufei and Peng, Nanyun},
        booktitle = {2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2022}
      }
      
      Details
    17. FOAM: A Follower-aware Speaker Model for Vision-and-Language Navigation

      Zi-Yi Dou and Nanyun Peng, in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short, 2022.
      Full Text Code BibTeX Details
      @inproceedings{dou2022foam,
        title = {FOAM: A Follower-aware Speaker Model for Vision-and-Language Navigation},
        author = {Dou, Zi-Yi and Peng, Nanyun},
        booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short},
        year = {2022}
      }
      
      Details
    18. AmbiPun: Generating Humorous Puns with Ambiguous Context

      Anirudh Mittal, Yufei Tian, and Nanyun Peng, in 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short, 2022.
      Full Text Code BibTeX Details
      @inproceedings{Mittal2022ambipun,
        title = {AmbiPun: Generating Humorous Puns with Ambiguous Context},
        author = {Mittal, Anirudh and Tian, Yufei and Peng, Nanyun},
        booktitle = {2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short},
        year = {2022}
      }
      
      Details
    19. Socially Aware Bias Measurements for Hindi Language Representations

      Vijit Malik, Sunipa Dev, Akihiro Nishi, Nanyun Peng, and Kai-Wei Chang, in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short, 2022.
      Full Text BibTeX Details
      @inproceedings{malik2022socially,
        title = {Socially Aware Bias Measurements for Hindi Language Representations},
        author = {Malik, Vijit and Dev, Sunipa and Nishi, Akihiro and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), short},
        year = {2022}
      }
      
      Details
    20. An Empirical Study of Training End-to-End Vision-and-Language Transformers

      Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng, in The Conference on Computer Vision and Pattern Recognition (CVPR-22), 2022.
      Full Text Code Abstract BibTeX Details
      Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, our best model achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%.
      @inproceedings{dou2022meter,
        title = {An Empirical Study of Training End-to-End Vision-and-Language Transformers},
        author = {Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Zhang, Pengchuan and Yuan, Lu and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
        booktitle = {The Conference on Computer Vision and Pattern Recognition (CVPR-22)},
        year = {2022}
      }
      
      Details
    21. DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

      Sarik Ghazarian, Nuan Wen, Aram Galstyan, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
      Full Text Abstract BibTeX Details
      Automatic evaluation metrics are essential for the rapid development of open-domain dialogue systems as they facilitate hyper-parameter tuning and comparison between models. Although recently proposed trainable conversation-level metrics have shown encouraging results, the quality of the metrics is strongly dependent on the quality of training data. Prior works mainly resort to heuristic text-level manipulations (e.g. utterances shuffling) to bootstrap incoherent conversations (negative examples) from coherent dialogues (positive examples). Such approaches are insufficient to appropriately reflect the incoherence that occurs in interactions between advanced dialogue models and humans. To tackle this problem, we propose DEAM, a Dialogue coherence Evaluation metric that relies on Abstract Meaning Representation (AMR) to apply semantic-level Manipulations for incoherent (negative) data generation. AMRs naturally facilitate the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decrease engagement, at the semantic level, thus resulting in more natural incoherent samples. Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods on several dialog datasets by significant margins. We also show that DEAM can distinguish between coherent and incoherent dialogues generated by baseline manipulations, whereas those baseline models cannot detect incoherent examples generated by DEAM. Our results demonstrate the potential of AMR-based semantic manipulations for natural negative example generation.
      @inproceedings{ghazarian2022deam,
        title = {DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations},
        author = {Ghazarian, Sarik and Wen, Nuan and Galstyan, Aram and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2022}
      }
      
      Details
    22. DEGREE: A Data-Efficient Generative Event Extraction Model

      I.-Hung Hsu*, Kuan-Hao Huang*, Elizabeth Boschee, Scott Miller, Premkumar Natarajan, Kai-Wei Chang, and Nanyun Peng, in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2022.
      Full Text Slides Video Code Abstract BibTeX Details
      Event extraction requires high-quality expert human annotations, which are usually expensive. Therefore, learning a data-efficient event extraction model that can be trained with only a few labeled examples has become a crucial challenge. In this paper, we focus on low-resource end-to-end event extraction and propose DEGREE, a data-efficient model that formulates event extraction as a conditional generation problem. Given a passage and a manually designed prompt, DEGREE learns to summarize the events mentioned in the passage into a natural sentence that follows a predefined pattern. The final event predictions are then extracted from the generated sentence with a deterministic algorithm. DEGREE has three advantages that allow it to learn well with less training data. First, our designed prompts provide semantic guidance that helps DEGREE better capture the event arguments. Moreover, DEGREE is capable of using additional weakly-supervised information, such as the description of events encoded in the prompts. Finally, DEGREE learns triggers and arguments jointly in an end-to-end manner, which encourages the model to better utilize the shared knowledge and dependencies among them. Our experimental results demonstrate the strong performance of DEGREE for low-resource event extraction.
      @inproceedings{hsu2022degree,
        title = {DEGREE: A Data-Efficient Generative Event Extraction Model},
        author = {Hsu*, I-Hung and Huang*, Kuan-Hao and Boschee, Elizabeth and Miller, Scott and Natarajan, Premkumar and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
        year = {2022}
      }
      
      Details
    23. Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

      Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
      Full Text Abstract BibTeX Details
      The ability to sequence unordered events is evidence of comprehension and reasoning about real world tasks/procedures, and is essential for applications such as task planning and multi-source instruction summarization. It often requires thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of texts and images. While humans are capable of reasoning about and sequencing unordered procedural instructions,  the extent to which the current machine learning methods possess such a capability is still an open question. In this work, we benchmark models’ capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing  multimodal information. To improve machines’ performance on multimodal event sequencing, we propose sequence-aware pretraining techniques exploiting the sequential alignment properties of both texts and images, resulting in >5% improvements on perfect match ratio.
      @inproceedings{wu2022procedural,
        title = {Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals},
        author = {Wu, Te-Lin and Spangher, Alex and Alipoormolabashi, Pegah and Freedman, Marjorie and Weischedel, Ralph and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2022}
      }
      
      Details
    24. Fantastic Questions and Where to Find Them: FairytaleQA–An Authentic Dataset for Narrative Comprehension

      Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, Toby Jia-Jun Li, Nora Bradford, Branda Sun, Tran Hoang, Yisi Sang, Yufang Hou, Xiaojuan Ma, Diyi Yang, Nanyun Peng, Zhou Yu, and Mark Warschauer, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
      BibTeX Details
      @inproceedings{xu2022fairy,
        title = {Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension},
        author = {Xu, Ying and Wang, Dakuo and Yu, Mo and Ritchie, Daniel and Yao, Bingsheng and Wu, Tongshuang and Zhang, Zheng and Li, Toby Jia-Jun and Bradford, Nora and Sun, Branda and Hoang, Tran and Sang, Yisi and Hou, Yufang and Ma, Xiaojuan and Yang, Diyi and Peng, Nanyun and Yu, Zhou and Warschauer, Mark},
        booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2022}
      }
      
      Details
    25. Sibylvariant Transformations for Robust Text Classification

      Fabrice Y. Harel-Canada, Muhammad Ali Gulzar, Nanyun Peng, and Miryung Kim, in Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings), 2022.
      BibTeX Details
      @inproceedings{harel-canada2022sibyl,
        title = {Sibylvariant Transformations for Robust Text Classification},
        author = {Harel-Canada, Fabrice Y and Gulzar, Muhammad Ali and Peng, Nanyun and Kim, Miryung},
        booktitle = {Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings)},
        year = {2022}
      }
      
      Details
    26. On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark

      Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang, in Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings), 2022.
      Full Text Abstract BibTeX Details
      Dialogue safety problems severely limit the real-world deployment of neural conversational models and have attracted great research interests recently. However, dialogue safety problems remain under-defined and the corresponding dataset is scarce. We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors in human-bot dialogue settings, with focuses on context-sensitive unsafety, which is under-explored in prior works. To spur research in this direction, we compile DiaSafety, a dataset with rich context-sensitive unsafe examples. Experiments show that existing safety guarding tools fail severely on our dataset. As a remedy, we train a dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. With our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems still exhibit concerning context-sensitive safety problems.
      @inproceedings{sun2022safe,
        title = {On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark},
        author = {Sun, Hao and Xu, Guangxuan and Deng, Jiawen and Cheng, Jiale and Zheng, Chujie and Zhou, Hao and Peng, Nanyun and Zhu, Xiaoyan and Huang, Minlie},
        booktitle = {Findings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL-findings)},
        year = {2022}
      }
      
      Details
    27. Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization

      Zi-Yi Dou and Nanyun Peng, in The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
      Full Text Code Abstract BibTeX Details
      Commonsense question answering (CQA) aims to test if models can answer questions regarding commonsense knowledge that everyone knows. Prior works that incorporate external knowledge bases have shown promising results, but knowledge bases are expensive to construct and are often limited to a fixed set of relations. In this paper, we instead focus on better utilizing the implicit knowledge stored in pre-trained language models. While researchers have found that the knowledge embedded in pre-trained language models can be extracted by having them fill in the blanks of carefully designed prompts for relation extraction and text classification, it remains unclear if we can adopt this paradigm in CQA where the inputs and outputs take much more flexible forms. To this end, we investigate four translation methods that can translate natural questions into cloze-style sentences to better solicit commonsense knowledge from language models, including a syntactic-based model, an unsupervised neural model, and two supervised neural models. In addition, to combine the different translation methods, we propose to encourage consistency among model predictions on different translated questions with unlabeled data. We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance. Analyses also reveal distinct characteristics of the different cloze translation methods and provide insights on why combining them can lead to great improvements.
      @inproceedings{dou2022improving,
        title = {Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization},
        author = {Dou, Zi-Yi and Peng, Nanyun},
        booktitle = {The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI)},
        year = {2022}
      }
      
      Details
    28. Discourse-level Relation Extraction via Graph Pooling

      I.-Hung Hsu, Xiao Guo, Premkumar Natarajan, and Nanyun Peng, in The Thirty-Sixth AAAI Conference On Artificial Intelligence Workshop on Deep Learning on Graphs: Method and Applications (DLG-AAAI), 2022.
      BibTeX Details 🏆 Best Paper Award
      @inproceedings{hsu2021discourse,
        title = {Discourse-level Relation Extraction via Graph Pooling},
        author = {Hsu, I-Hung and Guo, Xiao and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {The Thirty-Sixth AAAI Conference On Artificial Intelligence Workshop on Deep Learning on Graphs: Method and Applications (DLG-AAAI)},
        year = {2022}
      }
      
      Details

    2021

    1. Document-level Entity-based Extraction as Template Generation

      Kung-Hsiang Huang, Sam Tang, and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Code Abstract BibTeX Details
      Document-level entity-based extraction (EE), aiming at extracting entity-centric information such as entity roles and entity relations, is key to automatic knowledge acquisition from text corpora for various domains. Most document-level EE systems build extractive models, which struggle to model long-term dependencies among entities at the document level. To address this issue, we propose a generative framework for two document-level EE tasks: role-filler entity extraction (REE) and relation extraction (RE). We first formulate them as a template generation problem, allowing models to efficiently capture cross-entity dependencies, exploit label semantics, and avoid the exponential computation complexity of identifying N-ary relations. A novel cross-attention guided copy mechanism, TopK Copy, is incorporated into a pre-trained sequence-to-sequence model to enhance the capabilities of identifying key information in the input document. Experiments done on the MUC-4 and SciREX dataset show new state-of-the-art results on REE (+3.26%), binary RE (+4.8%), and 4-ary RE (+2.7%) in F1 score.
      @inproceedings{huang2021tempgen,
        title = {Document-level Entity-based Extraction as Template Generation},
        author = {Huang, Kung-Hsiang and Tang, Sam and Peng, Nanyun},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
    2. AESOP: Paraphrase Generation with Adaptive Syntactic Control

      Jiao Sun, Xuezhe Ma, and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Code Abstract BibTeX Details
      We propose to control paraphrase generation through carefully chosen target syntactic structures to generate more proper and higher quality paraphrases. Our model, AESOP, leverages a pretrained language model and adds deliberately chosen syntactical control via a retrieval-based selection module to generate fluent paraphrases. Experiments show that AESOP achieves state-of-the-art performances on semantic preservation and syntactic conformation on two benchmark datasets with ground-truth syntactic control from human-annotated exemplars. Moreover, with the retrieval-based target syntax selection module, AESOP generates paraphrases with even better qualities than the current best model using human-annotated target syntactic parses according to human evaluation. We further demonstrate the effectiveness of AESOP to improve classification models’ robustness to syntactic perturbation by data augmentation on two GLUE tasks.
      @inproceedings{sun2021aesop,
        title = {AESOP: Paraphrase Generation with Adaptive Syntactic Control},
        author = {Sun, Jiao and Ma, Xuezhe and Peng, Nanyun},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
    3. ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning

      Rujun Han, I.-Hung Hsu, Jiao Sun, Julia Baylon, Qiang Ning, Dan Roth, and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Code Abstract BibTeX Details
      Understanding how events are semantically related to each other is the essence of reading comprehension. Recent event-centric reading comprehension datasets focus mostly on event arguments or temporal relations. While these tasks partially evaluate machines’ ability of narrative understanding, human-like reading comprehension requires the capability to process event-based information beyond arguments and temporal reasoning. For example, to understand causality between events, we need to infer motivation or purpose; to establish event hierarchy, we need to understand the composition of events. To facilitate these tasks, we introduce ESTER, a comprehensive machine reading comprehension (MRC) dataset for Event Semantic Relation Reasoning. The dataset leverages natural language queries to reason about the five most common event semantic relations, provides more than 6K questions, and captures 10.1K event relation pairs. Experimental results show that the current SOTA systems achieve 22.1%, 63.3% and 83.5% for token-based exact-match (EM), F1 and event-based HIT@1 scores, which are all significantly below human performances (36.0%, 79.6%, 100% respectively), highlighting our dataset as a challenging benchmark.
      @inproceedings{han2021ester,
        title = {ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning},
        author = {Han, Rujun and Hsu, I-Hung and Sun, Jiao and Baylon, Julia and Ning, Qiang and Roth, Dan and Peng, Nanyun},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
    4. ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning

      Rujun Han, Xiang Ren, and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Code Abstract BibTeX Details
      While pre-trained language models (PTLMs) have achieved noticeable success on many NLP tasks, they still struggle for tasks that require event temporal reasoning, which is essential for event-centric applications. We present a continual pre-training approach that equips PTLMs with targeted knowledge about event temporal relations. We design self-supervised learning objectives to recover masked-out event and temporal indicators and to discriminate sentences from their corrupted counterparts (where event or temporal indicators got replaced). By further pre-training a PTLM with these objectives jointly, we reinforce its attention to event and temporal information, yielding enhanced capability on event temporal reasoning. This Effective CONtinual pre-training framework for Event Temporal reasoning (ECONET) improves the PTLMs’ fine-tuning performances across five relation extraction and question answering tasks and achieves new or on-par state-of-the-art performances in most of our downstream tasks.
      @inproceedings{han2021econet,
        title = {ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning},
        author = {Han, Rujun and Ren, Xiang and Peng, Nanyun},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
    5. Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding

      Zi-Yi Dou and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2021.
      Full Text Code Abstract BibTeX Details
      Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks requiring identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear if we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and propose four fine-tuning objectives to improve the model phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.
      @inproceedings{dou2021improving,
        title = {Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding},
        author = {Dou, Zi-Yi and Peng, Nanyun},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
        year = {2021}
      }
      
      Details
    6. Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training

      Kuan-Hao Huang, Wasi Uddin Ahmad, Nanyun Peng, and Kai-Wei Chang, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Code Abstract BibTeX Details
      Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. Especially, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to be obtained for low-resource languages. An alternative is to make the multilingual encoders more robust; when fine-tuning the encoder using downstream task, we train the encoder to tolerate noise in the contextual embedding spaces such that even if the representations of different languages are not aligned well, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the pair of input sentences belong to two different languages.
      @inproceedings{huang2021improving,
        title = {Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training},
        author = {Huang, Kuan-Hao and Ahmad, Wasi Uddin and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
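
      The following is a minimal, hypothetical sketch of the noise-tolerant fine-tuning idea described in the abstract above; it is not the authors' released implementation. It assumes a HuggingFace-style multilingual encoder whose output exposes last_hidden_state, and the NoisyClassifier name and sigma parameter are illustrative choices rather than details from the paper.

      import torch
      import torch.nn as nn

      class NoisyClassifier(nn.Module):
          """Classification head that perturbs the pooled sentence embedding with
          Gaussian noise during training, so the model learns to tolerate small
          misalignments in the multilingual embedding space (illustrative only)."""

          def __init__(self, encoder, hidden_size, num_labels, sigma=0.1):
              super().__init__()
              self.encoder = encoder        # e.g., a multilingual BERT / XLM-R encoder
              self.head = nn.Linear(hidden_size, num_labels)
              self.sigma = sigma            # scale of the injected Gaussian noise

          def forward(self, input_ids, attention_mask):
              hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
              pooled = hidden[:, 0]         # [CLS]-token representation
              if self.training:
                  # Randomized-smoothing-style perturbation of the contextual embedding.
                  pooled = pooled + self.sigma * torch.randn_like(pooled)
              return self.head(pooled)
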
    7. Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

      Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
      Full Text Video Code Abstract BibTeX Details
      Commonsense is defined as the knowledge on which everyone agrees. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenes of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT, trained on VCR, a standard benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for Western regions. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition.
      @inproceedings{yin2021broaden,
        title = {Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning},
        author = {Yin, Da and Li, Liunian Harold and Hu, Ziniu and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2021}
      }
      
      Details
    8. HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge

      Yufei Tian, Arvind Krishna Sridhar, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP, 2021.
      Full Text Video Code Abstract BibTeX Details
      A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, the computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task: sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. We then leverage commonsense and counterfactual inference to generate hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles creatively with a high success rate and intensity.
      @inproceedings{tian2021hypogen,
        title = {HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge},
        author = {Tian, Yufei and Sridhar, Arvind Krishna and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP},
        year = {2021}
      }
      
      Details
    9. HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning

      Mingyu Derek Ma, Muhao Chen, Te-Lin Wu, and Nanyun Peng, in Findings of the Association for Computational Linguistics: EMNLP, 2021.
      Full Text Slides Video Code Abstract BibTeX Details
      Taxonomies are valuable resources for many applications, but the limited coverage due to the expensive manual curation process hinders their general applicability. Prior works attempt to automatically expand existing taxonomies to improve their coverage by learning concept embeddings in Euclidean space, while taxonomies, inherently hierarchical, more naturally align with the geometric properties of a hyperbolic space. In this paper, we present HyperExpan, a taxonomy expansion algorithm that seeks to preserve the structure of a taxonomy in a more expressive hyperbolic embedding space and learn to represent concepts and their relations with a Hyperbolic Graph Neural Network (HGNN). Specifically, HyperExpan leverages position embeddings to exploit the structure of the existing taxonomies, and characterizes the concept profile information to support the inference on unseen concepts during training. Experiments show that our proposed HyperExpan outperforms baseline models with representation learning in a Euclidean feature space and achieves state-of-the-art performance on the taxonomy expansion benchmarks.
      @inproceedings{ma2021hyperexpan,
        title = {HyperExpan: Taxonomy Expansion with Hyperbolic Representation Learning},
        author = {Ma, Mingyu Derek and Chen, Muhao and Wu, Te-Lin and Peng, Nanyun},
        booktitle = {Findings of the Association for Computational Linguistics: EMNLP},
        year = {2021}
      }
      
      Details
    10. Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia

      Jiao Sun and Nanyun Peng, in Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
      Full Text Code Abstract BibTeX Details 🏆 Best Paper Nomination
      Human activities can be seen as sequences of events, which are crucial to understanding societies. Disproportional event distribution for different demographic groups can manifest and amplify social stereotypes, and potentially jeopardize the ability of members in some groups to pursue certain goals. In this paper, we present the first event-centric study of gender biases in a Wikipedia corpus. To facilitate the study, we curate a corpus of career and personal life descriptions with demographic information consisting of 7,854 fragments from 10,412 celebrities. Then we detect events with a state-of-the-art event detection model, calibrate the results using strategically generated templates, and extract events that have asymmetric associations with genders. Our study discovers that Wikipedia pages tend to intermingle personal life events with professional events for females but not for males, which calls for the awareness of the Wikipedia community to formalize guidelines and train the editors to mind the implicit biases that contributors carry. Our work also lays the foundation for future works on quantifying and discovering event biases at the corpus level.
      @inproceedings{sun2021men,
        title = {Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia},
        author = {Sun, Jiao and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2021}
      }
      
      Details
    11. Societal Biases in Language Generation: Progress and Challenges

      Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
      Full Text Abstract BibTeX Details
      Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.
      @inproceedings{sheng2021societal,
        title = {Societal Biases in Language Generation: Progress and Challenges},
        author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2021}
      }
      
      Details
    12. Metaphor Generation with Conceptual Mappings

      Kevin Stowe, Tuhin Chakrabarty, Nanyun Peng, Smaranda Muresan, and Iryna Gurevych, in Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
      Full Text Code Abstract BibTeX Details
      Generating metaphors is a difficult task as it requires understanding nuanced relationships between abstract concepts. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Guided by conceptual metaphor theory, we propose to control the generation process by encoding conceptual mappings between cognitive domains to generate meaningful metaphoric expressions. To achieve this, we develop two methods: 1) using FrameNet-based embeddings to learn mappings between domains and applying them at the lexical level (CM-Lex), and 2) deriving source/target pairs to train a controlled seq-to-seq generation model (CM-BART). We assess our methods through automatic and human evaluation for basic metaphoricity and conceptual metaphor presence. We show that the unsupervised CM-Lex model is competitive with recent deep learning metaphor generation systems, and CM-BART outperforms all other models both in automatic and human evaluations.
      @inproceedings{stowe2021metaphor,
        title = {Metaphor Generation with Conceptual Mappings},
        author = {Stowe, Kevin and Chakrabarty, Tuhin and Peng, Nanyun and Muresan, Smaranda and Gurevych, Iryna},
        booktitle = {Proceedings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2021}
      }
      
      Details
    13. COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

      Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-lin Wu, Xuezhe Ma, and Nanyun Peng, in Proceedings of Findings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-Findings), 2021.
      Full Text Code Abstract BibTeX Details
      Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI). Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets. However, the reliability and comprehensiveness of these benchmarks towards assessing a model’s commonsense reasoning ability remain unclear. To this end, we introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs. We propose a pairwise accuracy metric to reliably measure an agent’s ability to perform commonsense reasoning over a given situation. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, we design our dataset along the dimensions of knowledge domains, reasoning scenarios and numeracy. Experimental results demonstrate that our strongest baseline (UnifiedQA-3B), after fine-tuning, achieves ~71% standard accuracy and ~51% pairwise accuracy, well below human performance (~95% for both metrics).
      @inproceedings{sw2021com,
        title = {COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences},
        author = {Singh, Shikhar and Wen, Nuan and Hou, Yu and Alipoormolabashi, Pegah and Wu, Te-lin and Ma, Xuezhe and Peng, Nanyun},
        booktitle = {Proceedings of Findings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-Findings)},
        year = {2021}
      }
      
      Details
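
      As a rough illustration of the pairwise accuracy metric mentioned in the COM2SENSE abstract above (my reading of the metric, not the authors' evaluation script): a complementary pair only counts as correct when both of its statements are judged correctly, which makes the metric stricter than standard accuracy.

      def standard_accuracy(preds, golds):
          # Fraction of individual statements labeled correctly.
          return sum(p == g for p, g in zip(preds, golds)) / len(golds)

      def pairwise_accuracy(pair_preds, pair_golds):
          # Fraction of complementary pairs where BOTH statements are labeled correctly.
          return sum(p == g for p, g in zip(pair_preds, pair_golds)) / len(pair_golds)

      # Toy example: two complementary pairs; the second pair has one wrong prediction.
      gold = [(True, False), (False, True)]
      pred = [(True, False), (False, False)]
      print(standard_accuracy([x for pair in pred for x in pair],
                              [x for pair in gold for x in pair]))  # 0.75
      print(pairwise_accuracy(pred, gold))                          # 0.5
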
    14. "Nice Try, Kiddo": Ad Hominems in Dialogue Systems

      Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
      Full Text Video Code Abstract BibTeX Details
      Ad hominem attacks are those that attack some feature of a person’s character instead of the position the person is maintaining. As a form of toxic and abusive language, ad hominems contain harmful language that could further amplify the skew of power inequality for marginalized populations. Since dialogue systems are designed to respond directly to user input, it is important to study ad hominems in these system responses. In this work, we propose categories of ad hominems that allow us to analyze human and dialogue system responses to Twitter posts. We specifically compare responses to Twitter posts about marginalized communities (#BlackLivesMatter, #MeToo) and other topics (#Vegan, #WFH). Furthermore, we propose a constrained decoding technique that uses salient n-gram similarity to apply soft constraints to top-k sampling and can decrease the amount of ad hominems generated by dialogue systems. Our results indicate that 1) responses composed by both humans and DialoGPT contain more ad hominems for discussions around marginalized communities versus other topics, 2) different amounts of ad hominems in the training data can influence the likelihood of the model generating ad hominems, and 3) we can thus carefully choose training data and use constrained decoding techniques to decrease the amount of ad hominems generated by dialogue systems.
      @inproceedings{sheng2021nice,
        title = {"Nice Try, Kiddo": Ad Hominems in Dialogue Systems},
        author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        publisher = {Association for Computational Linguistics},
        pages = {750--767},
        year = {2021}
      }
      
      Details
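
      Below is a hypothetical sketch of how a salient-n-gram soft constraint could be folded into top-k sampling, in the spirit of the constrained decoding described in the entry above; the n-gram list, penalty weight, and function names are made up for illustration and are not taken from the paper's code.

      import math
      import random

      def bigrams(tokens):
          return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

      def constrained_topk_step(topk_candidates, context, salient_ngrams, penalty=2.0):
          """topk_candidates: list of (token, probability) from a language model.
          Candidates whose continuation overlaps with flagged (e.g., ad hominem)
          bigrams are softly down-weighted before sampling."""
          reweighted = []
          for token, prob in topk_candidates:
              overlap = len(bigrams(context + [token]) & salient_ngrams)
              reweighted.append((token, prob * math.exp(-penalty * overlap)))
          total = sum(p for _, p in reweighted)
          draw, cumulative = random.random() * total, 0.0
          for token, prob in reweighted:
              cumulative += prob
              if cumulative >= draw:
                  return token
          return reweighted[-1][0]

      salient = {("nice", "try"), ("you", "idiot")}
      print(constrained_topk_step([("try", 0.5), ("point", 0.5)], ["nice"], salient))
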
    15. Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

      Sarik Ghazarian, Zixi Liu, Akash S. M, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
      Full Text Slides Code Abstract BibTeX Details
      With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck of obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulating plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintaining the naturalness of the generation. To improve the quality of incoherent stories, we further apply an adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.
      @inproceedings{ghazarian2021plot,
        title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
        author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
        booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        publisher = {Association for Computational Linguistics},
        pages = {4334--4344},
        year = {2021}
      }
      
      Details
    16. MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding

      Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
      Full Text Poster Code Abstract BibTeX Details
      Generating metaphors is a challenging task as it requires a proper understanding of abstract concepts, making connections between unrelated concepts, and deviating from the literal meaning. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus by transforming a large number of metaphorical sentences from the Gutenberg Poetry corpus to their literal counterpart using recent advances in masked language modeling coupled with commonsense inference. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data to generate high-quality metaphors. Human evaluation on an independent test set of literal statements shows that our best model generates metaphors better than three well-crafted baselines 66% of the time on average. A task-based evaluation shows that human-written poems enhanced with metaphors proposed by our model are preferred 68% of the time compared to poems without metaphors.
      @inproceedings{chakrabarty2021mermaid,
        title = {MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding},
        author = {Chakrabarty, Tuhin and Zhang, Xurui and Muresan, Smaranda and Peng, Nanyun},
        booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        talk_url = {https://underline.io/events/122/sessions/4240/lecture/19642-mermaid-metaphor-generation-with-symbolism-and-discriminative-decoding},
        year = {2021}
      }
      
      Details
    17. DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation

      Sarik Ghazarian, Zixi Liu, Tuhin Chakrabarty, Xuezhe Ma, Aram Galstyan, and Nanyun Peng, in 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track, 2021.
      Full Text Code Abstract BibTeX Details
      Having engaging and informative conversations with users is the utmost goal for open-domain conversational systems. Recent advances in transformer-based language models and their applications to dialogue systems have succeeded in generating fluent and human-like responses. However, they still lack control over the generation process towards producing contentful responses and achieving engaging conversations. To achieve this goal, we present DiSCoL (Dialogue Systems through Conversational Line guided response generation). DiSCoL is an open-domain dialogue system that leverages conversational lines (briefly, convlines) as controllable and informative content-planning elements to guide the generation model to produce engaging and informative responses. Two primary modules in DiSCoL’s pipeline are conditional generators trained for 1) predicting relevant and informative convlines for dialogue contexts and 2) generating high-quality responses conditioned on the predicted convlines. Users can also change the returned convlines to control the direction of the conversations towards topics that are more interesting to them. Through automatic and human evaluations, we demonstrate the efficiency of the convlines in producing engaging conversations.
      @inproceedings{ghazarian2021discol,
        title = {DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation},
        author = {Ghazarian, Sarik and Liu, Zixi and Chakrabarty, Tuhin and Ma, Xuezhe and Galstyan, Aram and Peng, Nanyun},
        booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
        pages = {26--34},
        publisher = {Association for Computational Linguistics},
        year = {2021}
      }
      
      Details
    18. EventPlus: A Temporal Event Understanding Pipeline

      Mingyu Derek Ma, Jiao Sun, Mu Yang, Kung-Hsiang Huang, Nuan Wen, Shikhar Singh, Rujun Han, and Nanyun Peng, in 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track, 2021.
      Full Text Slides Poster Video Code Abstract BibTeX Details
      We present EventPlus, a temporal event understanding pipeline that integrates various state-of-the-art event understanding components including event trigger and type detection, event argument detection, event duration and temporal relation extraction. Event information, especially event temporal knowledge, is a type of common sense knowledge that helps people understand how stories evolve and provides predictive hints for future events. EventPlus, as the first comprehensive temporal event understanding pipeline, provides a convenient tool for users to quickly obtain annotations about events and their temporal information for any user-provided document. Furthermore, we show EventPlus can be easily adapted to other domains (e.g., the biomedical domain). We make EventPlus publicly available to facilitate event-related information extraction and downstream applications.
      @inproceedings{ma2021eventplus,
        title = {EventPlus: A Temporal Event Understanding Pipeline},
        author = {Ma, Mingyu Derek and Sun, Jiao and Yang, Mu and Huang, Kung-Hsiang and Wen, Nuan and Singh, Shikhar and Han, Rujun and Peng, Nanyun},
        booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
        year = {2021}
      }
      
      Details
    19. Identifying Distributional Perspective Differences from Colingual Groups

      Yufei Tian, Tuhin Chakrabarty, Fred Morstatter, and Nanyun Peng, in NAACL 2021 Workshop of Social NLP, 2021.
      Full Text Code Abstract BibTeX Details
      Perspective differences exist among different cultures or languages. A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions. Automatically understanding group perspectives can provide essential background for many downstream applications of natural language processing techniques. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learning shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held-out set of diverse topics including marriage, corruption, and democracy, our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.
      @inproceedings{tian2021identifying,
        title = {Identifying Distributional Perspective Differences from Colingual Groups},
        author = {Tian, Yufei and Chakrabarty, Tuhin and Morstatter, Fred and Peng, Nanyun},
        booktitle = {NAACL 2021 Workshop of Social NLP},
        year = {2021}
      }
      
      Details
    20. Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies

      Kung-Hsiang Huang and Nanyun Peng, in The 3rd Workshop on Narrative Understanding (NAACL 2021), 2021.
      Full Text Abstract BibTeX Details
      Fully understanding narratives often requires identifying events in the context of whole documents and modeling the event relations. However, document-level event extraction is a challenging task, as it requires the extraction of event and entity coreference and capturing arguments that span across different sentences. Existing works on event extraction usually confine themselves to extracting events from single sentences, which fails to capture the relationships between event mentions at the scale of a document, as well as event arguments that appear in a different sentence than the event trigger. In this paper, we propose an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction. Experimental results show that our approach achieves comparable performance to CRF-based models on ACE05, while enjoying significantly higher computational efficiency.
      @inproceedings{huang2021document,
        title = {Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies},
        author = {Huang, Kung-Hsiang and Peng, Nanyun},
        booktitle = {The 3rd Workshop on Narrative Understanding (NAACL 2021)},
        year = {2021}
      }
      
      Details
    21. Discourse Tagging for Scientific Evidence Extraction

      Xiangci Li, Gully Burns, and Nanyun Peng, in The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021.
      Full Text Code Abstract BibTeX Details
      Evidence plays a crucial role in any biomedical research narrative, providing justification for some claims and refutation for others. We seek to build models of scientific argument using information extraction methods applied to full-text papers. We present the capability of automatically extracting text fragments from primary research papers that describe the evidence presented in that paper’s figures, which arguably provides the raw material of any scientific argument made within the paper. We apply richly contextualized deep representation learning, pre-trained on a biomedical domain corpus, to the analysis of scientific discourse structures and the extraction of "evidence fragments" (i.e., the text in the results section describing data presented in a specified subfigure) from a set of biomedical experimental research articles. We first demonstrate our state-of-the-art scientific discourse tagger on two scientific discourse tagging datasets and its transferability to new datasets. We then show the benefit of leveraging scientific discourse tags for downstream tasks such as claim extraction and evidence fragment detection. Our work demonstrates the potential of using evidence fragments derived from figure spans for improving the quality of scientific claims by cataloging, indexing and reusing evidence fragments as independent documents.
      @inproceedings{li2021discourse,
        title = {Discourse Tagging for Scientific Evidence Extraction},
        author = {Li, Xiangci and Burns, Gully and Peng, Nanyun},
        booktitle = {The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
        year = {2021}
      }
      
      Details
    22. MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification

      Te-Lin Wu, Shikhar Singh, Sayan Paul, Gully Burns, and Nanyun Peng, in The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
      Full Text Code Abstract BibTeX Details
      We introduce a new dataset, MELINDA, for Multimodal Biomedical Experiment Method Classification. The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database, and the actual contents are extracted from papers associated with each of the records in the database. We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs, and multimodal models. Our extensive experimental results show that multimodal models, despite outperforming other benchmarked models, require certain improvements, especially a less-supervised way of grounding visual concepts with languages, and better transfer learning for low-resource tasks. We release our dataset and the benchmarks to facilitate future research in multimodal learning, especially to motivate targeted improvements for applications in scientific domains.
      @inproceedings{wu2021melinda,
        title = {MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification},
        author = {Wu, Te-Lin and Singh, Shikhar and Paul, Sayan and Burns, Gully and Peng, Nanyun},
        booktitle = {The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)},
        year = {2021}
      }
      
      Details
    23. GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

      Wasi Ahmad, Nanyun Peng, and Kai-Wei Chang, in The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
      Full Text Code Abstract BibTeX Details
      Prevalent approaches in cross-lingual relation and event extraction use graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic representations such that models trained on one language can be applied to other languages. However, GCNs fall short in modeling long-range dependencies or disconnected words in the dependency tree. To address this challenge, we propose to utilize the self-attention mechanism where we explicitly fuse structural information to learn the dependencies between words at different syntactic distances. We introduce GATE, a Graph Attention Transformer Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform rigorous experiments on the widely used ACE05 dataset that includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that due to the reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.
      @inproceedings{ahmad2021gate,
        author = {Ahmad, Wasi and Peng, Nanyun and Chang, Kai-Wei},
        title = {GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction},
        booktitle = {The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)},
        year = {2021}
      }
      
      Details
    24. A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification

      Xiangci Li, Gully Burns, and Nanyun Peng, in Scientific Document Understanding Workshop at the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
      Full Text Code Abstract BibTeX Details
      Even for domain experts, it is a non-trivial task to verify a scientific claim by providing supporting or refuting evidence rationales. The situation worsens as misinformation proliferates on social media and news websites, manually or programmatically, at every moment. As a result, an automatic fact-verification tool becomes crucial for combating the spread of misinformation. In this work, we propose a novel, paragraph-level, multi-task learning model for the SciFact task by directly computing a sequence of contextualized sentence embeddings from a BERT model and jointly training the model on rationale selection and stance prediction.
      @inproceedings{li2021paragraph,
        title = {A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification},
        author = {Li, Xiangci and Burns, Gully and Peng, Nanyun},
        booktitle = {Scientific Document Understanding Workshop at the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)},
        year = {2021}
      }
      
      Details

    2020

    1. Content Planning for Neural Story Generation with Aristotelian Rescoring

      Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
      Full Text Slides Code Abstract BibTeX Details
      Long-form narrative text generated from large language models manages a fluent impersonation of human writing, but only at the local sentence level, and lacks structure or global cohesion. We posit that many of the problems of story generation can be addressed via high-quality content planning, and present a system that focuses on how to learn good plot structures to guide story generation. We utilize a plot-generation language model along with an ensemble of rescoring models that each implement an aspect of good story-writing as detailed in Aristotle’s Poetics. We find that stories written with our more principled plot structure are both more relevant to a given prompt and higher quality than baselines that do not content plan, or that plan in an unprincipled way.
      @inproceedings{goldfarb2020content,
        title = {Content Planning for Neural Story Generation with Aristotelian Rescoring},
        author = {Goldfarb-Tarrant, Seraphina and Chakrabarty, Tuhin and Weischedel, Ralph and Peng, Nanyun},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        pages = {4319--4338},
        slideslive_id = {38939240},
        year = {2020}
      }
      
      Details
    2. Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation

      Tuhin Chakrabarty, Smaranda Muresan, and Nanyun Peng, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
      Full Text Slides Code Abstract BibTeX Details
      Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language, such as a simile, goes beyond plain expressions to give readers new insights and inspirations. We tackle the problem of simile generation. Generating a simile requires proper understanding for effective mapping of properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit to their literal counterpart using structured common sense knowledge. We then fine-tune a pretrained sequence to sequence model, BART (Lewis et al., 2019), on the literal-simile pairs to generate novel similes given a literal sentence. Experiments show that our approach generates 88% novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that our model generates similes better than two literary experts 37% of the time, and better than three baseline systems, including a recent metaphor generation model, 71% of the time when compared pairwise. We also show how replacing literal sentences with similes from our best model in machine generated stories improves evocativeness and leads to better acceptance by human judges.
      @inproceedings{chakrabarty-etal-2020-generating,
        title = {Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation},
        author = {Chakrabarty, Tuhin and Muresan, Smaranda and Peng, Nanyun},
        booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        pages = {6455--6469},
        publisher = {Association for Computational Linguistics},
        slideslive_id = {38938962},
        year = {2020}
      }
      
      Details
    3. Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction

      Rujun Han, Yichao Zhou, and Nanyun Peng, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
      Full Text Slides Code Abstract BibTeX Details
      Extracting event temporal relations is a critical task for information extraction and plays an important role in natural language understanding. Prior systems leverage deep learning and pre-trained language models to improve the performance of the task. However, these systems often suffer from two shortcomings: 1) when performing maximum a posteriori (MAP) inference based on neural models, previous systems only used structured knowledge that is assumed to be absolutely correct, i.e., hard constraints; 2) biased predictions on dominant temporal relations when training with a limited amount of data. To address these issues, we propose a framework that enhances deep neural networks with distributional constraints constructed by probabilistic domain knowledge. We solve the constrained inference problem via Lagrangian Relaxation and apply it to end-to-end event temporal relation extraction tasks. Experimental results show our framework is able to improve the baseline neural network models with strong statistical significance on two widely used datasets in news and clinical domains.
      @inproceedings{han2020knowledge,
        title = {Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction},
        author = {Han, Rujun and Zhou, Yichao and Peng, Nanyun},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        publisher = {Association for Computational Linguistics},
        pages = {5717--5729},
        slideslive_id = {38939236},
        year = {2020}
      }
      
      Details
    4. TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions

      Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
      Full Text Code Abstract BibTeX Details
      A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.
      @inproceedings{ning2020torque,
        title = {TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions},
        author = {Ning, Qiang and Wu, Hao and Han, Rujun and Peng, Nanyun and Gardner, Matt and Roth, Dan},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        publisher = {Association for Computational Linguistics},
        pages = {1158--1172},
        slideslive_id = {38938807},
        year = {2020}
      }
      
      Details
    5. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

      Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
      Full Text Code Abstract BibTeX Details
      Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues, we introduce a dataset and evaluation platform built from STORIUM, an online collaborative storytelling community. Our author-generated dataset contains 6K lengthy stories (125M tokens) with fine-grained natural language annotations (e.g., character goals and attributes) interspersed throughout each narrative, forming a robust source for guiding models. We evaluate language models fine-tuned on our dataset by integrating them onto STORIUM, where real authors can query a model for suggested story continuations and then edit them. Automatic metrics computed over these edits correlate well with both user ratings of generated stories and qualitative feedback from semi-structured user interviews. We release both the STORIUM dataset and evaluation platform to spur more principled research into story generation.
      @inproceedings{akoury2020storium,
        title = {STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation},
        author = {Akoury, Nader and Wang, Shufan and Whiting, Josh and Hood, Stephen and Peng, Nanyun and Iyyer, Mohit},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        slideslive_id = {38939010},
        year = {2020}
      }
      
      Details
    6. Towards Controllable Biases in Language Generation

      Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, long, 2020.
      Full Text Poster Code Abstract BibTeX Details
      We present a general approach towards controllable societal biases in natural language generation (NLG). Building upon the idea of adversarial triggers, we develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups. We then analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics. The former scenario enables us to detect the types of biases present in the model. Specifically, we show the effectiveness of our approach at facilitating bias analysis by finding topics that correspond to demographic inequalities in generated text and comparing the relative effectiveness of inducing biases for different demographics. The second scenario is useful for mitigating biases in downstream applications such as dialogue generation. In our experiments, the mitigation technique proves to be effective at equalizing the amount of biases across demographics while simultaneously generating less negatively biased text overall.
      @inproceedings{sheng2020towards,
        title = {Towards Controllable Biases in Language Generation},
        author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, long},
        year = {2020}
      }
      
      Details
    7. Biomedical Event Extraction with Hierarchical Knowledge Graphs

      Kung-Hsiang Huang, Mu Yang, and Nanyun Peng, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, short, 2020.
      Full Text Slides Code Abstract BibTeX Details
      Biomedical event extraction is critical in understanding biomolecular interactions described in scientific corpora. One of the main challenges is to identify nested structured events that are associated with non-indicative trigger words. We propose to incorporate domain knowledge from the Unified Medical Language System (UMLS) into a pre-trained language model via a hierarchical graph representation encoded by the proposed Graph Edge-conditioned Attention Networks (GEANet). To better recognize the trigger words, each sentence is first grounded to a sentence graph based on a jointly modeled hierarchical knowledge graph from UMLS. The grounded graphs are then propagated by GEANet, a novel graph neural network with enhanced capabilities for inferring complex events. On the BioNLP 2011 GENIA Event Extraction task, our approach achieved 1.41% F1 and 3.19% F1 improvements on all events and complex events, respectively. Ablation studies confirm the importance of GEANet and the hierarchical KG.
      @inproceedings{huang2020event,
        title = {Biomedical Event Extraction with Hierarchical Knowledge Graphs},
        author = {Huang, Kung-Hsiang and Yang, Mu and Peng, Nanyun},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, short},
        slideslive_id = {38940169},
        year = {2020}
      }
      
      Details
    8. Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering

      Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, 2020.
      Full Text Code Abstract BibTeX Details
      Commonsense question answering (QA) requires background knowledge which is not explicitly stated in a given context. Prior works use commonsense knowledge graphs (KGs) to obtain this knowledge for reasoning. However, relying entirely on these KGs may not suffice, considering their limited coverage and the contextual dependence of their knowledge. In this paper, we augment a general commonsense QA framework with a knowledgeable path generator. By extrapolating over existing paths in a KG with a state-of-the-art language model, our generator learns to connect a pair of entities in text with a dynamic, and potentially novel, multi-hop relational path. Such paths can provide structured evidence for solving commonsense questions without fine-tuning the path generator. Experiments on two datasets show the superiority of our method over previous works which fully rely on knowledge from KGs (with up to 6% improvement in accuracy), across various amounts of training data. Further evaluation suggests that the generated paths are typically interpretable, novel, and relevant to the task.
      @inproceedings{wang2020connecting,
        title = {Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering},
        author = {Wang, Peifeng and Peng, Nanyun and Ilievski, Filip and Szekely, Pedro and Ren, Xiang},
        booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings},
        pages = {4129--4140},
        year = {2020}
      }
      
      Details
    9. R3: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge

      Tuhin Chakrabarty, Debanjan Ghosh, Smaranda Muresan, and Nanyun Peng, in the 2020 Annual Conference of the Association for Computational Linguistics (ACL), 2020.
      Full Text Code BibTeX Details
      @inproceedings{chakrabarty2020r,
        title = {R3: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge},
        author = {Chakrabarty, Tuhin and Ghosh, Debanjan and Muresan, Smaranda and Peng, Nanyun},
        booktitle = {the 2020 Annual Conference of the Association for Computational Linguistics (ACL)},
        year = {2020}
      }
      
      Details
    10. Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

      Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020.
      Full Text Code Abstract BibTeX Details
      User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can be incorporated into automatic evaluation metrics for open-domain dialogue systems to improve the correlation with human judgements. This suggests that predictive engagement can be used as a real-time feedback for training better dialogue models.
      @inproceedings{ghazarian2020predictive,
        title = {Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems},
        author = {Ghazarian, Sarik and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
        booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
        pages = {7789--7796},
        year = {2020}
      }
      
      Details
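
      A toy sketch of the aggregation idea in the abstract above: a conversation-level engagement score predicted from utterance-level scores. The utterance scores here are given directly and simple averaging stands in for the learned aggregation; neither is the paper's actual model.

      from statistics import mean

      def conversation_engagement(utterance_scores):
          """Aggregate utterance-level engagement scores (in [0, 1]) into a single
          conversation-level score; mean aggregation is used for illustration."""
          return mean(utterance_scores)

      print(round(conversation_engagement([0.2, 0.8, 0.6]), 3))  # 0.533
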
    11. Enabling Low-Resource Transfer Learning across COVID-19 Corpora by Combining Event-Extraction and Co-Training

      Alexander Spangher, Nanyun Peng, Jonathan May, and Emilio Ferrara, in ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID), 2020.
      Full Text BibTeX Details
      @inproceedings{spangher2020enabling,
        title = {Enabling Low-Resource Transfer Learning across COVID-19 Corpora by Combining Event-Extraction and Co-Training},
        author = {Spangher, Alexander and Peng, Nanyun and May, Jonathan and Ferrara, Emilio},
        booktitle = {ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID)},
        year = {2020}
      }
      
      Details
    12. Man is to person as woman is to location: Measuring gender bias in named entity recognition

      Ninareh Mehrabi, Thamme Gowda, Fred Morstatter, Nanyun Peng, and Aram Galstyan, in 31st ACM Conference on Hypertext and Social Media (HT’20), 2020.
      Full Text BibTeX Details
      @inproceedings{mehrabi2020man,
        title = {Man is to person as woman is to location: Measuring gender bias in named entity recognition},
        author = {Mehrabi, Ninareh and Gowda, Thamme and Morstatter, Fred and Peng, Nanyun and Galstyan, Aram},
        booktitle = {31st ACM Conference on Hypertext and Social Media (HT’20)},
        year = {2020}
      }
      
      Details

    2019

    1. Joint Event and Temporal Relation Extraction with Shared Representations and Structured Prediction

      Rujun Han, Qiang Ning, and Nanyun Peng, in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
      Full Text Poster Code BibTeX Details
      @inproceedings{han2019joint,
        title = {Joint Event and Temporal Relation Extraction with Shared Representations and Structured Prediction},
        author = {Han, Rujun and Ning, Qiang and Peng, Nanyun},
        booktitle = {2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2019}
      }
      
      Details
    2. The Woman Worked as a Babysitter: On Biases in Language Generation

      Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2019.
      Full Text BibTeX Details
      @inproceedings{sheng2019woman,
        title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
        author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
        booktitle = {2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
        year = {2019}
      }
      
      Details
    3. Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing

      Tao Meng, Nanyun Peng, and Kai-Wei Chang, in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
      Full Text BibTeX Details
      @inproceedings{meng2019target,
        title = {Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing},
        author = {Meng, Tao and Peng, Nanyun and Chang, Kai-Wei},
        booktitle = {2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
        year = {2019}
      }
      
      Details
    4. What Matters for Neural Cross-Lingual Named Entity Recognition: An Empirical Analysis

      Xiaolei Huang, Jonathan May, and Nanyun Peng, in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2019.
      Full Text BibTeX Details
      @inproceedings{huang2019matters,
        title = {What Matters for Neural Cross-Lingual Named Entity Recognition: An Empirical Analysis},
        author = {Huang, Xiaolei and May, Jonathan and Peng, Nanyun},
        booktitle = {2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
        year = {2019}
      }
      
      Details
    5. Do Nuclear Submarines Have Nuclear Captains? A Challenge Dataset for Commonsense Reasoning over Adjectives and Objects

      James Mullenbach, Jonathan Gordon, Nanyun Peng, and Jonathan May, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), short, 2019.
      Full Text BibTeX Details
      @inproceedings{mullenbach2019nuclear,
        title = {Do Nuclear Submarines Have Nuclear Captains? A Challenge Dataset for Commonsense Reasoning over Adjectives and Objects},
        author = {Mullenbach, James and Gordon, Jonathan and Peng, Nanyun and May, Jonathan},
        booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), short},
        pages = {6054--6060},
        year = {2019}
      }
      
      Details
    6. Deep Structured Neural Network for Event Temporal Relation Extraction

      Rujun Han, I-Hung Hsu, Mu Yang, Aram Galstyan, Ralph Weischedel, and Nanyun Peng, in The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019.
      Full Text Code BibTeX Details
      @inproceedings{han2019deep,
        title = {Deep Structured Neural Network for Event Temporal Relation Extraction},
        author = {Han, Rujun and Hsu, I-Hung and Yang, Mu and Galstyan, Aram and Weischedel, Ralph and Peng, Nanyun},
        booktitle = {The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL)},
        year = {2019}
      }
      
      Details
    7. Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages

      Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Kai-Wei Chang, and Nanyun Peng, in The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019.
      Full Text BibTeX Details
      @inproceedings{ahmad2019cross,
        title = {Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages},
        author = {Ahmad, Wasi Uddin and Zhang, Zhisong and Ma, Xuezhe and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL)},
        year = {2019}
      }
      
      Details
    8. Learning A Unified Named Entity Tagger From Multiple Partially Annotated Corpora For Efficient Adaptation

      Xiao Huang, Li Dong, Elizabeth Boschee, and Nanyun Peng, in The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019.
      Full Text Code Abstract BibTeX Details
      Named entity recognition (NER) identifies typed entity mentions in raw text. While the task is well-established, there is no universally used tagset: often, datasets are annotated for use in downstream applications and accordingly only cover a small set of entity types relevant to a particular task. For instance, in the biomedical domain, one corpus might annotate genes, another chemicals, and another diseases—despite the texts in each corpus containing references to all three types of entities. In this paper, we propose a deep structured model to integrate these “partially annotated” datasets to jointly identify all entity types appearing in the training corpora. By leveraging multiple datasets, the model can learn robust input representations; by building a joint structured model, it avoids potential conflicts caused by combining several models’ predictions at test time. Experiments show that the proposed model significantly outperforms strong multi-task learning baselines when training on multiple, partially annotated datasets and testing on datasets that contain tags from more than one of the training corpora.
      @inproceedings{huang2019learning,
        title = {Learning A Unified Named Entity Tagger From Multiple Partially Annotated Corpora For Efficient Adaptation},
        author = {Huang, Xiao and Dong, Li and Boschee, Elizabeth and Peng, Nanyun},
        booktitle = {The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL)},
        year = {2019}
      }
      
      Details
    9. Pun Generation with Surprise

      He He, Nanyun Peng, and Percy Liang, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), 2019.
      Full Text BibTeX Details
      @inproceedings{he2019pun,
        title = {Pun Generation with Surprise},
        author = {He, He and Peng, Nanyun and Liang, Percy},
        booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019)},
        volume = {1},
        year = {2019}
      }
      
      Details
    10. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing

      Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng, in Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
      Full Text BibTeX Details
      @inproceedings{ahmad2019difficulties,
        title = {On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing},
        author = {Ahmad, Wasi Uddin and Zhang, Zhisong and Ma, Xuezhe and Hovy, Eduard and Chang, Kai-Wei and Peng, Nanyun},
        booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2019}
      }
      
      Details
    11. Plan-And-Write: Towards Better Automatic Storytelling

      Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan, in The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.
      Full Text BibTeX Details
      @inproceedings{yao2019plan,
        title = {Plan-And-Write: Towards Better Automatic Storytelling},
        author = {Yao, Lili and Peng, Nanyun and Weischedel, Ralph and Knight, Kevin and Zhao, Dongyan and Yan, Rui},
        booktitle = {The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)},
        year = {2019}
      }
      
      Details
    12. Plan, Write, and Revise: an Interactive System for Open-Domain Story Generation

      Seraphina Goldfarb-Tarrant, Haining Feng, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Demonstrations Track, 2019.
      Full Text Video Code Abstract BibTeX Details
      Story composition is a challenging problem for machines and even for humans. We present a neural narrative generation system that interacts with humans to generate stories. Our system has different levels of human interaction, which enables us to understand at what stage of story-writing human collaboration is most productive, both to improving story quality and human engagement in the writing process. We compare different varieties of interaction in story-writing, story-planning, and diversity controls under time constraints, and show that increased types of human collaboration at both planning and writing stages results in a 10-50% improvement in story quality as compared to less interactive baselines. We also show an accompanying increase in user engagement and satisfaction with stories as compared to our own less interactive systems and to previous turn-taking approaches to interaction. Finally, we find that humans tasked with collaboratively improving a particular characteristic of a story are in fact able to do so, which has implications for future uses of human-in-the-loop systems.
      @inproceedings{goldfarb2019plan,
        title = {Plan, Write, and Revise: an Interactive System for Open-Domain Story Generation},
        author = {Goldfarb-Tarrant, Seraphina and Feng, Haining and Peng, Nanyun},
        booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Demonstrations Track},
        volume = {4},
        pages = {89--97},
        year = {2019}
      }
      
      Details
    13. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

      Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop, 2019.
      Full Text BibTeX Details
      @inproceedings{ghazarian2019better,
        title = {Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings},
        author = {Ghazarian, Sarik and Wei, Johnny Tian-Zheng and Galstyan, Aram and Peng, Nanyun},
        booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), NeuralGen Workshop},
        year = {2019}
      }
      
      Details
    14. Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding

      Rujun Han, Mengyue Liang, Bashar Alhafni, and Nanyun Peng, in 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Workshop on Narrative Understanding, 2019.
      Full Text BibTeX Details
      @inproceedings{han2019contextualized,
        title = {Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding},
        author = {Han, Rujun and Liang, Mengyue and Alhafni, Bashar and Peng, Nanyun},
        booktitle = {2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Workshop on Narrative Understanding},
        year = {2019}
      }
      
      Details
    15. Building deep learning models for evidence classification from the open access biomedical literature

      Gully A. Burns, Xiangci Li, and Nanyun Peng, Database, 2019.
      Full Text BibTeX Details
      @article{burns2019building,
        title = {Building deep learning models for evidence classification from the open access biomedical literature},
        author = {Burns, Gully A and Li, Xiangci and Peng, Nanyun},
        journal = {Database},
        year = {2019},
        publisher = {Oxford University Press}
      }
      
      Details
    16. Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

      Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, and Sanjeev Khudanpur, in The 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
      Full Text BibTeX Details
      @inproceedings{wang2019espresso,
        title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
        author = {Wang, Yiming and Chen, Tongfei and Xu, Hainan and Ding, Shuoyang and Lv, Hang and Shao, Yiwen and Peng, Nanyun and Xie, Lei and Watanabe, Shinji and Khudanpur, Sanjeev},
        booktitle = {The 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
        year = {2019}
      }
      
      Details
    17. Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples

      Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan, in CCF International Conference on Natural Language Processing and Chinese Computing, 2019.
      Full Text BibTeX Details
      @inproceedings{li2019evaluating,
        title = {Evaluating and Enhancing the Robustness of Retrieval-Based Dialogue Systems with Adversarial Examples},
        author = {Li, Jia and Tao, Chongyang and Peng, Nanyun and Wu, Wei and Zhao, Dongyan and Yan, Rui},
        booktitle = {CCF International Conference on Natural Language Processing and Chinese Computing},
        pages = {142--154},
        year = {2019},
        organization = {Springer}
      }
      
      Details
    18. Debiasing Community Detection: The Importance of Lowly-Connected Nodes

      Ninareh Mehrabi, Fred Morstatter, Nanyun Peng, and Aram Galstyan, in The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), 2019.
      Full Text BibTeX Details
      @inproceedings{mehrabi2019debiasing,
        title = {Debiasing Community Detection: The Importance of Lowly-Connected Nodes},
        author = {Mehrabi, Ninareh and Morstatter, Fred and Peng, Nanyun and Galstyan, Aram},
        booktitle = {The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019)},
        year = {2019}
      }
      
      Details

    2018

    1. Style Transfer in Text: Exploration and Evaluation

      Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan, in Proceedings of The Thirty-Second Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI), 2018.
      Full Text BibTeX Details
      @inproceedings{fu2018style,
        title = {Style Transfer in Text: Exploration and Evaluation},
        author = {Fu, Zhenxin and Tan, Xiaoye and Peng, Nanyun and Zhao, Dongyan and Yan, Rui},
        booktitle = {Proceedings of The Thirty-Second Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI)},
        year = {2018}
      }
      
      Details
    2. Towards controllable story generation

      Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight, in NAACL Workshop, 2018.
      Full Text BibTeX Details
      @inproceedings{peng2018towards,
        title = {Towards controllable story generation},
        author = {Peng, Nanyun and Ghazvininejad, Marjan and May, Jonathan and Knight, Kevin},
        booktitle = {NAACL Workshop},
        year = {2018}
      }
      
      Details
    3. Learning to Converse with Noisy Data: Generation with Calibration.

      Mingyue Shang, Zhenxin Fu, Nanyun Peng, Yansong Feng, Dongyan Zhao, and Rui Yan, in IJCAI, 2018.
      Full Text BibTeX Details
      @inproceedings{shang2018learning,
        title = {Learning to Converse with Noisy Data: Generation with Calibration.},
        author = {Shang, Mingyue and Fu, Zhenxin and Peng, Nanyun and Feng, Yansong and Zhao, Dongyan and Yan, Rui},
        booktitle = {IJCAI},
        pages = {4338--4344},
        year = {2018}
      }
      
      Details
    4. Scalable Construction and Reasoning of Massive Knowledge Bases

      Xiang Ren, Nanyun Peng, and William Yang Wang, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts, 2018.
      Full Text BibTeX Details
      @inproceedings{ren2018scalable,
        title = {Scalable Construction and Reasoning of Massive Knowledge Bases},
        author = {Ren, Xiang and Peng, Nanyun and Wang, William Yang},
        booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts},
        pages = {10--16},
        year = {2018}
      }
      
      Details
    5. Stack-pointer networks for dependency parsing

      Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy, in The 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), 2018.
      Full Text BibTeX Details
      @inproceedings{ma2018stack,
        title = {Stack-pointer networks for dependency parsing},
        author = {Ma, Xuezhe and Hu, Zecong and Liu, Jingzhou and Peng, Nanyun and Neubig, Graham and Hovy, Eduard},
        booktitle = {The 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)},
        volume = {1},
        year = {2018}
      }
      
      Details

    2017

    1. A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition

      Dingquan Wang, Nanyun Peng, and Kevin Duh, in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017.
      Full Text BibTeX Details
      @inproceedings{wang2017multi,
        title = {A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition},
        author = {Wang, Dingquan and Peng, Nanyun and Duh, Kevin},
        booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
        pages = {383--388},
        year = {2017}
      }
      
      Details
    2. Multi-task multi-domain representation learning for sequence tagging

      Nanyun Peng and Mark Dredze, in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017.
      Full Text BibTeX Details
      @inproceedings{peng2017multi,
        title = {Multi-task multi-domain representation learning for sequence tagging},
        author = {Peng, Nanyun and Dredze, Mark},
        booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP},
        year = {2017}
      }
      
      Details
    3. Jointly Learning Representations for Low-Resource Information Extraction

      Nanyun Peng, PhD thesis, Johns Hopkins University, 2017.
      Full Text BibTeX Details
      @phdthesis{peng2017jointly,
        title = {Jointly Learning Representations for Low-Resource Information Extraction},
        author = {Peng, Nanyun},
        year = {2017},
        school = {Johns Hopkins University}
      }
      
      Details
    4. Cross-sentence N-ary Relation Extraction with Graph LSTMs

      Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih, Transactions of the Association for Computational Linguistics, 2017.
      Full Text BibTeX Details
      @article{peng2017cross,
        title = {Cross-sentence N-ary Relation Extraction with Graph LSTMs},
        author = {Peng, Nanyun and Poon, Hoifung and Quirk, Chris and Toutanova, Kristina and Yih, Wen-tau},
        journal = {Transactions of the Association for Computational Linguistics},
        year = {2017}
      }
      
      Details
    5. Supplementary results for named entity recognition on Chinese social media with an updated dataset

      Nanyun Peng and Mark Dredze, technical report, 2017.
      Full Text BibTeX Details
      @techreport{peng2017supplementary,
        title = {Supplementary results for named entity recognition on Chinese social media with an updated dataset},
        author = {Peng, Nanyun and Dredze, Mark},
        year = {2017}
      }
      
      Details

    2016

    1. Graph long short term memory for syntactic relationship discovery

      Christopher Brian Quirk, Kristina Nikolova Toutanova, Wen-tau Yih, Hoifung Poon, and Nanyun Peng, 2016.
      BibTeX Details
      @misc{quirk2016graph,
        title = {Graph long short term memory for syntactic relationship discovery},
        author = {Quirk, Christopher Brian and Toutanova, Kristina Nikolova and Yih, Wen-tau and Poon, Hoifung and Peng, Nanyun},
        year = {2016}
      }
      
      Details
    2. Improving named entity recognition for Chinese social media with word segmentation representation learning

      Nanyun Peng and Mark Dredze, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
      Full Text BibTeX Details
      @inproceedings{peng2016improving,
        title = {Improving named entity recognition for Chinese social media with word segmentation representation learning},
        author = {Peng, Nanyun and Dredze, Mark},
        booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics},
        year = {2016}
      }
      
      Details

    2015

    1. HLTCOE Participation in TAC KBP 2015: Cold Start and TEDL

      Taneeya Satyapanich, Tim Finin, Paul McNamee, James Mayfield, Doug Oard, Nanyun Peng, Ning Gao, Yiu-Chang Lin, Joshi MacKin, and Tim Dowd, UMBC Faculty Collection, 2015.
      BibTeX Details
      @article{satyapanich2015hltcoe,
        title = {HLTCOE Participation in TAC KBP 2015: Cold Start and TEDL},
        author = {Satyapanich, Taneeya and Finin, Tim and McNamee, Paul and Mayfield, James and Oard, Doug and Peng, Nanyun and Gao, Ning and Lin, Yiu-Chang and MacKin, Joshi and Dowd, Tim},
        journal = {UMBC Faculty Collection},
        year = {2015},
        publisher = {National Institute of Standards and Technology}
      }
      
      Details
    2. Named entity recognition for Chinese social media with jointly trained embeddings

      Nanyun Peng and Mark Dredze, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
      Full Text BibTeX Details
      @inproceedings{peng2015named,
        title = {Named entity recognition for Chinese social media with jointly trained embeddings},
        author = {Peng, Nanyun and Dredze, Mark},
        booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
        pages = {548--554},
        year = {2015}
      }
      
      Details
    3. An Empirical Study of Chinese Name Matching and Applications

      Nanyun Peng, Mo Yu, and Mark Dredze, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015.
      BibTeX Details
      @inproceedings{peng2015empirical,
        title = {An Empirical Study of Chinese Name Matching and Applications},
        author = {Peng, Nanyun and Yu, Mo and Dredze, Mark},
        booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = {2015}
      }
      
      Details
    4. Modeling word forms using latent underlying morphs and phonology

      Ryan Cotterell, Nanyun Peng, and Jason Eisner, Transactions of the Association for Computational Linguistics, 2015.
      Full Text BibTeX Details
      @article{cotterell2015modeling,
        title = {Modeling word forms using latent underlying morphs and phonology},
        author = {Cotterell, Ryan and Peng, Nanyun and Eisner, Jason},
        journal = {Transactions of the Association for Computational Linguistics},
        volume = {3},
        number = {1},
        year = {2015}
      }
      
      Details
    5. A Concrete Chinese NLP pipeline

      Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, and others, in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 2015.
      BibTeX Details
      @inproceedings{peng2015concrete,
        title = {A Concrete Chinese NLP pipeline},
        author = {Peng, Nanyun and Ferraro, Francis and Yu, Mo and Andrews, Nicholas and DeYoung, Jay and Thomas, Max and Gormley, Matthew R and Wolfe, Travis and Harman, Craig and Van Durme, Benjamin and others},
        booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations},
        pages = {86--90},
        year = {2015}
      }
      
      Details
    6. A Concrete Chinese NLP pipeline

      Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matt Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, and others, North American Chapter of the Association for Computational Linguistics (NAACL), Demonstration Session, 2015.
      BibTeX Details
      @article{peng2015chinese,
        title = {A Concrete Chinese NLP pipeline},
        author = {Peng, Nanyun and Ferraro, Francis and Yu, Mo and Andrews, Nicholas and DeYoung, Jay and Thomas, Max and Gormley, Matt and Wolfe, Travis and Harman, Craig and Van Durme, Benjamin and others},
        journal = {North American Chapter of the Association for Computational Linguistics (NAACL), Demonstration Session},
        year = {2015}
      }
      
      Details
    7. HLTCOE participation in TAC KBP 2015: Cold start and TEDL

      Tim Finin, Dawn Lawrie, Paul McNamee, James Mayfield, Doug Oard, Nanyun Peng, Ning Gao, Yiu-Chang Lin, Joshi MacKin, Tim Dowd, and others, in Eighth Text Analysis Conference, 2015.
      BibTeX Details
      @inproceedings{finin2015hltcoe,
        title = {HLTCOE participation in TAC KBP 2015: Cold start and TEDL},
        author = {Finin, Tim and Lawrie, Dawn and McNamee, Paul and Mayfield, James and Oard, Doug and Peng, Nanyun and Gao, Ning and Lin, Yiu-Chang and MacKin, Joshi and Dowd, Tim and others},
        booktitle = {Eighth Text Analysis Conference},
        year = {2015}
      }
      
      Details
    8. Dual decomposition inference for graphical models over strings

      Nanyun Peng, Ryan Cotterell, and Jason Eisner, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
      Full Text BibTeX Details
      @inproceedings{peng2015dual,
        title = {Dual decomposition inference for graphical models over strings},
        author = {Peng, Nanyun and Cotterell, Ryan and Eisner, Jason},
        booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
        pages = {917--927},
        year = {2015}
      }
      
      Details

    2014

    1. Stochastic Contextual Edit Distance and Probabilistic FSTs

      Ryan Cotterell, Nanyun Peng, and Jason Eisner, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
      Full Text BibTeX Details
      @inproceedings{cotterell2014stochastic,
        title = {Stochastic Contextual Edit Distance and Probabilistic FSTs},
        author = {Cotterell, Ryan and Peng, Nanyun and Eisner, Jason},
        booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics},
        year = {2014}
      }
      
      Details
    2. Learning polylingual topic models from code-switched social media documents

      Nanyun Peng, Yiming Wang, and Mark Dredze, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014.
      Full Text BibTeX Details
      @inproceedings{peng2014learning,
        title = {Learning polylingual topic models from code-switched social media documents},
        author = {Peng, Nanyun and Wang, Yiming and Dredze, Mark},
        booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
        pages = {674--679},
        year = {2014}
      }
      
      Details

    2012

    1. On convergence rate of concave-convex procedure

      Ian E. H. Yen, Nanyun Peng, Po-Wei Wang, and Shou-De Lin, in Proceedings of the NIPS 2012 Optimization Workshop, 2012.
      BibTeX Details
      @inproceedings{yen2012convergence,
        title = {On convergence rate of concave-convex procedure},
        author = {Yen, Ian EH and Peng, Nanyun and Wang, Po-Wei and Lin, Shou-De},
        booktitle = {Proceedings of the NIPS 2012 Optimization Workshop},
        pages = {31--35},
        year = {2012}
      }
      
      Details
    2. Online Plagiarism Detection Through Exploiting Lexical, Syntax, and Semantic Information

      Wan-Yu Lin, Nanyun Peng, Chun-Chao Yen, and Shou-de Lin, in Proceedings of the ACL 2012 System Demonstrations, 2012.
      BibTeX Details
      @inproceedings{lin2012online,
        title = {Online Plagiarism Detection Through Exploiting Lexical, Syntax, and Semantic Information},
        author = {Lin, Wan-Yu and Peng, Nanyun and Yen, Chun-Chao and Lin, Shou-de},
        booktitle = {Proceedings of the ACL 2012 System Demonstrations},
        pages = {145--150},
        year = {2012}
      }
      
      Details
    3. Exploiting latent information to predict diffusions of novel topics on social networks

      Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Nanyun Peng, Shou-De Lin, and Wei-Fen Lin, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2012.
      BibTeX Details
      @inproceedings{kuo2012exploiting,
        title = {Exploiting latent information to predict diffusions of novel topics on social networks},
        author = {Kuo, Tsung-Ting and Hung, San-Chuan and Lin, Wei-Shih and Peng, Nanyun and Lin, Shou-De and Lin, Wei-Fen},
        booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
        pages = {344--348},
        year = {2012}
      }
      
      Details