Share this page:

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.

Download the full text


Abstract

Commonsense is defined as the knowledge on which everyone agrees. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenes of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition.



Bib Entry

@inproceedings{yin2021broaden,
  title = {Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning},
  author = {Yin, Da and Li, Liunian Harold and Hu, Ziniu and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2021}
}

Related Publications

  1. Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization

    Zi-Yi Dou and Nanyun Peng, in The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
    Full Text Code Abstract BibTeX Details
    Commonsense question answering (CQA) aims to test if models can answer questions regarding commonsense knowledge that everyone knows. Prior works that incorporate external knowledge bases have shown promising results, but knowledge bases are expensive to construct and are often limited to a fixed set of relations. In this paper, we instead focus on better utilizing the implicit knowledge stored in pre-trained language models. While researchers have found that the knowledge embedded in pre-trained language models can be extracted by having them fill in the blanks of carefully designed prompts for relation extraction and text classification, it remains unclear if we can adopt this paradigm in CQA where the inputs and outputs take much more flexible forms. To this end, we investigate four translation methods that can translate natural questions into cloze-style sentences to better solicit commonsense knowledge from language models, including a syntactic-based model, an unsupervised neural model, and two supervised neural models. In addition, to combine the different translation methods, we propose to encourage consistency among model predictions on different translated questions with unlabeled data. We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance. Analyses also reveal distinct characteristics of the different cloze translation methods and provide insights on why combining them can lead to great improvements.
    @inproceedings{dou2022improving,
      title = {Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization},
      author = {Dou, Zi-Yi and Peng, Nanyun},
      booktitle = {The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI)},
      year = {2022}
    }
    
    Details
  2. Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

    Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
    Full Text Video Code Abstract BibTeX Details
    Commonsense is defined as the knowledge on which everyone agrees. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenes of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition.
    @inproceedings{yin2021broaden,
      title = {Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning},
      author = {Yin, Da and Li, Liunian Harold and Hu, Ziniu and Peng, Nanyun and Chang, Kai-Wei},
      booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      year = {2021}
    }
    
    Details
  3. COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

    Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-lin Wu, Xuezhe Ma, and Nanyun Peng, in Proceedings of Findings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-Findings), 2021.
    Full Text Code Abstract BibTeX Details
    Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI). Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets. However, the reliability and comprehensiveness of these benchmarks towards assessing model’s commonsense reasoning ability remains unclear. To this end, we introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs. We propose a pairwise accuracy metric to reliably measure an agent’s ability to perform commonsense reasoning over a given situation. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, we design our dataset along the dimensions of knowledge domains, reasoning scenarios and numeracy. Experimental results demonstrate that our strongest baseline (UnifiedQA-3B), after fine-tuning, achieves  71% standard accuracy and  51% pairwise accuracy, well below human performance ( 95% for both metrics).
    @inproceedings{sw2021com,
      title = {COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences},
      author = {Singh, Shikhar and Wen, Nuan and Hou, Yu and Alipoormolabashi, Pegah and Wu, Te-lin and Ma, Xuezhe and Peng, Nanyun},
      booktitle = {Proceedings of Findings of the Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-Findings)},
      year = {2021}
    }
    
    Details
  4. Identifying Distributional Perspective Differences from Colingual Groups

    Yufei Tian, Tuhin Chakrabarty, Fred Morstatter, and Nanyun Peng, in NAACL 2021 Workshop of Social NLP, 2021.
    Full Text Code Abstract BibTeX Details
    Perspective differences exist among different cultures or languages. A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions. Automatically understanding the group perspectives can provide essential background for many downstream applications of natural language processing techniques. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held out set of diverse topics including marriage, corruption, democracy, our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.
    @inproceedings{tian2021identifying,
      title = {Identifying Distributional Perspective Differences from Colingual Groups},
      author = {Tian, Yufei and Chakrabarty, Tuhin and Morstatter, Fred and Peng, Nanyun},
      booktitle = {NAACL 2021 Workshop of Social NLP},
      year = {2021}
    }
    
    Details
  5. Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering

    Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren, in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, 2020.
    Full Text Code Abstract BibTeX Details
    Commonsense question answering (QA) requires background knowledge which is not explicitly stated in a given context. Prior works use commonsense knowledge graphs (KGs) to obtain this knowledge for reasoning. However, relying entirely on these KGs may not suffice, considering their limited coverage and the contextual dependence of their knowledge. In this paper, we augment a general commonsense QA framework with a knowledgeable path generator. By extrapolating over existing paths in a KG with a state-of-the-art language model, our generator learns to connect a pair of entities in text with a dynamic, and potentially novel, multi-hop relational path. Such paths can provide structured evidence for solving commonsense questions without fine-tuning the path generator. Experiments on two datasets show the superiority of our method over previous works which fully rely on knowledge from KGs (with up to 6% improvement in accuracy), across various amounts of training data. Further evaluation suggests that the generated paths are typically interpretable, novel, and relevant to the task.
    @inproceedings{wang2020connecting,
      title = {Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering},
      author = {Wang, Peifeng and Peng, Nanyun and Ilievski, Filip and Szekely, Pedro and Ren, Xiang},
      booktitle = {the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings},
      pages = {4129--4140},
      year = {2020}
    }
    
    Details
  6. Do Nuclear Submarines Have Nuclear Captains? A Challenge Dataset for Commonsense Reasoning over Adjectives and Objects

    James Mullenbach, Jonathan Gordon, Nanyun Peng, and Jonathan May, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), short, 2019.
    Full Text BibTeX Details
    @inproceedings{mullenbach2019nuclear,
      title = {Do Nuclear Submarines Have Nuclear Captains? A Challenge Dataset for Commonsense Reasoning over Adjectives and Objects},
      author = {Mullenbach, James and Gordon, Jonathan and Peng, Nanyun and May, Jonathan},
      booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), short},
      pages = {6054--6060},
      year = {2019}
    }
    
    Details