
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang

2021-09-14 · EMNLP 2021 · Tasks: Cultural Vocal Bursts Intensity Prediction, Visual Commonsense Reasoning
Paper · PDF · Code (official)

Abstract

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for the Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
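GD-VCR follows the VCR format: each example pairs an image with a question and four answer choices, and a model is scored correct if it selects the annotated answer. The minimal sketch below scores a set of predictions per region under that assumption; the field names (`region`, `answer_label`, `annot_id`) are hypothetical and may differ from the released annotation files.

```python
import json
from collections import defaultdict

def score_predictions(annotation_path: str, predictions: dict) -> dict:
    """Score 4-way multiple-choice predictions, grouped by region.

    `predictions` maps example id -> index of the chosen answer (0-3).
    Field names below are assumptions, not the official GD-VCR schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(annotation_path) as f:
        for line in f:
            ex = json.loads(line)
            region = ex["region"]       # e.g. "West", "East Asia", "South Asia", "Africa"
            gold = ex["answer_label"]   # index of the correct answer choice
            pred = predictions.get(ex["annot_id"])
            total[region] += 1
            correct[region] += int(pred == gold)
    return {r: correct[r] / total[r] for r in total}
```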

Results

Task             | Dataset | Metric     | Value  | Model
Visual Reasoning | GD-VCR  | Accuracy   | 88.84  | Human
Visual Reasoning | GD-VCR  | Accuracy   | 59.99  | ViLBERT
Visual Reasoning | GD-VCR  | Gap (West) | -7.28  | ViLBERT
Visual Reasoning | GD-VCR  | Accuracy   | 53.95  | VisualBERT
Visual Reasoning | GD-VCR  | Gap (West) | -10.42 | VisualBERT
Visual Reasoning | GD-VCR  | Accuracy   | 35.33  | Text-only BERT
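The "Gap (West)" rows report how far a model's accuracy on the non-Western regions falls below its accuracy on the Western subset. A minimal sketch of that computation, assuming per-region accuracies are already available (the region names follow the abstract; the values here are illustrative, not the paper's numbers):

```python
# Hypothetical per-region accuracies for a model (illustrative values only).
region_accuracy = {
    "West": 0.65,
    "East Asia": 0.58,
    "South Asia": 0.57,
    "Africa": 0.56,
}

non_west = [acc for region, acc in region_accuracy.items() if region != "West"]
# Gap (West): average non-Western accuracy minus Western accuracy;
# a negative value means the model performs worse outside Western regions.
gap_west = sum(non_west) / len(non_west) - region_accuracy["West"]
print(f"Gap (West): {gap_west * 100:+.2f} points")
```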

Related Papers

Compositional Image-Text Matching and Retrieval by Grounding Entities (2025-05-04)
Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing (2025-01-15)
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey (2024-12-11)
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor (2024-12-08)
Improving Visual Commonsense in Language Models via Multiple Image Generation (2024-06-19)
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? (2024-06-11)
ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition (2024-06-09)
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models (2024-06-03)