Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J. Kim

Published: 2023-08-18
Venue: ICCV 2023
Tasks: Question Answering, Video Question Answering, Visual Question Answering (VQA), TGIF-Frame, Multiple-choice

Abstract

Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA, which aims to predict the answer given several options, the goal of open-ended VideoQA is to answer questions without restricting candidate answers. However, the majority of previous VideoQA models formulate open-ended VideoQA as a classification task that assigns video-question pairs to a fixed answer set, i.e., a closed vocabulary, which contains only frequent answers (e.g., the top-1000 answers). This biases the model toward frequent answers and causes it to fail to generalize to out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers by aggregating information from their similar words. For evaluation, we introduce new baselines by modifying the existing (closed-vocabulary) open-ended VideoQA models and improve their performance by further taking into account rare and unseen answers. Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves model performance, especially on rare and unseen answers. We hope that our benchmark OVQA can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA.
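
The closed-vocabulary limitation described in the abstract can be illustrated with a small sketch. It is not taken from the OVQA code base: the function names `build_closed_vocab` and `accuracy_ceiling` and the toy answer lists are hypothetical, invented only for this illustration. The sketch keeps the k most frequent training answers as the fixed answer set and then measures what fraction of test answers a classifier over that set could ever predict correctly.

```python
from collections import Counter


def build_closed_vocab(train_answers, k):
    """Keep only the k most frequent training answers (the fixed answer set)."""
    return {answer for answer, _ in Counter(train_answers).most_common(k)}


def accuracy_ceiling(test_answers, vocab):
    """Fraction of test answers that a classifier over `vocab` could ever get right."""
    if not test_answers:
        return 0.0
    in_vocab = sum(1 for answer in test_answers if answer in vocab)
    return in_vocab / len(test_answers)


# Toy data, invented for illustration only (not from any VideoQA dataset).
train_answers = ["dog"] * 5 + ["cat"] * 3 + ["guitar"] * 2 + ["ukulele"]
test_answers = ["dog", "cat", "ukulele", "banjo"]

vocab = build_closed_vocab(train_answers, k=2)
ceiling = accuracy_ceiling(test_answers, vocab)
print(sorted(vocab), ceiling)  # prints ['cat', 'dog'] 0.5
```

In this toy setting, "ukulele" (rare) and "banjo" (unseen) fall outside the answer set, so even a perfect closed-vocabulary classifier is capped at 50% accuracy. This is the kind of gap that an open-vocabulary benchmark is meant to expose.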

Results

Task                             Dataset         Metric    Value  Model
Visual Question Answering (VQA)  MSRVTT-QA       Accuracy  0.47   FrozenBiLM+
Visual Question Answering (VQA)  MSRVTT-QA       Accuracy  0.418  JustAsk+
Visual Question Answering (VQA)  MSRVTT-QA       Accuracy  0.395  All-in-one+
Visual Question Answering (VQA)  MSVD-QA         Accuracy  0.558  FrozenBiLM+
Visual Question Answering (VQA)  MSVD-QA         Accuracy  0.495  VIOLET+
Visual Question Answering (VQA)  MSVD-QA         Accuracy  0.477  JustAsk+
Visual Question Answering (VQA)  MSVD-QA         Accuracy  0.438  All-in-one+
Video Question Answering         ActivityNet-QA  Accuracy  44.8   FrozenBiLM+
Video Question Answering         ActivityNet-QA  Accuracy  40     All-in-one+
Video Question Answering         ActivityNet-QA  Accuracy  39.7   VIOLET+
