

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

2020-12-01 · ICCV 2021 · Tasks: Question Answering, Video Question Answering, Question Generation, Visual Question Answering (VQA), Zero-Shot Learning, Visual Question Answering
Paper · PDF · Code (official)

Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
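The contrastive objective described in the abstract can be made concrete with a short sketch. The snippet below (PyTorch) scores a joint video-question embedding against answer embeddings and treats the other answers in the same batch as negatives. The function name, embedding dimension, L2 normalisation, and in-batch negative sampling are illustrative assumptions, not the authors' released implementation (see the project page for that).

import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    # vq_emb:  (B, D) embeddings from a video-question multi-modal encoder
    # ans_emb: (B, D) embeddings from an answer encoder; row i is the correct
    #          answer for row i of vq_emb, other rows serve as in-batch negatives
    vq = F.normalize(vq_emb, dim=-1)           # assumed L2 normalisation
    ans = F.normalize(ans_emb, dim=-1)
    logits = vq @ ans.t()                      # (B, B) similarity matrix
    targets = torch.arange(vq.size(0), device=vq.device)
    # maximise the similarity of each (video, question) pair with its own answer
    return F.cross_entropy(logits, targets)

# toy usage with random embeddings
loss = contrastive_vqa_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())

In a formulation like this, inference can score a question against embeddings of arbitrary candidate answers and return the highest-scoring one, which is what makes zero-shot VideoQA possible without a dataset-specific answer classifier.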

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.463 | Just Ask
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.415 | Just Ask
Video Question Answering | ActivityNet-QA | Accuracy | 38.9 | Just Ask (fine-tune)
Video Question Answering | ActivityNet-QA | Accuracy | 12.2 | Just Ask (0-shot)
Video Question Answering | iVQA | Accuracy | 35.4 | Just Ask (fine-tune)
Video Question Answering | iVQA | Accuracy | 12.2 | Just Ask (0-shot)
Video Question Answering | How2QA | Accuracy | 84.4 | Just Ask
Video Question Answering | How2QA | Accuracy | 51.1 | Just Ask (0-shot)
Video Question Answering | VideoQA | Accuracy | 15.6 | Just Ask (fine-tune)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)