Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

2020-12-01 | ICCV 2021 10
Tasks: Question Answering, Video Question Answering, Question Generation, Visual Question Answering (VQA), Zero-Shot Learning


Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
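To make the contrastive training idea in the abstract concrete, here is a minimal sketch of an InfoNCE-style loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the authors' exact formulation: the function name `contrastive_loss`, the use of plain dot-product scores, and the choice of the other answers in the batch as negatives are all assumptions, and the embeddings are taken as given lists of floats rather than outputs of the video-question and answer transformers.

```python
import math


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over in-batch answers (InfoNCE-style).

    vq_embeddings[i] and answer_embeddings[i] are assumed to form a matching
    pair; every other answer in the batch is treated as a negative.
    """
    total = 0.0
    for i, vq in enumerate(vq_embeddings):
        # Dot-product score between this video-question embedding and each answer.
        scores = [sum(x * y for x, y in zip(vq, ans)) for ans in answer_embeddings]
        # Numerically stable log-sum-exp over all candidate answers.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability assigned to the matching answer.
        total += log_denominator - scores[i]
    return total / len(vq_embeddings)


if __name__ == "__main__":
    aligned = [[10.0, 0.0], [0.0, 10.0]]
    # Matching pairs score highest, so the loss is close to zero.
    print(contrastive_loss(aligned, aligned))
    # Swapping the answers makes every pair mismatched, so the loss is large.
    print(contrastive_loss(aligned, aligned[::-1]))
```

Because answers are scored by embedding similarity rather than by a fixed-size classification layer, a loss of this shape does not need a closed answer vocabulary, which is consistent with the open-vocabulary motivation stated in the abstract.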

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MSVD-QA	Accuracy	46.3	Just Ask
Visual Question Answering (VQA)	MSRVTT-QA	Accuracy	41.5	Just Ask
Video Question Answering	ActivityNet-QA	Accuracy	38.9	Just Ask (fine-tune)
Video Question Answering	ActivityNet-QA	Accuracy	12.2	Just Ask (0-shot)
Video Question Answering	iVQA	Accuracy	35.4	Just Ask (fine-tune)
Video Question Answering	iVQA	Accuracy	12.2	Just Ask (0-shot)
Video Question Answering	How2QA	Accuracy	84.4	Just Ask
Video Question Answering	How2QA	Accuracy	51.1	Just Ask (0-shot)
Video Question Answering	VideoQA	Accuracy	15.6	Just Ask (fine-tune)
