Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
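Step (iii) above — zero-shot inference via masked language modeling — can be sketched as follows. The idea is to append a `[MASK]` token to the question prompt and rank a closed vocabulary of candidate answers by the masked LM's distribution at that position. This is a minimal illustration only: the `toy_logits` dictionary is a hypothetical stand-in for the frozen BiLM's output logits, not the actual model.

```python
import math

def select_answer(mask_logits, answer_vocab):
    """Pick the candidate answer with the highest log-probability
    under the masked LM's distribution at the [MASK] position.

    mask_logits: dict mapping vocabulary token -> raw logit at the mask.
    answer_vocab: list of candidate answer tokens to rank.
    """
    # Numerically stable log-softmax over the full vocabulary.
    m = max(mask_logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in mask_logits.values()))
    logprobs = {tok: v - log_z for tok, v in mask_logits.items()}
    # Restrict to the answer vocabulary and take the argmax.
    return max(answer_vocab, key=lambda a: logprobs.get(a, float("-inf")))

# Hypothetical logits a masked LM might produce for the prompt
# "Question: what animal is shown in the video? Answer: [MASK]."
toy_logits = {"dog": 3.1, "cat": 2.4, "car": -1.0, "the": 0.5}
print(select_answer(toy_logits, ["dog", "cat", "car"]))  # → dog
```

Restricting the softmax argmax to a fixed answer vocabulary is what turns open-ended mask filling into a VideoQA prediction; with real logits, the same ranking step applies unchanged.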
| Task | Dataset | Metric | Accuracy (%) | Model |
|---|---|---|---|---|
| Video Question Answering | LSMDC-FiB | Accuracy | 51.5 | FrozenBiLM (0-shot) |
| Video Question Answering | iVQA | Accuracy | 26.8 | FrozenBiLM (0-shot) |
| Video Question Answering | MSRVTT-QA | Accuracy | 16.7 | FrozenBiLM (0-shot) |
| Video Question Answering | MSVD-QA | Accuracy | 33.8 | FrozenBiLM (0-shot) |
| Video Question Answering | ActivityNet-QA | Accuracy | 25.9 | FrozenBiLM (0-shot) |
| Video Question Answering | TGIF-QA | Accuracy | 41.9 | FrozenBiLM (0-shot) |
| Video Question Answering | How2QA | Accuracy | 58.4 | FrozenBiLM (0-shot) |
| Video Question Answering | TVQA | Accuracy | 59.7 | FrozenBiLM (0-shot, with speech) |
| Video Question Answering | TVQA | Accuracy | 29.7 | FrozenBiLM (0-shot, no speech) |
| Video Question Answering | EgoSchema (full set) | Accuracy | 26.9 | FrozenBiLM (0-shot) |
| Video Question Answering | ActivityNet-QA | Accuracy | 24.7 | FrozenBiLM |
| Video Question Answering | iVQA | Accuracy | 39.6 | FrozenBiLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 47.0 | FrozenBiLM |
| Video Question Answering | MSVD-QA | Accuracy | 54.8 | FrozenBiLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.2 | FrozenBiLM |
| Video Question Answering | How2QA | Accuracy | 86.7 | FrozenBiLM |
| Video Question Answering | TVQA | Accuracy | 82.0 | FrozenBiLM |