Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Published: 2022-06-16
Tasks: Zero-Shot Video Question Answering, Question Answering, Fill Mask, Masked Language Modeling, Video Question Answering, Visual Question Answering (VQA), Zero-Shot Learning, Language Modelling
Links: Paper · PDF · Code (official)

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, here we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
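The inference step (iii) described in the abstract, casting VideoQA as fill-mask prediction, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template, the answer vocabulary, and the `masked_lm_logits` stand-in below are all hypothetical (the real FrozenBiLM feeds video features through trainable adapter modules into a frozen bidirectional LM and scores the `[MASK]` position).

```python
def masked_lm_logits(prompt, video_tags, vocab):
    """Hypothetical stand-in for a frozen BiLM conditioned on video.

    A real implementation would pass video features through lightweight
    trainable adapters into the frozen bidirectional LM and return its
    logits at the [MASK] position; here we fake plausible scores so the
    sketch runs without a model download.
    """
    scores = {word: 0.0 for word in vocab}
    for word in vocab:
        # Toy heuristic: boost candidate answers that match a (fake)
        # visual tag extracted from the video.
        if word in video_tags:
            scores[word] += 1.0
    return scores


def zero_shot_videoqa(question, video_tags, answer_vocab):
    # Cast the question as a fill-mask problem: the masked token *is* the answer.
    prompt = f"Question: {question} Answer: [MASK]."
    logits = masked_lm_logits(prompt, video_tags, answer_vocab)
    # Pick the candidate answer the (mock) BiLM scores highest for [MASK].
    return max(answer_vocab, key=lambda w: logits[w])


if __name__ == "__main__":
    vocab = ["dog", "cat", "guitar"]
    tags = {"dog", "park"}  # pretend visual tags from the video
    print(zero_shot_videoqa("What animal is running?", tags, vocab))  # dog
```

The key design point this mirrors is that no answer decoder is trained: the frozen bidirectional LM's existing masked-token distribution, restricted to an answer vocabulary, directly yields the prediction.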

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Learning | iVQA | Accuracy | 0.268 | FrozenBiLM |
| Zero-Shot Learning | LSMDC | Accuracy | 51.5 | FrozenBiLM |
| Question Answering | MSVD-QA | Accuracy | 33.8 | FrozenBiLM |
| Question Answering | TGIF-QA | Accuracy | 41.9 | FrozenBiLM |
| Question Answering | TVQA | Accuracy | 59.7 | FrozenBiLM (with speech) |
| Question Answering | TVQA | Accuracy | 29.7 | FrozenBiLM (no speech) |
| Question Answering | EgoSchema (full set) | Accuracy | 26.9 | FrozenBiLM |
| Question Answering | ActivityNet-QA | Accuracy | 24.7 | FrozenBiLM |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.548 | FrozenBiLM |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.47 | FrozenBiLM |
| Video Question Answering | TVQA | Accuracy | 82 | FrozenBiLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.2 | FrozenBiLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 25.9 | FrozenBiLM (0-shot) |
| Video Question Answering | MSRVTT-QA | Accuracy | 47 | FrozenBiLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 16.7 | FrozenBiLM (0-shot) |
| Video Question Answering | iVQA | Accuracy | 39.6 | FrozenBiLM |
| Video Question Answering | iVQA | Accuracy | 26.8 | FrozenBiLM (0-shot) |
| Video Question Answering | How2QA | Accuracy | 86.7 | FrozenBiLM |
| Video Question Answering | How2QA | Accuracy | 58.4 | FrozenBiLM (0-shot) |
| Video Question Answering | MSVD-QA | Accuracy | 33.8 | FrozenBiLM |
| Video Question Answering | TGIF-QA | Accuracy | 41.9 | FrozenBiLM |
| Video Question Answering | TVQA | Accuracy | 59.7 | FrozenBiLM (with speech) |
| Video Question Answering | TVQA | Accuracy | 29.7 | FrozenBiLM (no speech) |
| Video Question Answering | EgoSchema (full set) | Accuracy | 26.9 | FrozenBiLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 24.7 | FrozenBiLM |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)