Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

2023-05-11 · NeurIPS 2023
Tasks: Zero-Shot Video Question Answer · Question Answering · Representation Learning · Video Question Answering · Temporal Localization · Language Modelling
Paper · PDF · Code (official)

Abstract

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind to that moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of the Localizer, compare it with other temporal localization models, study its pre-training and self-refinement, and vary the number of keyframes.
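The forward chain described in the abstract can be sketched as a two-stage pipeline: the Localizer scores every uniformly sampled frame against the language query and keeps the top-k keyframes, and the Answerer then answers using only those keyframes. The sketch below is illustrative only; the function names and toy keyword-overlap scoring are hypothetical stand-ins for the two parameter-efficiently fine-tuned BLIP-2 modules, not the authors' implementation.

```python
# Illustrative sketch of SeViLA's forward-chain inference.
# localizer_score and answerer are hypothetical stand-ins for the
# BLIP-2 Localizer and Answerer modules described in the paper.

def localizer_score(frame, query):
    # Stand-in Localizer: rate a frame's relevance to the query.
    # Toy implementation: count overlapping words with the query.
    return len(set(frame["caption"].split()) & set(query.split()))

def answerer(keyframes, query, options):
    # Stand-in Answerer: pick the answer option best supported by
    # the selected keyframes (again via toy word overlap).
    context = set(" ".join(f["caption"] for f in keyframes).split())
    return max(options, key=lambda o: len(set(o.split()) & context))

def sevila_forward_chain(frames, query, options, k=4):
    # 1) Localizer ranks all uniformly sampled frames by relevance
    #    to the query and keeps the top-k language-aware keyframes.
    keyframes = sorted(frames, key=lambda f: localizer_score(f, query),
                       reverse=True)[:k]
    # 2) Answerer predicts the answer from those keyframes only.
    return answerer(keyframes, query, options)

# Tiny worked example with captioned dummy frames.
frames = [
    {"t": 0, "caption": "a man walks into a kitchen"},
    {"t": 1, "caption": "the man pours coffee into a cup"},
    {"t": 2, "caption": "a dog sleeps on the sofa"},
]
answer = sevila_forward_chain(
    frames, "what does the man pour", ["coffee", "water"], k=2)
```

The reverse chain works in the opposite direction: the Answerer's per-frame answer confidence is turned into keyframe pseudo-labels that fine-tune the Localizer, so no human moment annotations are needed.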

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA | Accuracy | 63.6 | SeViLA (4B) |
| Question Answering | IntentQA | Accuracy | 60.9 | SeViLA (4B) |
| Question Answering | TVQA | Accuracy | 38.2 | SeViLA (no speech) |
| Question Answering | EgoSchema (fullset) | Accuracy | 22.7 | SeViLA (4B) |
| Question Answering | EgoSchema (subset) | Accuracy | 25.7 | SeViLA (4B) |
| Video Question Answering | STAR Benchmark | Average Accuracy | 64.9 | SeViLA |
| Video Question Answering | STAR Benchmark | Average Accuracy | 44.6 | SeViLA (0-shot) |
| Video Question Answering | NExT-QA | Accuracy | 73.8 | SeViLA |
| Video Question Answering | NExT-QA (Efficient) | 1:1 Accuracy | 73.8 | SeViLA (4 frames) |
| Video Question Answering | NExT-QA | Accuracy | 63.6 | SeViLA (4B) |
| Video Question Answering | IntentQA | Accuracy | 60.9 | SeViLA (4B) |
| Video Question Answering | TVQA | Accuracy | 38.2 | SeViLA (no speech) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 22.7 | SeViLA (4B) |
| Video Question Answering | EgoSchema (subset) | Accuracy | 25.7 | SeViLA (4B) |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)