Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models

Wei Han, Hui Chen, Min-Yen Kan, Soujanya Poria

2023-07-09 | Question Answering | Video Question Answering | Video Understanding | Visual Question Answering (VQA) | TGIF-Frame | Visual Question Answering

Abstract

Video question-answering is a fundamental task in the field of video understanding. Although current vision-language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, these gains come at the cost of huge computational power, making the models too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image-text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the question. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation gets worse as sampling sparsity increases, which always happens as video length increases. To mitigate this issue, we propose two frame sampling strategies, namely most dominant frames (MDF) and most implied frames (MIF), to maximally preserve the frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video-question pair with the assistance of auxiliary models. Experimental results on three public datasets with three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that the proposed strategies can boost the performance of image-text pretrained models. The source code for the proposed method is publicly available at https://github.com/declare-lab/sas-vqa.
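
To make the contrast in the abstract concrete, the following is a minimal sketch of question-agnostic sampling versus question-aware frame selection. It is not the MDF or MIF algorithm from the paper, whose details are not given here. The function names `uniform_sample` and `relevance_sample` are hypothetical, and the sketch assumes that each frame and the question have already been encoded into feature vectors of the same dimension (for example by an image-text encoder).

```python
import math


def uniform_sample(num_frames, k):
    """Question-agnostic baseline: pick k evenly spaced frame indices."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the middle frame of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_sample(frame_feats, question_feat, k):
    """Illustrative question-aware selection (an assumption, not the paper's method):
    keep the k frames whose features are most similar to the question feature,
    returned in temporal order."""
    scores = [cosine(f, question_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Toy 2-D features standing in for real encoder outputs.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
    question = [1.0, 0.0]
    print(uniform_sample(len(frames), 2))
    print(relevance_sample(frames, question, 2))
```

In this toy setting the uniform sampler ignores the question entirely, while the relevance-based selector keeps only frames that score highly against it, which is the kind of omission risk the abstract attributes to aimless sampling.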

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.44 | AIO+MIF |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.438 | AIO+MDF |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.423 | GIT+MDF |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.469 | GIT+MDF |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.467 | AIO+MIF |
