Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Revealing Single Frame Bias for Video-and-Language Learning

Jie Lei, Tamara L. Berg, Mohit Bansal

Published: 2022-06-07

Tasks: Question Answering, Video Retrieval, Zero-Shot Video Retrieval, Fine-grained Action Recognition, Text to Video Retrieval, Video Question Answering, Action Recognition, Retrieval, Language Modelling

Abstract

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
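The frame ensemble strategy mentioned in the abstract can be pictured as scoring each sampled frame independently with the single-frame model and aggregating the per-frame scores at inference time. Below is a minimal sketch, assuming a dual-encoder that produces L2-normalized text and frame embeddings; the function name and the mean/max aggregation choices are illustrative, not necessarily the exact ensembling used in the paper.

```python
import torch

def ensemble_video_score(text_emb: torch.Tensor,
                         frame_embs: torch.Tensor,
                         mode: str = "mean") -> torch.Tensor:
    """Score one (text, video) pair with a single-frame model by scoring
    each sampled frame independently, then aggregating at inference time.

    text_emb:   (D,)   L2-normalized text embedding
    frame_embs: (T, D) L2-normalized embeddings of T frames sampled from
                the video, each produced by the same single-frame encoder
    """
    per_frame = frame_embs @ text_emb          # (T,) cosine similarities
    if mode == "mean":
        return per_frame.mean()                # average the frame scores
    if mode == "max":
        return per_frame.max()                 # or keep the best frame
    raise ValueError(f"unknown ensemble mode: {mode!r}")
```

The key point is that temporal order never enters the computation: the model is trained on single frames, and multiple frames only interact through score aggregation at test time.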

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Video Question Answering | ActivityNet-QA | Accuracy | 44.1 | Singularity-temporal |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.1 | Singularity |
| Video Question Answering | MSRVTT-QA | Accuracy | 43.9 | Singularity-temporal |
| Video Question Answering | MSRVTT-QA | Accuracy | 43.5 | Singularity |
| Video Question Answering | MSRVTT-MC | Accuracy | 93.7 | Singularity-temporal |
| Video Question Answering | MSRVTT-MC | Accuracy | 92.1 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 41.5 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 68.7 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 77 | Singularity |
| Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 77.6 | Singularity-temporal |
| Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 96 | Singularity-temporal |
| Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 98.9 | Singularity-temporal |
| Video Retrieval | ActivityNet | text-to-video R@1 | 47.1 | Singularity |
| Video Retrieval | ActivityNet | text-to-video R@5 | 75.5 | Singularity |
| Video Retrieval | ActivityNet | text-to-video R@10 | 85.5 | Singularity |
| Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 47.4 | Singularity-temporal |
| Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 75.9 | Singularity-temporal |
| Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 84 | Singularity-temporal |
| Video Retrieval | DiDeMo | text-to-video R@1 | 53.9 | Singularity |
| Video Retrieval | DiDeMo | text-to-video R@5 | 79.4 | Singularity |
| Video Retrieval | DiDeMo | text-to-video R@10 | 86.9 | Singularity |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 34 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 56.7 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.7 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 28.4 | Singularity-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 50.2 | Singularity-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 59.5 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 37.1 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.7 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 69.9 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 36.9 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.1 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 69.3 | Singularity-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.8 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 55.9 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 66.3 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.6 | Singularity-temporal-17M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 55.6 | Singularity-temporal-17M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 66.9 | Singularity-temporal-17M |
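Every retrieval number in the table is text-to-video Recall@K: the percentage of text queries whose ground-truth video appears among the top K videos ranked by similarity. Below is a minimal sketch of how the metric is computed, assuming a precomputed similarity matrix with ground-truth pairs on the diagonal; the function name and layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Compute text-to-video Recall@K from a similarity matrix.

    sim: (N, N) array where sim[i, j] scores text query i against video j,
         and video i is the ground-truth match for text i.
    """
    order = np.argsort(-sim, axis=1)       # videos ranked best-first per query
    gt = np.arange(sim.shape[0])[:, None]  # ground-truth video index per row
    rank = np.argmax(order == gt, axis=1)  # position of the true video (0-based)
    return {f"R@{k}": float((rank < k).mean() * 100) for k in ks}
```

For example, an R@5 of 68.7 on MSR-VTT-1kA means that for 68.7% of the text queries, the correct video was ranked among the top five candidates.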

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)