Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

Published: 2022-01-07 · CVPR 2022
Tasks: Action Classification, Navigate, Video Understanding, Visual Commonsense Reasoning

Abstract

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
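
To make the training objective described above concrete, here is a minimal sketch (not the authors' implementation) of a contrastive masked-snippet loss in PyTorch: the model's output at each MASK position is scored against a pool of candidate text/audio snippet embeddings, and the loss rewards selecting the correct snippet. All function names, tensor shapes, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of a contrastive masked-snippet objective.
# Shapes, names, and hyperparameters are assumptions for illustration only.

import torch
import torch.nn.functional as F


def masked_snippet_contrastive_loss(mask_predictions, candidate_embeddings,
                                     targets, temperature=0.05):
    """Cross-entropy over cosine similarities between MASK predictions and candidates.

    mask_predictions:     (num_masks, dim) model outputs at MASK positions
    candidate_embeddings: (num_candidates, dim) embeddings of text/audio snippets
    targets:              (num_masks,) index of the correct snippet per MASK
    """
    preds = F.normalize(mask_predictions, dim=-1)
    cands = F.normalize(candidate_embeddings, dim=-1)
    logits = preds @ cands.T / temperature   # (num_masks, num_candidates)
    return F.cross_entropy(logits, targets)


# Toy usage: 4 masked positions, 16 candidate snippets, 256-dim embeddings.
preds = torch.randn(4, 256)
cands = torch.randn(16, 256)
targets = torch.tensor([3, 7, 0, 12])
print(masked_snippet_contrastive_loss(preds, cands, targets).item())
```

Choosing the true snippet from a shared candidate pool, rather than regressing its features, is what makes the objective a classification problem over snippets, which is consistent with the "choosing the correct masked-out snippet" formulation in the abstract.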

Results

Task  | Dataset      | Metric         | Value | Model
Video | Kinetics-600 | Top-1 Accuracy | 91.1  | MerlotReserve-Large (+Audio)
Video | Kinetics-600 | Top-5 Accuracy | 97.1  | MerlotReserve-Large (+Audio)
Video | Kinetics-600 | Top-1 Accuracy | 89.7  | MerlotReserve-Base (+Audio)
Video | Kinetics-600 | Top-5 Accuracy | 96.6  | MerlotReserve-Base (+Audio)
Video | Kinetics-600 | Top-1 Accuracy | 89.4  | MerlotReserve-Large (no Audio)
Video | Kinetics-600 | Top-5 Accuracy | 96.3  | MerlotReserve-Large (no Audio)
Video | Kinetics-600 | Top-1 Accuracy | 88.1  | MerlotReserve-Base (no Audio)
Video | Kinetics-600 | Top-5 Accuracy | 95.8  | MerlotReserve-Base (no Audio)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation (2025-07-14)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Automating MD simulations for Proteins using Large language Models: NAMD-Agent (2025-07-10)