Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

2021-05-20 · Findings (ACL) 2021

Tasks: Action Segmentation · Video Retrieval · Video Captioning · Video Understanding · Retrieval · Temporal Action Localization · Language Modelling

Paper · PDF · Code (official)

Abstract

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masked text tokens to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all of the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
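To make the cross-modal masking idea concrete, here is a minimal sketch of a "masked text token predicts its closest video clip embedding" objective. This is not the paper's implementation (the official code is in fairseq's MMPT example); the function name, the `align_idx` input (assumed to come from temporal alignment between tokens and clips), the mask ratio, and the temperature are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_text_to_video_loss(text_emb, video_emb, align_idx,
                              mask_ratio=0.15, temperature=0.07):
    """Hypothetical sketch of a cross-modal masking objective.

    text_emb:  (T, d) contextual text-token embeddings from the encoder
    video_emb: (V, d) clip embeddings from the paired video
    align_idx: (T,)  index of the temporally closest clip for each token
    """
    T = text_emb.shape[0]
    # Pick a random subset of text positions to treat as masked.
    mask = torch.rand(T) < mask_ratio
    if not mask.any():                      # guarantee at least one position
        mask[torch.randint(T, (1,))] = True
    pred = F.normalize(text_emb[mask], dim=-1)     # (M, d)
    clips = F.normalize(video_emb, dim=-1)         # (V, d)
    # Each masked token must score its aligned ("closest") clip highest
    # among all clips of the video: a softmax over clip similarities.
    logits = pred @ clips.t() / temperature        # (M, V)
    return F.cross_entropy(logits, align_idx[mask])

# Toy usage with random features and random alignments:
T, V, d = 12, 6, 256
loss = masked_text_to_video_loss(
    torch.randn(T, d), torch.randn(V, d), torch.randint(V, (T,)))
```

Because the loss only ever consumes one modality's tokens as queries, the same encoder can also be trained with purely unimodal targets, which is what preserves the separability the abstract mentions.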

Results

Task | Dataset | Metric | Value | Model
Video | CrossTask | Recall | 46.5 | VLM
Video | MSR-VTT-1kA | text-to-video Median Rank | 4 | VLM
Video | MSR-VTT-1kA | text-to-video R@1 | 28.1 | VLM
Video | MSR-VTT-1kA | text-to-video R@10 | 67.4 | VLM
Video | MSR-VTT-1kA | text-to-video R@5 | 55.5 | VLM
Video | YouCook2 | text-to-video Median Rank | 4 | VLM
Video | YouCook2 | text-to-video R@1 | 27.05 | VLM
Video | YouCook2 | text-to-video R@10 | 69.38 | VLM
Video | YouCook2 | text-to-video R@5 | 56.88 | VLM
Temporal Action Localization | CrossTask | Recall | 46.5 | VLM
Zero-Shot Learning | CrossTask | Recall | 46.5 | VLM
Action Localization | CrossTask | Recall | 46.5 | VLM
Action Localization | COIN | Frame accuracy | 68.4 | VLM
Video Captioning | YouCook2 | BLEU-3 | 17.78 | VLM
Video Captioning | YouCook2 | BLEU-4 | 12.27 | VLM
Video Captioning | YouCook2 | CIDEr | 1.3869 | VLM
Video Captioning | YouCook2 | METEOR | 18.22 | VLM
Video Captioning | YouCook2 | ROUGE-L | 41.51 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 28.1 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 67.4 | VLM
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.5 | VLM
Video Retrieval | YouCook2 | text-to-video Median Rank | 4 | VLM
Video Retrieval | YouCook2 | text-to-video R@1 | 27.05 | VLM
Video Retrieval | YouCook2 | text-to-video R@10 | 69.38 | VLM
Video Retrieval | YouCook2 | text-to-video R@5 | 56.88 | VLM
Action Segmentation | COIN | Frame accuracy | 68.4 | VLM
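For reference, text-to-video R@k is the percentage of text queries whose ground-truth video appears in the top k of the similarity-ranked candidates, and Median Rank is the median position of the ground-truth video (lower is better). The helper below is a self-contained sketch of these standard metrics, not code from the paper or the archive; it assumes the usual evaluation protocol in which query i's ground-truth video is candidate i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim: (Q, N) similarity matrix; sim[i, j] is the score of video j
    for text query i, and video i is the ground truth for query i."""
    Q = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending score
    # Rank of the ground-truth video for each query (1 = best).
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(Q)])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics

# Toy check: a diagonal-dominant similarity matrix should score highly.
sim = np.random.randn(100, 100) + 3.0 * np.eye(100)
print(retrieval_metrics(sim))
```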

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)