
All in One: Exploring Unified Video-Language Pre-training

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Published: 2022-03-14 · CVPR 2023
Tasks: Question Answering, Video Retrieval, Representation Learning, Video Question Answering, TGIF-Transition, Retrieval, Visual Question Answering (VQA), TGIF-Action, Visual Commonsense Reasoning, TGIF-Frame, Language Modelling, Multiple-choice

Abstract

Mainstream video-language pre-training models (e.g., ActBERT, ClipBERT, VIOLET) consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by using heavier unimodal encoders or multimodal fusion Transformers, which increases the parameter count and lowers efficiency on downstream tasks. In this work, we introduce for the first time an end-to-end video-language model, the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data is a key barrier hindering the design of a modality-agnostic Transformer. To overcome this challenge, we introduce a novel and effective token rolling operation that encodes temporal representations from video clips in a non-parametric manner. This careful design enables representation learning for both video-text multimodal inputs and unimodal inputs with a single unified backbone. After fine-tuning, our pre-trained all-in-one Transformer transfers to various downstream video-text tasks, including text-video retrieval, video question answering, multiple choice, and visual commonsense reasoning. State-of-the-art performance with minimal model FLOPs on nine datasets demonstrates the superiority of our method over competitive counterparts. The code and pretrained model have been released at https://github.com/showlab/all-in-one.
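The token rolling operation is the abstract's key non-parametric mechanism for mixing temporal information across frames. The sketch below is a rough illustration only, not the authors' released code: it assumes video tokens shaped (batch, frames, tokens, dim) and a hypothetical roll_fraction parameter controlling how many of each frame's tokens are shifted to the neighboring frame.

```python
import torch

def token_rolling(x: torch.Tensor, roll_fraction: float = 0.25) -> torch.Tensor:
    """Illustrative temporal token rolling (assumption-based sketch).

    x: video tokens of shape (B, T, N, D) -- batch, frames, tokens per
    frame, embedding dim. The first int(N * roll_fraction) tokens of each
    frame are shifted one step along the temporal axis, so every frame
    sees a few tokens from its neighbor. No learned parameters are used.
    """
    b, t, n, d = x.shape
    k = int(n * roll_fraction)  # number of tokens to roll per frame
    rolled = x.clone()
    # torch.roll shifts along the frame axis (dims=1); shifts=1 moves
    # frame i's leading tokens into frame i+1, wrapping the last frame
    # around to the first.
    rolled[:, :, :k, :] = torch.roll(x[:, :, :k, :], shifts=1, dims=1)
    return rolled
```

Because the operation only reuses existing token embeddings, it adds no parameters, which is what allows the same backbone to serve unimodal and multimodal inputs alike.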

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 37.9 | All-in-one-B
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 68.1 | All-in-one-B
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 77.1 | All-in-one-B
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.443 | All-in-one-B
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.483 | All-in-one-B
Video Question Answering | STAR Benchmark | Average Accuracy | 47.5 | All-in-one
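For reference, text-to-video R@k is the fraction of text queries whose ground-truth video appears among the top k retrieved results, so R@1 = 37.9 means the correct video is ranked first for 37.9% of queries; the MSRVTT-QA and MSVD-QA accuracies are reported as fractions (0.443 ≈ 44.3%). A minimal sketch of the metric, assuming a square similarity matrix where query i's ground-truth video sits at index i (the usual 1:1 evaluation setup):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@k for text-to-video retrieval.

    sim: (num_texts, num_videos) similarity matrix where sim[i, j] scores
    text query i against video j; the ground-truth video for query i is
    assumed to be at index i. Returns the fraction of queries whose true
    video ranks within the top k.
    """
    ranks = (-sim).argsort(axis=1)  # video indices, best-scoring first
    topk = ranks[:, :k]
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())
```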

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)