Papers With Code 2



Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng

2023-05-22 · Question Answering · Video Retrieval · Video-Text Retrieval · Text Retrieval · Video Question Answering · Video Captioning · Retrieval · Visual Question Answering (VQA) · TGIF-Frame

Abstract

Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 61.0, and 79.0, respectively. Codes and models will be released.
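The abstract names two strategies: feature adapting (a video adapter module that gives CLIP's frame-level features temporal modeling) and feature blending (end-to-end training that exploits the complementarity of image and video features). No implementation is shown on this page, so the PyTorch sketch below only illustrates the general idea of a temporal adapter applied to frozen per-frame CLIP embeddings; the class name, dimensions, and architecture are assumptions for illustration, not the authors' actual module.

```python
# Hypothetical sketch of a temporal adapter over frozen per-frame CLIP embeddings.
# Names, dimensions, and architecture are illustrative assumptions, not VLAB's module.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Adds lightweight temporal mixing on top of per-frame CLIP features."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -- one CLIP embedding per sampled frame.
        attn_out, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        fused = self.norm(frame_feats + attn_out)  # residual temporal mixing across frames
        return fused.mean(dim=1)                   # pooled video-level representation


# Usage: pool 8 frame embeddings of width 768 into one video embedding per clip.
adapter = TemporalAdapter(dim=768)
video_embedding = adapter(torch.randn(2, 8, 768))  # -> shape (2, 768)
```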

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.496 | VLAB
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.61 | VLAB
Video Captioning | MSR-VTT | BLEU-4 | 54.6 | VLAB
Video Captioning | MSR-VTT | CIDEr | 74.9 | VLAB
Video Captioning | MSR-VTT | METEOR | 33.4 | VLAB
Video Captioning | MSR-VTT | ROUGE-L | 68.3 | VLAB
Video Captioning | MSVD | BLEU-4 | 79.3 | VLAB
Video Captioning | MSVD | CIDEr | 179.8 | VLAB
Video Captioning | MSVD | METEOR | 51.2 | VLAB
Video Captioning | MSVD | ROUGE-L | 87.9 | VLAB
Video Retrieval | DiDeMo | text-to-video R@1 | 56.8 | VLAB
Video Retrieval | DiDeMo | text-to-video R@5 | 81.6 | VLAB
Video Retrieval | DiDeMo | text-to-video R@10 | 88.7 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@1 | 55.1 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@5 | 78.8 | VLAB
Video Retrieval | MSR-VTT | text-to-video R@10 | 87.6 | VLAB
Video Retrieval | MSVD | text-to-video R@1 | 57.5 | VLAB
Video Retrieval | MSVD | text-to-video R@5 | 83.6 | VLAB
Video Retrieval | MSVD | text-to-video R@10 | 89.9 | VLAB
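
The retrieval rows report text-to-video Recall@K: the fraction of text queries whose ground-truth video appears among the top K retrieved results. As a reference for how such numbers are typically computed from a text-video similarity matrix, here is a small generic sketch; it assumes query i's ground-truth video is index i and is not VLAB's evaluation code.

```python
# Generic text-to-video Recall@K from a (num_texts x num_videos) similarity matrix.
# Assumes text query i is paired with video i; this mirrors common evaluation setups,
# not VLAB's exact evaluation script.
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    ranks = np.argsort(-similarity, axis=1)        # videos sorted by descending similarity
    gt = np.arange(similarity.shape[0])[:, None]   # ground-truth video index per text query
    hits = (ranks[:, :k] == gt).any(axis=1)        # is the ground truth within the top k?
    return float(hits.mean())


sim = np.random.randn(1000, 1000)                  # stand-in for text/video embedding similarities
for k in (1, 5, 10):
    print(f"text-to-video R@{k}: {100 * recall_at_k(sim, k):.1f}")
```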

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)