Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

2020-02-15 · Action Segmentation · Video Retrieval · Video Captioning · Language Modelling

Abstract

With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training works have gradually been developed to improve video-text related downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on the sizeable instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
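The CMLM objective mentioned in the abstract masks text tokens and asks the model to reconstruct them conditioned on the paired video. A minimal sketch of the masking step is below; the 15% mask rate and the `mask_tokens` helper are conventional BERT-style assumptions for illustration, not the paper's exact recipe, and the video conditioning happens inside the model, which is not shown.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of text tokens with [MASK].
    The model is trained to reconstruct the masked tokens (here,
    conditioned on paired video features, which are not shown).
    mask_prob=0.15 is an assumed BERT-style rate."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # reconstruction target at this position
        else:
            masked.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return masked, labels

# Toy caption from an instructional-video setting (hypothetical example).
masked, labels = mask_tokens("slice the tomato thinly".split(), mask_prob=0.5, seed=1)
```

The loss is computed only at masked positions (`labels[i] is not None`), which is what distinguishes a masked-reconstruction objective from plain language modelling.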

Results

| Task                | Dataset  | Metric                     | Value | Model |
|---------------------|----------|----------------------------|-------|-------|
| Video Retrieval     | YouCook2 | text-to-video R@1          | 28.9  | UniVL |
| Video Retrieval     | YouCook2 | text-to-video R@5          | 57.6  | UniVL |
| Video Retrieval     | YouCook2 | text-to-video R@10         | 70    | UniVL |
| Video Retrieval     | YouCook2 | text-to-video Median Rank  | 4     | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@1          | 21.2  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@5          | 49.6  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video R@10         | 63.1  | UniVL |
| Video Retrieval     | MSR-VTT  | text-to-video Median Rank  | 6     | UniVL |
| Video Captioning    | YouCook2 | BLEU-3                     | 23.87 | UniVL |
| Video Captioning    | YouCook2 | BLEU-4                     | 17.35 | UniVL |
| Video Captioning    | YouCook2 | METEOR                     | 22.35 | UniVL |
| Video Captioning    | YouCook2 | ROUGE-L                    | 46.52 | UniVL |
| Video Captioning    | YouCook2 | CIDEr                      | 1.81  | UniVL |
| Action Segmentation | COIN     | Frame accuracy             | 70    | UniVL |
| Action Localization | COIN     | Frame accuracy             | 70    | UniVL |
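The retrieval numbers above are Recall@K (percentage of queries whose ground-truth video appears in the top K results) and Median Rank (median position of the ground-truth video, lower is better). These are standard metrics computable from a caption-by-video similarity matrix; a stdlib-only sketch, assuming the usual diagonal setup where caption i matches video i:

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video Recall@K (%) and Median Rank from a
    similarity matrix sim[i][j] = score(caption_i, video_j),
    assuming caption i's ground-truth match is video i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly above the true one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    srt = sorted(ranks)
    n = len(srt)
    metrics["MedR"] = float(srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2)
    return metrics

# Toy example: 4 captions scored against 4 videos.
sim = [
    [0.9, 0.1, 0.2, 0.0],  # caption 0: true video ranked 1st
    [0.3, 0.8, 0.1, 0.2],  # caption 1: true video ranked 1st
    [0.7, 0.2, 0.4, 0.1],  # caption 2: true video ranked 2nd
    [0.1, 0.0, 0.2, 0.6],  # caption 3: true video ranked 1st
]
print(retrieval_metrics(sim))  # R@1 = 75.0, R@5 = 100.0, R@10 = 100.0, MedR = 1.0
```

Breaking ties by counting only strictly higher scores is one common convention; evaluation scripts differ on tie handling, so exact reproductions should follow the paper's released code.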

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)