Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

2022-09-15

Tasks: Cross-Modal Retrieval · Question Answering · Video Retrieval · Image Classification · Action Classification · Video-Text Retrieval · Zero-Shot Video Retrieval · Text Retrieval · Cross-Modal Alignment · Video Question Answering · Video Captioning · Image Captioning · Action Recognition · Retrieval · Visual Question Answering (VQA) · Temporal Action Localization · Language Modelling
Paper · PDF

Abstract

This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language data to help video-language tasks). To this end, we propose decoupled joint pretraining of image-language and video-language, which effectively decomposes vision-language modeling into spatial and temporal dimensions and boosts performance on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at similar model size and data scale.
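Two ideas in the abstract lend themselves to a concrete illustration: a single visual encoder that ingests images as one-frame videos, and the UniVLC loss, which folds image-text, video-text, image-label, and video-label pairs into one contrastive objective by verbalizing labels as text. The PyTorch sketch below is a hedged approximation of those ideas, not the paper's implementation; the function names, the prompt template, and the plain symmetric InfoNCE form (which ignores UniVLC's handling of multiple positives per label) are all assumptions.

```python
# Minimal sketch of the two ideas above; names are hypothetical and this is
# NOT the OmniVL implementation. A faithful UniVLC would also need a
# multi-positive variant, since several batch items can share one label.
import torch
import torch.nn.functional as F

def image_as_video(images: torch.Tensor) -> torch.Tensor:
    """Treat a batch of images (B, C, H, W) as single-frame videos
    (B, T=1, C, H, W) so one visual encoder serves both modalities."""
    return images.unsqueeze(1)

def label_to_prompt(label: str) -> str:
    """Verbalize a classification label so labeled data can reuse the
    image/video-text contrastive path (assumed template)."""
    return f"a photo or video of {label.replace('_', ' ')}"

def unified_contrastive_loss(visual_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (visual, text) pairs: row i of each
    tensor is a positive pair, every other row in the batch is a negative."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)       # visual -> text
                  + F.cross_entropy(logits.T, targets))  # text -> visual
```

Under this framing, a Kinetics-400 clip with the label "archery" and a COCO image-caption pair contribute gradients through the same loss; only the source of the text side differs.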

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | DiDeMo | text-to-video R@1 | 52.4 | OmniVL
Video Retrieval | DiDeMo | text-to-video R@5 | 79.5 | OmniVL
Video Retrieval | DiDeMo | text-to-video R@10 | 85.4 | OmniVL
Video Retrieval | MSR-VTT | text-to-video R@1 | 47.8 | OmniVL
Video Retrieval | MSR-VTT | text-to-video R@5 | 74.2 | OmniVL
Video Retrieval | MSR-VTT | text-to-video R@10 | 83.8 | OmniVL
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 34.6 | OmniVL
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 58.4 | OmniVL
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.6 | OmniVL
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 33.3 | OmniVL
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 58.7 | OmniVL
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 68.5 | OmniVL
Action Classification | Kinetics-400 | Acc@1 | 79.1 | OmniVL
Action Classification | Kinetics-400 | Acc@5 | 94.5 | OmniVL
Action Recognition | Something-Something V2 | Top-1 Accuracy | 62.5 | OmniVL
Action Recognition | Something-Something V2 | Top-5 Accuracy | 86.2 | OmniVL
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.441 | OmniVL
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.51 | OmniVL
Image Captioning | nocaps-val-in-domain | CIDEr | 104.6 | OmniVL
Image Captioning | nocaps-val-in-domain | SPICE | 15 | OmniVL
Image Captioning | nocaps-val-near-domain | CIDEr | 108.3 | OmniVL
Image Captioning | nocaps-val-near-domain | SPICE | 14.9 | OmniVL
Image Captioning | nocaps-val-out-domain | CIDEr | 106.3 | OmniVL
Image Captioning | nocaps-val-out-domain | SPICE | 14.2 | OmniVL
Image Captioning | nocaps-val-overall | CIDEr | 107.5 | OmniVL
Image Captioning | nocaps-val-overall | SPICE | 14.7 | OmniVL
Video Captioning | YouCook2 | BLEU-3 | 12.87 | OmniVL
Video Captioning | YouCook2 | BLEU-4 | 8.72 | OmniVL
Video Captioning | YouCook2 | CIDEr | 1.16 | OmniVL
Video Captioning | YouCook2 | METEOR | 14.83 | OmniVL
Video Captioning | YouCook2 | ROUGE-L | 36.09 | OmniVL
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97.3 | OmniVL (14M)
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 99.9 | OmniVL (14M)
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | OmniVL (14M)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 87.9 | OmniVL (14M)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 97.8 | OmniVL (14M)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.1 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 82.1 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.9 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.1 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 64.8 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 86.1 | OmniVL (14M)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 91.6 | OmniVL (14M)
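Most of the retrieval numbers above are Recall@K: the percentage of queries whose ground-truth match ranks within the top K retrieved candidates. A small self-contained sketch of the computation, on toy data with a hypothetical helper name:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Recall@K given a (num_queries, num_candidates) similarity matrix
    where the correct candidate for query i sits at column i."""
    ranks = np.argsort(-similarity, axis=1)    # best candidate first
    top_k = ranks[:, :k]                       # top-K candidate indices
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy example with 3 queries and 3 candidates.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # 33.3...: only query 0 ranks its match first
print(recall_at_k(sim, 3))  # 100.0: every match is within the top 3
```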

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)