Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language pretraining to help video-language tasks). To this end, we propose a decoupled joint pretraining of image-language and video-language that effectively decomposes vision-language modeling into spatial and temporal dimensions and boosts performance on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are exploited as fully as possible. Without extra task-specific adapters, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at similar model size and data scale.
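The paper does not include code here, but the idea behind the UniVLC loss can be sketched: label data (image classification, action recognition) is folded into the contrastive objective by treating the class name as text, and every visual/text pair in a batch that shares a label counts as a positive. Below is a minimal numpy sketch under those assumptions; the function name `univlc_loss`, the temperature value, and the batching convention are illustrative, not the authors' implementation.

```python
import numpy as np

def univlc_loss(vis_emb, txt_emb, labels, temperature=0.07):
    """Sketch of a unified vision-language contrastive loss.

    Image-text, video-text, image-label, and video-label samples can
    share one batch: for label data, the "text" embedding comes from a
    prompted class name (e.g. "a photo of a dog"), and all pairs with
    the same label are treated as positives (not just the diagonal).
    """
    # L2-normalize so dot products are cosine similarities
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = v @ t.T / temperature                     # (B, B) similarities

    labels = np.asarray(labels)
    pos = (labels[:, None] == labels[None, :]).astype(float)  # positive mask

    # vision -> text: average negative log-likelihood over all positives
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss_v2t = -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)

    # text -> vision: same thing on the transposed similarity matrix
    log_prob_t = sim.T - np.log(np.exp(sim.T).sum(axis=1, keepdims=True))
    loss_t2v = -(pos * log_prob_t).sum(axis=1) / pos.sum(axis=1)

    return float((loss_v2t.mean() + loss_t2v.mean()) / 2)
```

With perfectly aligned embeddings the loss approaches zero; misaligning the pairs while keeping the label mask fixed drives it up, which is the behavior a symmetric InfoNCE-style objective should show.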
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | R@1 | 47.8 | OmniVL |
| Text-to-Video Retrieval | MSR-VTT | R@5 | 74.2 | OmniVL |
| Text-to-Video Retrieval | MSR-VTT | R@10 | 83.8 | OmniVL |
| Text-to-Video Retrieval | DiDeMo | R@1 | 52.4 | OmniVL |
| Text-to-Video Retrieval | DiDeMo | R@5 | 79.5 | OmniVL |
| Text-to-Video Retrieval | DiDeMo | R@10 | 85.4 | OmniVL |
| Action Recognition | Kinetics-400 | Top-1 Accuracy | 79.1 | OmniVL |
| Action Recognition | Kinetics-400 | Top-5 Accuracy | 94.5 | OmniVL |
| Video Question Answering | MSRVTT-QA | Accuracy | 44.1 | OmniVL |
| Video Question Answering | MSVD-QA | Accuracy | 51.0 | OmniVL |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 62.5 | OmniVL |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 86.2 | OmniVL |
| Image Captioning | nocaps val (in-domain) | CIDEr | 104.6 | OmniVL |
| Image Captioning | nocaps val (in-domain) | SPICE | 15.0 | OmniVL |
| Image Captioning | nocaps val (near-domain) | CIDEr | 108.3 | OmniVL |
| Image Captioning | nocaps val (near-domain) | SPICE | 14.9 | OmniVL |
| Image Captioning | nocaps val (out-of-domain) | CIDEr | 106.3 | OmniVL |
| Image Captioning | nocaps val (out-of-domain) | SPICE | 14.2 | OmniVL |
| Image Captioning | nocaps val (overall) | CIDEr | 107.5 | OmniVL |
| Image Captioning | nocaps val (overall) | SPICE | 14.7 | OmniVL |
| Video Captioning | YouCook2 | BLEU-3 | 12.87 | OmniVL |
| Video Captioning | YouCook2 | BLEU-4 | 8.72 | OmniVL |
| Video Captioning | YouCook2 | CIDEr | 1.16 | OmniVL |
| Video Captioning | YouCook2 | METEOR | 14.83 | OmniVL |
| Video Captioning | YouCook2 | ROUGE-L | 36.09 | OmniVL |
| Image-Text Retrieval | Flickr30k | Image-to-text R@1 | 97.3 | OmniVL (14M) |
| Image-Text Retrieval | Flickr30k | Image-to-text R@5 | 99.9 | OmniVL (14M) |
| Image-Text Retrieval | Flickr30k | Image-to-text R@10 | 100.0 | OmniVL (14M) |
| Image-Text Retrieval | Flickr30k | Text-to-image R@1 | 87.9 | OmniVL (14M) |
| Image-Text Retrieval | Flickr30k | Text-to-image R@5 | 97.8 | OmniVL (14M) |
| Image-Text Retrieval | Flickr30k | Text-to-image R@10 | 99.1 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Image-to-text R@1 | 82.1 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Image-to-text R@5 | 95.9 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Image-to-text R@10 | 98.1 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Text-to-image R@1 | 64.8 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Text-to-image R@5 | 86.1 | OmniVL (14M) |
| Image-Text Retrieval | COCO 2014 | Text-to-image R@10 | 91.6 | OmniVL (14M) |
| Zero-Shot Text-to-Video Retrieval | MSR-VTT | R@1 | 34.6 | OmniVL |
| Zero-Shot Text-to-Video Retrieval | MSR-VTT | R@5 | 58.4 | OmniVL |
| Zero-Shot Text-to-Video Retrieval | MSR-VTT | R@10 | 66.6 | OmniVL |
| Zero-Shot Text-to-Video Retrieval | DiDeMo | R@1 | 33.3 | OmniVL |
| Zero-Shot Text-to-Video Retrieval | DiDeMo | R@5 | 58.7 | OmniVL |
| Zero-Shot Text-to-Video Retrieval | DiDeMo | R@10 | 68.5 | OmniVL |
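The retrieval rows above report recall at K: the percentage of queries whose ground-truth item appears among the top-K ranked results. A minimal numpy sketch of that metric, assuming (as retrieval benchmarks like MSR-VTT and DiDeMo conventionally do) that the ground-truth item for query i sits at index i; the helper name `recall_at_k` is ours.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K from a (num_queries, num_items) similarity matrix.

    Assumes the ground-truth item for query i is item i.
    Returns {K: recall as a percentage} for each K in `ks`.
    """
    order = np.argsort(-sim, axis=1)  # item indices, best match first
    # Position of the ground-truth item in each query's ranking (0 = top)
    gt_rank = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return {k: float((gt_rank < k).mean() * 100) for k in ks}
```

For example, with three queries where two ground-truth items rank first and one ranks second, R@1 is 66.7 and R@2 is 100.0; the published tables are this computation over the full test set.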