Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we seek to establish connections between the multi-modality video tracks (vision, audio, and subtitle) and text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video, and better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and the VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
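
The caption-integration step described above can be pictured as assembling one instructional prompt per clip from the three text sources. The following is a minimal sketch under stated assumptions: the class `ClipAnnotations`, the function `build_omni_prompt`, the field names, and the instruction wording are all hypothetical and are not taken from the paper or its code release. The LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotations:
    """Text available for one video clip (all field names are hypothetical)."""
    vision_caption: str  # output of the vision captioner
    audio_caption: str   # output of the audio captioner
    subtitle: str        # subtitle text; may be empty


def build_omni_prompt(clip: ClipAnnotations) -> str:
    """Assemble an instructional prompt asking an LLM to merge the text sources."""
    # Illustrative instruction only; the prompt actually used for VAST-27M
    # is not given in the abstract.
    instruction = (
        "Merge the following descriptions of one video clip into a single "
        "caption that covers what is seen, heard, and said."
    )
    parts = [
        instruction,
        f"Vision caption: {clip.vision_caption}",
        f"Audio caption: {clip.audio_caption}",
    ]
    # Only include the subtitle line when the clip actually has subtitle text.
    if clip.subtitle.strip():
        parts.append(f"Subtitle: {clip.subtitle.strip()}")
    return "\n".join(parts)


example = ClipAnnotations(
    vision_caption="A man plays guitar on a stage.",
    audio_caption="Guitar music with a cheering crowd.",
    subtitle="Thank you all for coming tonight!",
)
prompt = build_omni_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used, and the response stored as the clip's omni-modality caption.
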
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | VATEX | text-to-video R@1 | 83 | VAST |
| Video | VATEX | text-to-video R@10 | 99.2 | VAST |
| Video | VATEX | text-to-video R@5 | 98.2 | VAST |
| Video | ActivityNet | text-to-video R@1 | 70.5 | VAST |
| Video | ActivityNet | text-to-video R@10 | 95.5 | VAST |
| Video | ActivityNet | text-to-video R@5 | 90.9 | VAST |
| Video | YouCook2 | text-to-video R@1 | 50.4 | VAST |
| Video | YouCook2 | text-to-video R@10 | 80.8 | VAST |
| Video | YouCook2 | text-to-video R@5 | 74.3 | VAST |
| Video | DiDeMo | text-to-video R@1 | 72 | VAST |
| Video | DiDeMo | text-to-video R@10 | 91.4 | VAST |
| Video | DiDeMo | text-to-video R@5 | 89 | VAST |
| Video | MSR-VTT | text-to-video R@1 | 63.9 | VAST |
| Video | MSR-VTT | text-to-video R@10 | 89.6 | VAST |
| Video | MSR-VTT | text-to-video R@5 | 84.3 | VAST |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.6 | VAST |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.4 | VAST |
| Video Question Answering | MSRVTT-QA | Accuracy | 50.1 | VAST |
| Image Captioning | COCO Captions | CIDEr | 149 | VAST |
| Image Captioning | COCO Captions | SPICE | 27 | VAST |
| Video Captioning | MSR-VTT | BLEU-4 | 56.7 | VAST |
| Video Captioning | MSR-VTT | CIDEr | 78 | VAST |
| Video Captioning | VATEX | BLEU-4 | 45 | VAST |
| Video Captioning | VATEX | CIDEr | 99.5 | VAST |
| Video Captioning | TVC | BLEU-4 | 19.9 | VAST |
| Video Captioning | TVC | CIDEr | 74.1 | VAST |
| Video Captioning | YouCook2 | BLEU-4 | 18.2 | VAST |
| Video Captioning | YouCook2 | CIDEr | 1.99 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 90.4 | VAST |
| Video Retrieval | VATEX | text-to-video R@1 | 83 | VAST |
| Video Retrieval | VATEX | text-to-video R@10 | 99.2 | VAST |
| Video Retrieval | VATEX | text-to-video R@5 | 98.2 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@1 | 70.5 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@10 | 95.5 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@5 | 90.9 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@1 | 50.4 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@10 | 80.8 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@5 | 74.3 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@1 | 72 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@10 | 91.4 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@5 | 89 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 63.9 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 89.6 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 84.3 | VAST |
| Audio Captioning | Clotho | BLEU-4 | 19 | VAST |
| Audio Captioning | Clotho | CIDEr | 0.519 | VAST |
| Audio Captioning | Clotho | METEOR | 19.3 | VAST |
| Audio Captioning | Clotho | ROUGE-L | 40.8 | VAST |
| Audio Captioning | AudioCaps | BLEU-4 | 0.295 | VAST |
| Audio Captioning | AudioCaps | CIDEr | 0.781 | VAST |
| Audio Captioning | AudioCaps | METEOR | 0.247 | VAST |
| Audio Captioning | AudioCaps | ROUGE-L | 0.509 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Text to Audio Retrieval | AudioCaps | R@1 | 52 | VAST |
| Text to Audio Retrieval | AudioCaps | R@10 | 82.9 | VAST |
| Text to Audio Retrieval | AudioCaps | R@5 | 76.8 | VAST |
| Text to Audio Retrieval | Clotho | R@1 | 26.9 | VAST |
| Text to Audio Retrieval | Clotho | R@10 | 66.1 | VAST |
| Text to Audio Retrieval | Clotho | R@5 | 53.2 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 49.3 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.9 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 68.3 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 55.5 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 79.6 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 74.3 | VAST |
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy | 80.7 | VAST |
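
The retrieval rows above report Recall@K (R@1, R@5, R@10): the percentage of text queries whose correct item is ranked within the top K candidates. As a rough illustration of how such numbers are commonly computed, here is a minimal sketch. It assumes a square similarity matrix in which item `i` is the single correct match for query `i`; the function name `recall_at_k` and the toy scores are hypothetical, and this is not the evaluation code used for VAST.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Recall@K in percent for text-to-item retrieval.

    similarity[i][j] is the score between text query i and candidate j.
    Assumption for this sketch: candidate i is the only correct match for query i.
    """
    ranks: List[int] = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    n = len(ranks)
    return {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}


# Toy 3x3 example: queries 0 and 2 rank their match first, query 1 ranks it second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
toy_result = recall_at_k(toy_similarity, ks=(1, 2))
print(toy_result)
```

Benchmarks with several captions per video, or several correct items per query, need a ground-truth mapping instead of the diagonal assumption made here.
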