Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

Published: 2023-05-29 · NeurIPS 2023

Tasks: Cross-Modal Retrieval, Zero-Shot Cross-Modal Retrieval, Text to Audio Retrieval, Video Retrieval, Zero-Shot Video Retrieval, Video Question Answering, Audio Captioning, Video Captioning, Image Captioning, Audio-visual Question Answering, Large Language Model, Visual Question Answering (VQA), TGIF-Frame, Language Modelling

Links: Paper · PDF · Code (official)

Abstract

Vision and text have been fully explored in contemporary video-text foundation models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we establish connections between the multi-modality video tracks, including Vision, Audio, and Subtitle, and Text, by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video, and better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
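The caption-integration step described above feeds an LLM the vision caption, audio caption, and subtitle along with an instructional prompt. The paper's exact prompt wording is not reproduced on this page; the sketch below is a hypothetical illustration of how such a fusion prompt could be assembled (the function name and prompt text are this page's own, not the authors').

```python
def build_omni_caption_prompt(vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Assemble an instruction prompt asking an off-the-shelf LLM to fuse
    per-modality descriptions into one omni-modality caption.
    Illustrative wording only; the actual VAST-27M prompt may differ."""
    return (
        "Describe the video in a single caption that integrates all of the "
        "following information.\n"
        f"Visual description: {vision_caption}\n"
        f"Audio description: {audio_caption}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )
```

The resulting string would be sent to the LLM once per clip, and the model's completion stored as that clip's omni-modality caption.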

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video | VATEX | text-to-video R@1 | 83 | VAST |
| Video | VATEX | text-to-video R@10 | 99.2 | VAST |
| Video | VATEX | text-to-video R@5 | 98.2 | VAST |
| Video | ActivityNet | text-to-video R@1 | 70.5 | VAST |
| Video | ActivityNet | text-to-video R@10 | 95.5 | VAST |
| Video | ActivityNet | text-to-video R@5 | 90.9 | VAST |
| Video | YouCook2 | text-to-video R@1 | 50.4 | VAST |
| Video | YouCook2 | text-to-video R@10 | 80.8 | VAST |
| Video | YouCook2 | text-to-video R@5 | 74.3 | VAST |
| Video | DiDeMo | text-to-video R@1 | 72 | VAST |
| Video | DiDeMo | text-to-video R@10 | 91.4 | VAST |
| Video | DiDeMo | text-to-video R@5 | 89 | VAST |
| Video | MSR-VTT | text-to-video R@1 | 63.9 | VAST |
| Video | MSR-VTT | text-to-video R@10 | 89.6 | VAST |
| Video | MSR-VTT | text-to-video R@5 | 84.3 | VAST |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.6 | VAST |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.4 | VAST |
| Video Question Answering | MSRVTT-QA | Accuracy | 50.1 | VAST |
| Image Captioning | COCO Captions | CIDEr | 149 | VAST |
| Image Captioning | COCO Captions | SPICE | 27 | VAST |
| Video Captioning | MSR-VTT | BLEU-4 | 56.7 | VAST |
| Video Captioning | MSR-VTT | CIDEr | 78 | VAST |
| Video Captioning | VATEX | BLEU-4 | 45 | VAST |
| Video Captioning | VATEX | CIDEr | 99.5 | VAST |
| Video Captioning | TVC | BLEU-4 | 19.9 | VAST |
| Video Captioning | TVC | CIDEr | 74.1 | VAST |
| Video Captioning | YouCook2 | BLEU-4 | 18.2 | VAST |
| Video Captioning | YouCook2 | CIDEr | 1.99 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 90.4 | VAST |
| Video Retrieval | VATEX | text-to-video R@1 | 83 | VAST |
| Video Retrieval | VATEX | text-to-video R@10 | 99.2 | VAST |
| Video Retrieval | VATEX | text-to-video R@5 | 98.2 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@1 | 70.5 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@10 | 95.5 | VAST |
| Video Retrieval | ActivityNet | text-to-video R@5 | 90.9 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@1 | 50.4 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@10 | 80.8 | VAST |
| Video Retrieval | YouCook2 | text-to-video R@5 | 74.3 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@1 | 72 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@10 | 91.4 | VAST |
| Video Retrieval | DiDeMo | text-to-video R@5 | 89 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 63.9 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 89.6 | VAST |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 84.3 | VAST |
| Audio Captioning | Clotho | BLEU-4 | 19 | VAST |
| Audio Captioning | Clotho | CIDEr | 0.519 | VAST |
| Audio Captioning | Clotho | METEOR | 19.3 | VAST |
| Audio Captioning | Clotho | ROUGE-L | 40.8 | VAST |
| Audio Captioning | AudioCaps | BLEU-4 | 0.295 | VAST |
| Audio Captioning | AudioCaps | CIDEr | 0.781 | VAST |
| Audio Captioning | AudioCaps | METEOR | 0.247 | VAST |
| Audio Captioning | AudioCaps | ROUGE-L | 0.509 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 91 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | VAST |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.5 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 68 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.8 | VAST |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.7 | VAST |
| Text to Audio Retrieval | AudioCaps | R@1 | 52 | VAST |
| Text to Audio Retrieval | AudioCaps | R@10 | 82.9 | VAST |
| Text to Audio Retrieval | AudioCaps | R@5 | 76.8 | VAST |
| Text to Audio Retrieval | Clotho | R@1 | 26.9 | VAST |
| Text to Audio Retrieval | Clotho | R@10 | 66.1 | VAST |
| Text to Audio Retrieval | Clotho | R@5 | 53.2 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 49.3 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.9 | VAST |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 68.3 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 55.5 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 79.6 | VAST |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 74.3 | VAST |
| Audio-visual Question Answering | MUSIC-AVQA | Acc | 80.7 | VAST |
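Most retrieval rows above report Recall@K (R@1/R@5/R@10): the percentage of queries whose ground-truth item appears among the top K ranked candidates. A minimal sketch of how that metric is computed from a text-to-video similarity matrix (the matrix values below are made up for illustration, not results from the paper):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for retrieval. sim[i, j] is the score of text query i
    against candidate j; the correct match for query i is assumed to be
    candidate i (the diagonal), as in standard retrieval benchmarks."""
    ranked = np.argsort(-sim, axis=1)           # candidates by descending score
    targets = np.arange(sim.shape[0])[:, None]  # ground-truth index per query
    hits = (ranked[:, :k] == targets).any(axis=1)
    return 100.0 * hits.mean()

# Toy 3-query similarity matrix (hypothetical scores):
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.3, 0.4, 0.2]])   # query 2's match is only ranked second
```

Here `recall_at_k(sim, 1)` is about 66.7 (two of three queries rank their match first), while `recall_at_k(sim, 3)` is 100.0, mirroring how R@10 in the table is always at least R@5, which is at least R@1.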
