VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang

Published: 2023-04-17

Tasks: Cross-Modal Retrieval, Question Answering, Text Generation, Text to Audio Retrieval, Video Retrieval, Video Question Answering, Audio Captioning, Video Captioning, Image Captioning, Audio-visual Question Answering, Audio-Video Question Answering (AVQA), Retrieval, Visual Question Answering (VQA), Zero-shot Text to Audio Retrieval, Conditional Text Generation, TGIF-Frame, Audio-Visual Question Answering (AVQA)
Paper | PDF | Code (official)

Abstract

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, building vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains 1M audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR.
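
The MGA objective described above lends itself to a compact illustration. The sketch below shows a symmetric InfoNCE-style contrastive loss applied per modality group (text-vision, text-audio, text-audiovisual). It is a minimal sketch in the spirit of the abstract, not the authors' released code; the temperature value, embedding shapes, and the averaged audiovisual fusion are assumptions.

    import torch
    import torch.nn.functional as F

    def group_alignment_loss(text_emb, group_emb, temperature=0.07):
        # text_emb, group_emb: (B, D) tensors; row i of each side is a positive pair.
        # Symmetric InfoNCE: contrast text->group and group->text.
        text_emb = F.normalize(text_emb, dim=-1)
        group_emb = F.normalize(group_emb, dim=-1)
        logits = text_emb @ group_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    # MGA aligns language with three modality groups; a simple (assumed)
    # fusion for the audiovisual group is an average of the two embeddings:
    # loss = (group_alignment_loss(t, v)
    #         + group_alignment_loss(t, a)
    #         + group_alignment_loss(t, (v + a) / 2))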

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.6 | VALOR
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 78.46 | VALOR
Visual Question Answering (VQA) | VQA v2 test-std | Overall | 78.62 | VALOR
Video Question Answering | ActivityNet-QA | Accuracy | 48.6 | VALOR
Video Question Answering | MSRVTT-QA | Accuracy | 49.2 | VALOR
Image Captioning | COCO Captions | CIDEr | 152.5 | VALOR
Image Captioning | COCO Captions | SPICE | 25.7 | VALOR
Video Captioning | MSR-VTT | BLEU-4 | 54.4 | VALOR
Video Captioning | MSR-VTT | CIDEr | 74 | VALOR
Video Captioning | MSR-VTT | METEOR | 32.9 | VALOR
Video Captioning | MSR-VTT | ROUGE-L | 68 | VALOR
Video Captioning | VATEX | BLEU-4 | 45.6 | VALOR
Video Captioning | VATEX | CIDEr | 95.8 | VALOR
Video Captioning | VATEX | METEOR | 29.4 | VALOR
Video Captioning | VATEX | ROUGE-L | 57.4 | VALOR
Video Captioning | MSVD | BLEU-4 | 80.7 | VALOR
Video Captioning | MSVD | CIDEr | 178.5 | VALOR
Video Captioning | MSVD | METEOR | 51 | VALOR
Video Captioning | MSVD | ROUGE-L | 87.9 | VALOR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 61.4 | VALOR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 84.4 | VALOR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 90.9 | VALOR
Video Retrieval | VATEX | Text-to-video R@1 | 78.5 | VALOR
Video Retrieval | VATEX | Text-to-video R@5 | 97.1 | VALOR
Video Retrieval | VATEX | Text-to-video R@10 | 98.7 | VALOR
Video Retrieval | ActivityNet | Text-to-video R@1 | 70.1 | VALOR
Video Retrieval | ActivityNet | Text-to-video R@5 | 90.8 | VALOR
Video Retrieval | ActivityNet | Text-to-video R@10 | 95.3 | VALOR
Video Retrieval | DiDeMo | Text-to-video R@1 | 61.5 | VALOR
Video Retrieval | DiDeMo | Text-to-video R@5 | 85.3 | VALOR
Video Retrieval | DiDeMo | Text-to-video R@10 | 90.4 | VALOR
Video Retrieval | MSR-VTT | Text-to-video R@1 | 59.9 | VALOR
Video Retrieval | MSR-VTT | Text-to-video R@5 | 83.5 | VALOR
Video Retrieval | MSR-VTT | Text-to-video R@10 | 89.6 | VALOR
Video Retrieval | LSMDC | Text-to-video R@1 | 34.2 | VALOR
Video Retrieval | LSMDC | Text-to-video R@5 | 56 | VALOR
Video Retrieval | LSMDC | Text-to-video R@10 | 64.1 | VALOR
Audio Captioning | Clotho | BLEU-4 | 16.2 | VALOR
Audio Captioning | Clotho | CIDEr | 0.423 | VALOR
Audio Captioning | Clotho | METEOR | 17.4 | VALOR
Audio Captioning | Clotho | ROUGE-L | 38.2 | VALOR
Audio Captioning | AudioCaps | BLEU-4 | 0.27 | VALOR
Audio Captioning | AudioCaps | CIDEr | 0.741 | VALOR
Audio Captioning | AudioCaps | METEOR | 0.231 | VALOR
Audio Captioning | AudioCaps | ROUGE-L | 0.494 | VALOR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 61.4 | VALOR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 84.4 | VALOR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 90.9 | VALOR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 61.4 | VALOR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 84.4 | VALOR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 90.9 | VALOR
Text to Audio Retrieval | AudioCaps | R@1 | 40.1 | VALOR
Text to Audio Retrieval | AudioCaps | R@5 | 73.9 | VALOR
Text to Audio Retrieval | AudioCaps | R@10 | 83.1 | VALOR
Text to Audio Retrieval | Clotho | R@1 | 17.5 | VALOR
Text to Audio Retrieval | Clotho | R@5 | 42.7 | VALOR
Text to Audio Retrieval | Clotho | R@10 | 55.3 | VALOR
Audio-visual Question Answering | MUSIC-AVQA | Accuracy | 78.9 | VALOR
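
For readers unfamiliar with the retrieval metrics above, R@K (recall at K) is the fraction of queries whose ground-truth match appears among the top K retrieved candidates. Below is a minimal sketch of the computation, assuming a precomputed query-by-candidate similarity matrix in which query i matches candidate i; the variable names are illustrative, not taken from the VALOR codebase.

    import numpy as np

    def recall_at_k(sim, k):
        # sim: (Q, C) similarity matrix; ground truth for query q is candidate q.
        order = np.argsort(-sim, axis=1)           # candidates sorted best-first
        ranks = np.array([np.where(order[q] == q)[0][0]
                          for q in range(sim.shape[0])])
        return float(np.mean(ranks < k))           # fraction ranked in the top k

    # Example (hypothetical embeddings): text-to-video R@1 / R@5 / R@10
    # sim = text_embs @ video_embs.T
    # print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))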

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)