
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

Published: 2023-10-03

Tasks: Audio Classification, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Multimodal Deep Learning, Zero-Shot Environment Sound Classification, Scene Classification (unified classes), Zero-Shot Action Recognition, Contrastive Learning, Zero-shot Scene Classification (unified classes), Zero-shot Text Retrieval, Zero-shot Classification (unified classes), Temporal Relation Extraction, Zero-shot Text to Audio Retrieval, Zero-shot Audio Classification

Abstract

Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, the current VL pretraining framework is difficult to extend to multiple modalities (N modalities, N >= 3) beyond vision and language. We thus propose LanguageBind, which takes language as the bind across different modalities, because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired from VL pretraining and then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped into a shared feature space, implementing multimodal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset of alignment pairs centered on language. We thus propose VIDAL-10M, a 10M-scale dataset with Video, Infrared, Depth, Audio, and their corresponding Language. In VIDAL-10M, all videos come from short-video platforms and carry complete semantics, rather than being truncated segments of long videos, and all video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind achieves superior performance across 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments provide evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities. Code is available at https://github.com/PKU-YuanGroup/LanguageBind
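The training recipe described in the abstract, a frozen pretrained language encoder plus a contrastively trained encoder per new modality, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of one such training step, not the authors' implementation: `lang_encoder` and `modality_encoder` are placeholder `nn.Module`s standing in for the actual architectures.

```python
import torch
import torch.nn.functional as F

def contrastive_step(lang_encoder, modality_encoder, texts, inputs,
                     temperature=0.07):
    """One CLIP-style contrastive step binding a new modality to language.

    `texts` and `inputs` are a batch of paired examples; row i of each
    is the ground-truth pair. Placeholder encoders are assumed to return
    [batch, dim] embeddings.
    """
    with torch.no_grad():  # language encoder stays frozen, per the paper
        t = F.normalize(lang_encoder(texts), dim=-1)
    m = F.normalize(modality_encoder(inputs), dim=-1)

    logits = m @ t.T / temperature  # pairwise cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE: modality-to-text and text-to-modality
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
    return loss
```

Because the text embeddings are computed under `torch.no_grad()`, only the modality encoder receives gradients, which is what lets each new modality (depth, infrared, audio) be bound into the shared language space independently of the others.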

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Relation Extraction | Vinoground | Group Score | 1.2 | LanguageBind |
| Relation Extraction | Vinoground | Text Score | 10.6 | LanguageBind |
| Relation Extraction | Vinoground | Video Score | 5 | LanguageBind |
| Temporal Relation Extraction | Vinoground | Group Score | 1.2 | LanguageBind |
| Temporal Relation Extraction | Vinoground | Text Score | 10.6 | LanguageBind |
| Temporal Relation Extraction | Vinoground | Video Score | 5 | LanguageBind |
| Zero-Shot Action Recognition | Kinetics | Top-1 Accuracy | 64.1 | LanguageBind |
| Zero-Shot Action Recognition | Kinetics | Top-5 Accuracy | 85.7 | LanguageBind |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 44.8 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 70 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 78.7 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 2 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 40.9 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 66.4 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 75.7 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text Median Rank | 2 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 42.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 67.5 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 76 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 2 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 38.3 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 65.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 77.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text Median Rank | 3 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 54.1 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 81.1 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 88.1 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 1 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 69.7 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 91.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 97.9 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text Median Rank | 1 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 53.9 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 80.4 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 87.8 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | text-to-video Median Rank | 1 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 72 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 91.4 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 96.3 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | MSVD | video-to-text Median Rank | 1 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 39.9 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 66.1 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 74.6 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 39.8 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 67.8 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 76.2 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 39.7 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 65.5 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 73.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 38.4 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 66.6 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 77.9 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 41 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 68.4 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 80 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 39.1 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 69.8 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 81.1 | LanguageBind (ViT-H/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 38.4 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 66.6 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 77.9 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 35.7 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 65.8 | LanguageBind (ViT-L/14) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 77.8 | LanguageBind (ViT-L/14) |
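For reference, the retrieval metrics in the table (R@1, R@5, R@10, and Median Rank) are standard rank-based statistics. The following is a minimal sketch of how they are typically computed, not the evaluation code behind these numbers: it assumes a square similarity matrix `sim` of shape [num_texts, num_videos] in which `sim[i, i]` scores the ground-truth pair.

```python
import torch

def retrieval_metrics(sim: torch.Tensor) -> dict:
    """Recall@K and Median Rank for text-to-video retrieval.

    `sim` is a placeholder [N, N] similarity matrix whose diagonal
    holds the ground-truth pair scores.
    """
    # for each query, sort candidates by descending similarity
    order = sim.argsort(dim=-1, descending=True)
    gt = torch.arange(sim.size(0)).unsqueeze(-1)
    # rank of the ground-truth candidate (0 = retrieved first)
    ranks = (order == gt).float().argmax(dim=-1)
    return {
        "R@1": (ranks < 1).float().mean().item() * 100,
        "R@5": (ranks < 5).float().mean().item() * 100,
        "R@10": (ranks < 10).float().mean().item() * 100,
        "MedR": ranks.median().item() + 1,  # 1-indexed median rank
    }
```

Passing `sim.T` gives the video-to-text direction; note that benchmarks with multiple captions per video (e.g. MSVD) typically credit the best-ranked ground-truth caption, which this simplified diagonal-matching sketch does not handle.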
