Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ImageBind: One Embedding Space To Bind Them All

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

2023-05-09 · CVPR 2023

Tasks: Cross-Modal Retrieval, Sound Prompted Semantic Segmentation, Zero-Shot Video Retrieval, Multimodal Deep Learning, Zero-Shot Environment Sound Classification, Zero-shot Scene Classification (unified classes), Retrieval, Zero-shot Classification (unified classes), Speech Prompted Semantic Segmentation, Temporal Relation Extraction, Zero-shot Text to Audio Retrieval, Zero-Shot Learning, Zero-shot Audio Classification

Paper · PDF · Code (official)

Abstract

We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
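The core idea above — every modality mapped into one shared embedding space, so retrieval and "composing modalities with arithmetic" reduce to cosine similarity — can be illustrated with a small sketch. The encoders here are hypothetical stand-ins (a shared bank of concept vectors plus modality-specific noise), not the actual ImageBind models or API.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # dimensionality of the shared embedding space (illustrative)

def normalize(x):
    # Embeddings are compared with cosine similarity, so unit-normalize.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in for modality encoders: in ImageBind each modality has its own
# encoder, but all of them project into the same space. We fake that with
# shared "concept" directions (e.g. dog, car, ocean) plus per-sample noise.
concepts = normalize(rng.normal(size=(3, DIM)))

def encode(concept_id):
    # Hypothetical encoder: concept direction + small modality-specific noise.
    return normalize(concepts[concept_id] + 0.1 * rng.normal(size=DIM))

# Cross-modal retrieval: an audio query (concept 0, say "dog barking")
# against a gallery of image embeddings, ranked by cosine similarity.
audio_query = encode(0)
gallery = np.stack([encode(i) for i in range(3)])
sims = gallery @ audio_query           # dot product of unit vectors
best = int(np.argmax(sims))

# Composing modalities with arithmetic: adding an image embedding and an
# audio embedding yields a query matching content from both.
composed = normalize(encode(1) + encode(2))
top_two = set(np.argsort(concepts @ composed)[-2:].tolist())
```

Because all embeddings live in one space, nothing in the retrieval step knows which modality produced the query — that is the "binding" property the paper exploits.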

Results

Task                          | Dataset    | Metric             | Value | Model
Temporal Relation Extraction  | Vinoground | Group Score        | 0.6   | ImageBind
Temporal Relation Extraction  | Vinoground | Text Score         | 9.4   | ImageBind
Temporal Relation Extraction  | Vinoground | Video Score        | 3.4   | ImageBind
Semantic Segmentation         | ADE20K     | mAP                | 20.2  | ImageBind
Semantic Segmentation         | ADE20K     | mIoU               | 19.7  | ImageBind
Semantic Segmentation         | ADE20K     | mAP                | 19.7  | ImageBind
Semantic Segmentation         | ADE20K     | mIoU               | 20.5  | ImageBind
Zero-Shot Video Retrieval     | MSR-VTT    | text-to-video R@1  | 36.8  | ImageBind
Zero-Shot Video Retrieval     | MSR-VTT    | text-to-video R@5  | 61.8  | ImageBind
Zero-Shot Video Retrieval     | MSR-VTT    | text-to-video R@10 | 70    | ImageBind

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)