Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
We present ImageBind, an approach to learning a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are needed to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models and extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
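To make cross-modal retrieval and "composing modalities with arithmetic" concrete, here is a minimal sketch of how both could work once every modality lives in one embedding space. It is not the authors' implementation: the helper names (`l2_normalize`, `retrieve`, `compose`) and the 4-dimensional toy vectors are assumptions for illustration, standing in for the outputs of real modality encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve(query, gallery, k=1):
    """Return indices of the k gallery rows most similar to the query."""
    scores = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-scores)[:k]

def compose(a, b):
    """Combine two embeddings by summing their unit-normalized vectors."""
    return l2_normalize(l2_normalize(a) + l2_normalize(b))

# Hypothetical embeddings; in practice these would come from trained encoders.
gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],  # item 0: matches the image concept only
    [0.0, 1.0, 0.0, 0.0],  # item 1: matches the audio concept only
    [1.0, 1.0, 0.0, 0.0],  # item 2: matches both concepts
    [0.0, 0.0, 1.0, 0.0],  # item 3: unrelated
])
image_query = np.array([0.9, 0.1, 0.0, 0.0])
audio_query = np.array([0.1, 0.9, 0.0, 0.0])

top_image = retrieve(image_query, gallery)
top_composed = retrieve(compose(image_query, audio_query), gallery)
print(top_image[0], top_composed[0])
```

In this toy setup, the image query alone retrieves item 0, while the summed image and audio query retrieves item 2, the gallery entry that carries both concepts.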
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | Vinoground | Group Score | 0.6 | ImageBind |
| Relation Extraction | Vinoground | Text Score | 9.4 | ImageBind |
| Relation Extraction | Vinoground | Video Score | 3.4 | ImageBind |
| Semantic Segmentation | ADE20K | mAP | 20.2 | ImageBind |
| Semantic Segmentation | ADE20K | mIoU | 19.7 | ImageBind |
| Semantic Segmentation | ADE20K | mAP | 19.7 | ImageBind |
| Semantic Segmentation | ADE20K | mIoU | 20.5 | ImageBind |
| Temporal Relation Extraction | Vinoground | Group Score | 0.6 | ImageBind |
| Temporal Relation Extraction | Vinoground | Text Score | 9.4 | ImageBind |
| Temporal Relation Extraction | Vinoground | Video Score | 3.4 | ImageBind |
| 10-shot image generation | ADE20K | mAP | 20.2 | ImageBind |
| 10-shot image generation | ADE20K | mIoU | 19.7 | ImageBind |
| 10-shot image generation | ADE20K | mAP | 19.7 | ImageBind |
| 10-shot image generation | ADE20K | mIoU | 20.5 | ImageBind |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 36.8 | ImageBind |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 61.8 | ImageBind |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 70 | ImageBind |
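
The R@1, R@5, and R@10 columns above are recall-at-k retrieval metrics. As a rough sketch of how such a number is typically computed (assuming a square query-by-item similarity matrix in which item i is the single correct match for query i; the function name and toy matrix are illustrative, not taken from the benchmark code):

```python
import numpy as np

def recall_at_k(similarity, k):
    """Percentage of queries whose correct item ranks within the top k.

    Assumes similarity[i, j] scores query i against item j, and that
    item i is the only correct match for query i.
    """
    n = similarity.shape[0]
    correct = similarity[np.arange(n), np.arange(n)][:, None]
    # Rank of the correct item = number of items scored strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return 100.0 * float((ranks < k).mean())

# Toy 3x3 similarity matrix (hypothetical scores).
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.7, 0.6],
])
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(r1, r2)
```

Here only the first query ranks its correct item first, so R@1 is one in three, while every correct item appears within the top two, so R@2 is 100.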