ImageBind: One Embedding Space To Bind Them All

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Date: 2023-05-09
Venue: CVPR 2023
Tasks: Cross-Modal Retrieval, Sound Prompted Semantic Segmentation, Zero-Shot Video Retrieval, Multimodal Deep Learning, Zero-Shot Environment Sound Classification, Zero-shot Scene Classification (unified classes), All Retrieval, Zero-shot Classification (unified classes), Speech Prompted Semantic Segmentation, Temporal Relation Extraction, Zero-shot Text to Audio Retrieval, Zero-Shot Learning, Zero-shot Audio Classification

Abstract

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
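The retrieval and "composing modalities with arithmetic" ideas from the abstract can be illustrated with a minimal sketch. It assumes each modality has already been encoded into the same vector space. The 3-D toy vectors, their labels, and the helper names `l2_normalize`, `retrieve`, and `compose` are hypothetical and chosen for illustration; none of this is the ImageBind code or API.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 1) -> np.ndarray:
    """Return the indices of the k gallery rows most similar to the query."""
    scores = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-scores)[:k]


def compose(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine two embeddings by summing their unit vectors and renormalizing."""
    return l2_normalize(l2_normalize(a) + l2_normalize(b))


# Hypothetical image gallery in a toy 3-D joint space (real embeddings are much larger).
gallery_labels = ["beach", "dog", "dog on a beach"]
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])

# Hypothetical embeddings from two other inputs, placed in the same space.
audio_waves = np.array([0.9, 0.1, 0.0])  # stand-in for an audio clip of waves
image_dog = np.array([0.1, 0.9, 0.0])    # stand-in for a photo of a dog

# Cross-modal retrieval: an audio query ranks images directly.
audio_hit = gallery_labels[retrieve(audio_waves, gallery)[0]]

# Embedding arithmetic: the sum of two queries lands nearest the image showing both.
composed_hit = gallery_labels[retrieve(compose(audio_waves, image_dog), gallery)[0]]
```

Because every modality is compared by cosine similarity in one space, the same `retrieve` call works for any query and gallery pairing, which is the property the abstract describes as binding modalities through images.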

Results

Task	Dataset	Metric	Value	Model
Relation Extraction	Vinoground	Group Score	0.6	ImageBind
Relation Extraction	Vinoground	Text Score	9.4	ImageBind
Relation Extraction	Vinoground	Video Score	3.4	ImageBind
Semantic Segmentation	ADE20K	mAP	20.2	ImageBind
Semantic Segmentation	ADE20K	mIoU	19.7	ImageBind
Semantic Segmentation	ADE20K	mAP	19.7	ImageBind
Semantic Segmentation	ADE20K	mIoU	20.5	ImageBind
Temporal Relation Extraction	Vinoground	Group Score	0.6	ImageBind
Temporal Relation Extraction	Vinoground	Text Score	9.4	ImageBind
Temporal Relation Extraction	Vinoground	Video Score	3.4	ImageBind
10-shot image generation	ADE20K	mAP	20.2	ImageBind
10-shot image generation	ADE20K	mIoU	19.7	ImageBind
10-shot image generation	ADE20K	mAP	19.7	ImageBind
10-shot image generation	ADE20K	mIoU	20.5	ImageBind
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	36.8	ImageBind
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	61.8	ImageBind
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	70	ImageBind
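
The R@K retrieval metrics in the table above can be read as the percentage of queries whose correct item appears among the top K results. The following is a minimal sketch of that computation, assuming one ground-truth gallery item per query, stored at the same index as the query. The function name `recall_at_k` and the toy similarity matrix are hypothetical, and the benchmark's exact evaluation protocol may differ.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Percentage of queries whose matching item (same index) ranks in the top k.

    similarity[i, j] is the score between query i and gallery item j.
    """
    # Score of the correct item for each query, shaped as a column for broadcasting.
    correct = np.diag(similarity)[:, None]
    # Rank of the correct item = number of gallery items scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


# Toy scores for 4 text queries against 4 videos (made-up numbers).
toy_similarity = np.array([
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # correct video ranked 2nd
    [0.7, 0.6, 0.5, 0.1],  # correct video ranked 3rd
    [0.1, 0.2, 0.3, 0.9],  # correct video ranked 1st
])

r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
r3 = recall_at_k(toy_similarity, 3)
```

R@K can only stay level or rise as K grows, which matches the ordering of the R@1, R@5, and R@10 values reported in the table.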
