Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Junyang Chen, Hanjiang Lai

2023-11-13 · Contrastive Learning · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Language Modelling · Image Retrieval
Paper · PDF · Code (official)

Abstract

Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained more and more attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, the pre-trained vision-language models and CIR tasks have substantial discrepancies, where the vision-language models focus on learning the similarities but CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, to reduce the gap, we reformulate the contrastive learning of the vision-language model as the CIR task, where we randomly mask input image patches to generate $\langle$masked image, text, image$\rangle$ triplet from an image-text pair. Then, we propose a simple but novel pre-trained masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning can learn to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on four ZS-CIR datasets, including FashionIQ, CIRR, CIRCO, and GeneCIS. Our codes are available at https://github.com/Chen-Junyang-cn/PLI
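The training recipe described above — mask random patches of an image to form a ⟨masked image, text, image⟩ triplet, then contrastively train the fused (masked image, text) query to retrieve the original image — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, patch size, mask ratio, and the InfoNCE-style loss formulation are our assumptions for exposition.

```python
import numpy as np

def mask_patches(image, patch=4, mask_ratio=0.75, rng=None):
    """Zero out a random fraction of non-overlapping patches.

    image: (H, W, C) array; H and W must be divisible by `patch`.
    Hypothetical helper illustrating the masking step, not the paper's code.
    """
    rng = rng or np.random.default_rng(0)
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    masked = image.copy()
    # Pick which patches to mask, without replacement.
    idx = rng.choice(n_patches, size=int(n_patches * mask_ratio), replace=False)
    for i in idx:
        r, col = divmod(i, gw)
        masked[r * patch:(r + 1) * patch, col * patch:(col + 1) * patch, :] = 0.0
    return masked

def info_nce(queries, targets, tau=0.07):
    """InfoNCE-style contrastive loss: query i should match target i.

    queries: fused (masked image, text) embeddings, shape (B, D).
    targets: original-image embeddings, shape (B, D).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on the diagonal
```

In the paper's setting the query embedding would come from the vision-language model's encoders applied to the masked image and the modification text, and the target from its image encoder applied to the unmasked image; the loss then teaches the model to recover what the mask removed, guided by the text.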

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 46.42 | MTCIR (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 11.63 | MTCIR (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 8.03 | MTCIR (BLIP B/16)
Image Retrieval | CIRR | R@5 | 58.87 | MTCIR (BLIP B/16)
Image Retrieval | CIRR | R@5 | 54.58 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 46.42 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 11.63 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 8.03 | MTCIR (BLIP B/16)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 58.87 | MTCIR (BLIP B/16)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 54.58 | MTCIR (CLIP L/14)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)