Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

Ahmet Iscen, Alireza Fathi, Cordelia Schmid

2023-04-11 | CVPR 2023 | Image Classification, Long-tail Learning, Learning with noisy labels

Abstract

Retrieval-augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving examples similar to the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and demonstrate the performance of different memory representations. We evaluate our method on three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and WebVision datasets.
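
As a rough illustration of the idea described in the abstract (retrieve similar examples from a memory, then let attention weigh each retrieved example), here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head dot-product attention, the random projection matrices, the toy memory size, and the residual fusion at the end are all assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_top_k(query, memory_keys, k):
    """Return indices of the k memory keys most cosine-similar to the query."""
    sims = l2_normalize(memory_keys) @ l2_normalize(query)
    return np.argsort(-sims)[:k]

def attend_over_retrieved(query, keys, values, w_q, w_k, w_v):
    """Single-head dot-product attention of one query over retrieved items.

    Returns the attention-weighted value vector and the per-item weights.
    Items with low weights contribute little to the output, which is the
    sense in which irrelevant retrievals are down-weighted in this sketch.
    """
    q = query @ w_q                          # (d,)
    k = keys @ w_k                           # (k, d)
    v = values @ w_v                         # (k, d)
    logits = k @ q / np.sqrt(q.shape[0])     # (k,)
    logits = logits - logits.max()           # numerical stability for softmax
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ v, weights

# Toy data (hypothetical): a memory of 100 items with 16-dim keys and values.
rng = np.random.default_rng(0)
d, n_mem, k = 16, 100, 8
memory_keys = rng.normal(size=(n_mem, d))
memory_values = rng.normal(size=(n_mem, d))
query = rng.normal(size=d)
# Random projections stand in for parameters that would normally be learned.
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

idx = retrieve_top_k(query, memory_keys, k)
refined, weights = attend_over_retrieved(
    query, memory_keys[idx], memory_values[idx], w_q, w_k, w_v
)
fused = query + refined  # assumed residual fusion, chosen here for simplicity
```

In this sketch the memory keys could stand for image embeddings and the values for either image or text embeddings of the same pairs; which representation works best is the kind of question the paper studies, and the sketch does not settle it.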

Results

Task	Dataset	Metric	Value	Model
Image Classification	WebVision-1000	Top-1 Accuracy	83.6	MAM (ViT-B/16)
Image Classification	Places-LT	Top-1 Accuracy	51.4	MAM (ViT-B/16)
Image Classification	ImageNet-LT	Top-1 Accuracy	82.3	MAM (ViT-B/16)
Few-Shot Image Classification	Places-LT	Top-1 Accuracy	51.4	MAM (ViT-B/16)
Few-Shot Image Classification	ImageNet-LT	Top-1 Accuracy	82.3	MAM (ViT-B/16)
Generalized Few-Shot Classification	Places-LT	Top-1 Accuracy	51.4	MAM (ViT-B/16)
Generalized Few-Shot Classification	ImageNet-LT	Top-1 Accuracy	82.3	MAM (ViT-B/16)
Long-tail Learning	Places-LT	Top-1 Accuracy	51.4	MAM (ViT-B/16)
Long-tail Learning	ImageNet-LT	Top-1 Accuracy	82.3	MAM (ViT-B/16)
Generalized Few-Shot Learning	Places-LT	Top-1 Accuracy	51.4	MAM (ViT-B/16)
Generalized Few-Shot Learning	ImageNet-LT	Top-1 Accuracy	82.3	MAM (ViT-B/16)
