GLAMI-1M: A Multilingual Image-Text Fashion Dataset

Vaclav Kosar, Antonín Hoskovec, Milan Šulc, Radek Bartyzal

2022-11-17 | BMVC 2022
Tasks: Text Classification, Image-text Classification, Multilingual Image-Text Classification, Image Generation, Classification


Abstract

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at https://github.com/glami/glami-1m

Results

Task	Dataset	Metric	Value	Model
Classification	GLAMI-1M	Top 1 Accuracy %	69.7	EmbraceNet (image+text)
Classification	GLAMI-1M	Top 5 Accuracy %	94	EmbraceNet (image+text)
Classification	GLAMI-1M	Top 1 Accuracy %	32.3	CLIP (zero-shot image+text)
Classification	GLAMI-1M	Top 5 Accuracy %	74.5	CLIP (zero-shot image+text)
Multi-modal Classification	GLAMI-1M	Top 1 Accuracy %	69.7	EmbraceNet (image+text)
Multi-modal Classification	GLAMI-1M	Top 5 Accuracy %	94	EmbraceNet (image+text)
Multi-modal Classification	GLAMI-1M	Top 1 Accuracy %	32.3	CLIP (zero-shot image+text)
Multi-modal Classification	GLAMI-1M	Top 5 Accuracy %	74.5	CLIP (zero-shot image+text)
Image-text Classification	GLAMI-1M	Top 1 Accuracy %	69.7	EmbraceNet (image+text)
Image-text Classification	GLAMI-1M	Top 5 Accuracy %	94	EmbraceNet (image+text)
Image-text Classification	GLAMI-1M	Top 1 Accuracy %	32.3	CLIP (zero-shot image+text)
Image-text Classification	GLAMI-1M	Top 5 Accuracy %	74.5	CLIP (zero-shot image+text)
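
The Top 1 and Top 5 accuracy figures in the table follow the usual definition: a prediction counts as correct when the true class is among the k highest-scoring classes. The sketch below illustrates that definition on toy data. It is not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data (assumption): 4 samples, 5 classes. GLAMI-1M itself has 191 classes.
toy_scores = [
    [0.10, 0.60, 0.10, 0.10, 0.10],  # true class 1 is ranked first
    [0.50, 0.30, 0.10, 0.05, 0.05],  # true class 1 is ranked second
    [0.20, 0.10, 0.40, 0.25, 0.05],  # true class 4 is ranked last
    [0.05, 0.05, 0.10, 0.20, 0.60],  # true class 4 is ranked first
]
toy_labels = [1, 1, 4, 4]

print(top_k_accuracy(toy_scores, toy_labels, k=1))  # 50.0
print(top_k_accuracy(toy_scores, toy_labels, k=3))  # 75.0
```

Because a larger k can only add hits, top-5 accuracy is never lower than top-1 accuracy, which matches the pattern in the table (for example 69.7 versus 94 for EmbraceNet).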
