GLAMI-1M: A Multilingual Image-Text Fashion Dataset

Vaclav Kosar, Antonín Hoskovec, Milan Šulc, Radek Bartyzal

Published: 2022-11-17
Venue: BMVC 2022 11
Tasks: Text Classification, Image-text Classification, Multilingual Image-Text Classification, Image Generation, Classification

Abstract

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at https://github.com/glami/glami-1m.

Results

Task                       | Dataset  | Metric           | Value | Model
Classification             | GLAMI-1M | Top 1 Accuracy % | 69.7  | EmbraceNet (image+text)
Classification             | GLAMI-1M | Top 5 Accuracy % | 94    | EmbraceNet (image+text)
Classification             | GLAMI-1M | Top 1 Accuracy % | 32.3  | CLIP (zero-shot image+text)
Classification             | GLAMI-1M | Top 5 Accuracy % | 74.5  | CLIP (zero-shot image+text)
Multi-modal Classification | GLAMI-1M | Top 1 Accuracy % | 69.7  | EmbraceNet (image+text)
Multi-modal Classification | GLAMI-1M | Top 5 Accuracy % | 94    | EmbraceNet (image+text)
Multi-modal Classification | GLAMI-1M | Top 1 Accuracy % | 32.3  | CLIP (zero-shot image+text)
Multi-modal Classification | GLAMI-1M | Top 5 Accuracy % | 74.5  | CLIP (zero-shot image+text)
Image-text Classification  | GLAMI-1M | Top 1 Accuracy % | 69.7  | EmbraceNet (image+text)
Image-text Classification  | GLAMI-1M | Top 5 Accuracy % | 94    | EmbraceNet (image+text)
Image-text Classification  | GLAMI-1M | Top 1 Accuracy % | 32.3  | CLIP (zero-shot image+text)
Image-text Classification  | GLAMI-1M | Top 5 Accuracy % | 74.5  | CLIP (zero-shot image+text)
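
The results above report Top 1 and Top 5 accuracy. As a minimal sketch of how such a metric is typically computed (not the authors' evaluation code), the snippet below counts a sample as correct when its true class is among the k highest-scoring classes. The function name top_k_accuracy and the toy scores and labels are hypothetical and chosen for illustration only; they are not taken from GLAMI-1M.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: -row[c])
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): 4 samples, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_labels = [1, 1, 0, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"Top 1: {top1:.1f}%  Top 2: {top2:.1f}%")
```

Top k accuracy can only stay the same or increase as k grows, which is consistent with the Top 5 figures in the table being higher than the Top 1 figures for the same model.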
