Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, Antonio Torralba

2018-10-14
Cross-Modal Retrieval, General Classification, Retrieval

Abstract

In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M+ affords the ability to train high-capacity models on aligned, multimodal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M+ dataset and food and cooking in general. Code, data and models are publicly available.
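
To make "semantic vector arithmetic" concrete, the following is a minimal sketch of an analogy query in a shared embedding space. It is not the authors' code: the 3-dimensional vectors, the dish names, and the helper `analogy` are all made up for illustration, and real embeddings would come from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(emb, a, b, c):
    """Return the key whose vector is closest to emb[a] - emb[b] + emb[c].

    The three input keys are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = {k: v for k, v in emb.items() if k not in (a, b, c)}
    return max(candidates, key=lambda k: cosine(target, candidates[k]))


# Hypothetical toy embeddings; axes loosely read as (chicken, pizza, salad).
toy = {
    "chicken pizza": [1.0, 1.0, 0.0],
    "pizza": [0.0, 1.0, 0.0],
    "salad": [0.0, 0.0, 1.0],
    "chicken salad": [1.0, 0.0, 1.0],
    "lasagna": [0.0, 0.8, 0.2],
}

result = analogy(toy, "chicken pizza", "pizza", "salad")
print(result)
```

With these toy vectors the query vector equals the one assigned to "chicken salad", so that key is returned. In a learned space the match is only approximate, which is why the nearest neighbor is taken rather than an exact lookup.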

Results

Task	Dataset	Metric	Value	Model
Image Retrieval with Multi-Modal Query	Recipe1M+	Image-to-text R@1	17	Marin et al.
Image Retrieval with Multi-Modal Query	Recipe1M+	Text-to-image R@1	21	Marin et al.
Cross-Modal Information Retrieval	Recipe1M+	Image-to-text R@1	17	Marin et al.
Cross-Modal Information Retrieval	Recipe1M+	Text-to-image R@1	21	Marin et al.
Cross-Modal Retrieval	Recipe1M+	Image-to-text R@1	17	Marin et al.
Cross-Modal Retrieval	Recipe1M+	Text-to-image R@1	21	Marin et al.
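
The table does not state units or the evaluation protocol. As a rough sketch, assuming R@1 is the percentage of queries whose true match is ranked first by cosine similarity, recall at K could be computed as below. The function `recall_at_k` and the toy 2-D embeddings are hypothetical and only illustrate the metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k=1):
    """Percentage of queries whose true match ranks within the top k.

    Assumes queries[i] and candidates[i] form the ground-truth pair.
    """
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Zero-based rank: number of candidates strictly more similar
        # than the true match.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(queries)


# Made-up embeddings: three images and their three paired recipes.
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipe_emb = [[0.9, 0.1], [0.8, 0.9], [0.1, 1.0]]

r1 = recall_at_k(image_emb, recipe_emb, k=1)  # image-to-text direction
r3 = recall_at_k(image_emb, recipe_emb, k=3)
print(r1, r3)
```

Swapping the two arguments gives the text-to-image direction. Under this reading, an image-to-text R@1 of 17 would mean the correct recipe is ranked first for 17% of the image queries.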
