Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Mustafa Shukor, Guillaume Couairon, Asya Grechka, Matthieu Cord

2022-04-20 · Cross-Modal Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders that allow for efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models, such as CLIP, for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets, respectively. The code is available at https://github.com/mshukor/TFood
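The abstract mentions triplet losses with dynamic margins that adapt to task difficulty. A minimal sketch of one plausible reading of that idea is below; the margin schedule and function name here are hypothetical illustrations, not the paper's exact formulation.

```python
import numpy as np

def triplet_loss_dynamic_margin(anchor, positive, negative, base_margin=0.3):
    """Triplet loss whose margin adapts to sample difficulty (illustrative).

    Here the margin grows with the anchor-negative similarity, so
    'harder' negatives are pushed away more aggressively. This is one
    plausible difficulty-adaptive scheme; T-Food's exact variant may differ.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    sim_pos = cos(anchor, positive)
    sim_neg = cos(anchor, negative)
    # Harder negatives (high sim_neg) receive a larger margin.
    margin = base_margin * (1.0 + max(sim_neg, 0.0))
    # Standard hinge form: penalize when the negative is not separated
    # from the positive by at least the (dynamic) margin.
    return max(0.0, sim_neg - sim_pos + margin)
```

With an orthogonal (easy) negative the hinge is inactive and the loss is zero; with a negative identical to the anchor the margin doubles and the loss becomes the full margin.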

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Recipe1M | Image-to-text R@1 | 72.3 | T-Food (CLIP)
Image Retrieval with Multi-Modal Query | Recipe1M | Text-to-image R@1 | 72.6 | T-Food (CLIP)
Image Retrieval with Multi-Modal Query | Recipe1M | Image-to-text R@1 | 68.2 | T-Food
Image Retrieval with Multi-Modal Query | Recipe1M | Text-to-image R@1 | 68.3 | T-Food
Cross-Modal Information Retrieval | Recipe1M | Image-to-text R@1 | 72.3 | T-Food (CLIP)
Cross-Modal Information Retrieval | Recipe1M | Text-to-image R@1 | 72.6 | T-Food (CLIP)
Cross-Modal Information Retrieval | Recipe1M | Image-to-text R@1 | 68.2 | T-Food
Cross-Modal Information Retrieval | Recipe1M | Text-to-image R@1 | 68.3 | T-Food
Cross-Modal Retrieval | Recipe1M | Image-to-text R@1 | 72.3 | T-Food (CLIP)
Cross-Modal Retrieval | Recipe1M | Text-to-image R@1 | 72.6 | T-Food (CLIP)
Cross-Modal Retrieval | Recipe1M | Image-to-text R@1 | 68.2 | T-Food
Cross-Modal Retrieval | Recipe1M | Text-to-image R@1 | 68.3 | T-Food
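The R@1 values above measure the fraction of queries whose true match is ranked first. Because T-Food uses only unimodal encoders at test time, retrieval reduces to a similarity search over precomputed embeddings. A minimal sketch of that evaluation (function name and setup are hypothetical, assuming query i's ground-truth match is gallery item i):

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Recall@k for paired cross-modal retrieval (illustrative sketch).

    Embeddings are L2-normalized so a single matrix product yields cosine
    similarities; this one-matmul retrieval is why unimodal encoders make
    large-scale search efficient compared to cross-attention scoring.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                      # (num_queries, num_gallery)
    ranks = np.argsort(-sims, axis=1)   # best match first
    # A hit if the true index i appears in query i's top-k results.
    hits = (ranks[:, :k] == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())
```

On Recipe1M the queries would be image (or recipe) embeddings and the gallery the paired recipe (or image) embeddings, averaged over 1k or 10k test subsets.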

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)