Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

Published: 2021-04-26

Tasks: Visual Grounding, Referring Expression, Referring Expression Comprehension, Generalized Referring Expression Comprehension, Referring Expression Segmentation, Phrase Grounding, Question Answering, Visual Question Answering, Visual Question Answering (VQA), Referring Image Matting (RefMatte-RW100), Referring Image Matting (Expression-based), Referring Image Matting (Keyword-based)

Links: Paper · PDF · Code (official and community implementations)

Abstract

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
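The core idea the abstract describes — fusing the two modalities early by running a transformer jointly over image features and text tokens — can be illustrated with a minimal numpy sketch. Everything here (dimensions, the single attention head, random projections) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d):
    # Single-head self-attention over the concatenated sequence.
    # With image and text tokens in one sequence, every image token
    # can attend to every text token and vice versa (early fusion).
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = softmax(q @ k.T / np.sqrt(d))
    return att @ v

d = 32
img_feats = rng.normal(size=(49, d))  # e.g. a 7x7 CNN feature map, flattened
txt_feats = rng.normal(size=(12, d))  # token embeddings for the text query
fused = np.concatenate([img_feats, txt_feats], axis=0)  # early fusion
out = self_attention(fused, d)
print(out.shape)  # prints (61, 32)
```

This contrasts with the "black box detector" pipeline the abstract criticizes, where regions are extracted before the text is ever seen; here the text conditions the visual representation from the first attention layer.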

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | GQA test-std | Accuracy | 62.45 | MDETR-ENB5 |
| Visual Question Answering (VQA) | CLEVR | Accuracy | 99.7 | MDETR |
| Visual Question Answering (VQA) | CLEVR-Humans | Accuracy | 81.7 | MDETR |
| Phrase Grounding | Flickr30k Entities Test | R@1 | 84.3 | MDETR-ENB5 |
| Phrase Grounding | Flickr30k Entities Test | R@5 | 93.9 | MDETR-ENB5 |
| Phrase Grounding | Flickr30k Entities Test | R@10 | 95.8 | MDETR-ENB5 |
| Instance Segmentation | PhraseCut | Mean IoU | 53.7 | MDETR-ENB3 |
| Instance Segmentation | PhraseCut | Pr@0.5 | 57.5 | MDETR-ENB3 |
| Instance Segmentation | PhraseCut | Pr@0.7 | 39.9 | MDETR-ENB3 |
| Instance Segmentation | PhraseCut | Pr@0.9 | 11.9 | MDETR-ENB3 |
| Referring Expression Segmentation | PhraseCut | Mean IoU | 53.7 | MDETR-ENB3 |
| Referring Expression Segmentation | PhraseCut | Pr@0.5 | 57.5 | MDETR-ENB3 |
| Referring Expression Segmentation | PhraseCut | Pr@0.7 | 39.9 | MDETR-ENB3 |
| Referring Expression Segmentation | PhraseCut | Pr@0.9 | 11.9 | MDETR-ENB3 |
| Referring Image Matting | RefMatte | MAD | 0.0482 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MAD(E) | 0.0515 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE | 0.0434 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE(E) | 0.0463 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD | 84.7 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD(E) | 90.45 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MAD | 0.0183 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MAD(E) | 0.019 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE | 0.0137 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE(E) | 0.0141 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD | 32.27 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD(E) | 33.52 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MAD | 0.0751 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MAD(E) | 0.0779 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE | 0.0675 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | MSE(E) | 0.07 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD | 131.58 | MDETR (ResNet-101) |
| Referring Image Matting | RefMatte | SAD(E) | 136.59 | MDETR (ResNet-101) |
| Generalized Referring Expression Comprehension | gRefCOCO | N-acc. | 36.1 | MDETR |
| Generalized Referring Expression Comprehension | gRefCOCO | Precision@(F1=1, IoU≥0.5) | 41.5 | MDETR |
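The segmentation metrics in the table (Mean IoU and the Pr@0.5/0.7/0.9 thresholds on PhraseCut) are standard: Pr@τ is the fraction of examples whose predicted mask overlaps the ground truth with IoU at least τ. A minimal sketch of how such metrics are typically computed — the function names and toy masks are illustrative, not taken from the MDETR or PhraseCut codebases:

```python
import numpy as np

def mask_iou(pred, gt):
    # Intersection-over-union between two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def evaluate(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    ious = np.array([mask_iou(p, g) for p, g in zip(preds, gts)])
    mean_iou = float(ious.mean())
    # Pr@t: fraction of examples whose IoU meets the threshold t.
    precisions = {t: float((ious >= t).mean()) for t in thresholds}
    return mean_iou, precisions

# Toy example: one perfect mask, one half-overlapping mask.
gt1 = np.array([[True, False], [False, True]])
pred1 = gt1.copy()                                # IoU = 1.0
gt2 = np.array([[True, False], [False, False]])
pred2 = np.array([[True, True], [False, False]])  # inter 1, union 2 -> IoU 0.5
mean_iou, prec = evaluate([pred1, pred2], [gt1, gt2])
print(mean_iou, prec)  # prints 0.75 {0.5: 1.0, 0.7: 0.5, 0.9: 0.5}
```

The high-threshold columns (Pr@0.9 = 11.9 on PhraseCut) illustrate why Mean IoU alone can be misleading: most predictions clear the 0.5 bar, but very few masks are near-pixel-perfect.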

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)