Description
MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. It uses a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets with explicit alignment between phrases in the text and objects in the image. It is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension, and referring expression segmentation.
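The key idea of early fusion is that image-region features and text-token features are concatenated into a single sequence before the transformer encoder, so every image token can attend to every word (and vice versa) from the very first layer. The toy sketch below illustrates this with a single, parameter-free self-attention pass in plain Python; the token values are made-up placeholders, not actual MDETR features, and real MDETR uses learned projections, positional encodings, and multiple multi-head layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Single-head attention where queries, keys, and values
    are the raw token vectors (no learned weights, for illustration)."""
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(len(q))])
    return out

# Hypothetical toy features: 2 image-region tokens and 2 text tokens (dim 3).
image_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens  = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

# Early fusion: one concatenated sequence, so attention is cross-modal
# from the first layer onward.
fused = image_tokens + text_tokens
attended = self_attention(fused)
```

In the real model, the image tokens come from a convolutional backbone and the text tokens from a pre-trained language encoder (RoBERTa in the paper), with each modality linearly projected into a shared embedding space before concatenation.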
Papers Using This Method
- Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures (2025-05-16)
- Seeing More with Less: Human-like Representations in Vision Models (2025-01-01)
- A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training (2024-08-20)
- ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection (2024-06-03)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models (2023-11-05)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions (2023-10-27)
- Dynamic Inference With Grounding Based Vision and Language Models (2023-01-01)
- Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding (2022-09-28)
- Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos (2022-09-21)
- Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds (2021-12-16)
- Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding (2021-06-20)
- Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization (2021-06-11)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding (2021-04-26)