Description
MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. It uses a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets with explicit alignment between phrases in the text and objects in the image. It is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension, and referring expression segmentation.
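The key idea of early fusion is that image-region features and text-token features are concatenated into a single sequence before the transformer encoder, so every image token can attend to every word (and vice versa) from the very first layer. The toy sketch below illustrates this with a single, parameter-free self-attention pass in plain Python; the token values are made-up placeholders, not actual MDETR features, and real MDETR uses learned projections, positional encodings, and multiple multi-head layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Single-head attention where queries, keys, and values
    are the raw token vectors (no learned weights, for illustration)."""
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(len(q))])
    return out

# Hypothetical toy features: 2 image-region tokens and 2 text tokens (dim 3).
image_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens  = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

# Early fusion: one concatenated sequence, so attention is cross-modal
# from the first layer onward.
fused = image_tokens + text_tokens
attended = self_attention(fused)
```

In the real model, the image tokens come from a convolutional backbone and the text tokens from a pre-trained language encoder (RoBERTa in the paper), with each modality linearly projected into a shared embedding space before concatenation.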
Papers Using This Method
- Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures (2025-05-16)
- Seeing More with Less: Human-like Representations in Vision Models (2025-01-01)
- A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training (2024-08-20)
- ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection (2024-06-03)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models (2023-11-05)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions (2023-10-27)
- Dynamic Inference With Grounding Based Vision and Language Models (2023-01-01)
- Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding (2022-09-28)
- Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos (2022-09-21)
- Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds (2021-12-16)
- Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding (2021-06-20)
- Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization (2021-06-11)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding (2021-04-26)