Description
Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained on aligned image-text pairs, that performs targeted detection from human-understandable natural language text queries. It uses multi-scale image features and deformable convolutions, with late multi-modal fusion. The authors demonstrate that MAVL performs strongly as a class-agnostic object detector when queried with general natural language commands such as "all objects" or "all entities".
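To make the idea of late fusion over multi-scale features more concrete, the following is a minimal toy sketch in plain Python. It is not MAVL's implementation: the function names (`pool_multiscale`, `late_fusion_scores`), the average pooling, and the cosine-similarity scoring are all assumptions chosen for illustration, and the hand-written vectors stand in for the outputs of real image and text encoders. The only point it shows is that each modality is processed on its own and the two are combined only at the final scoring step.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_multiscale(feature_maps):
    """Hypothetical pooling: average the vectors within each scale,
    then average the per-scale results into one visual embedding."""
    per_scale = []
    for fmap in feature_maps:  # fmap: list of feature vectors at one scale
        dim = len(fmap[0])
        per_scale.append(
            [sum(vec[i] for vec in fmap) / len(fmap) for i in range(dim)]
        )
    dim = len(per_scale[0])
    return [sum(s[i] for s in per_scale) / len(per_scale) for i in range(dim)]


def late_fusion_scores(region_features, text_embedding):
    """Score each candidate region against the text query.

    The image side and the text side never interact until this final
    cosine-similarity step, which is what makes the fusion 'late'.
    """
    text = l2_normalize(text_embedding)
    scores = []
    for feature_maps in region_features:
        visual = l2_normalize(pool_multiscale(feature_maps))
        scores.append(sum(a * b for a, b in zip(visual, text)))
    return scores


# Toy usage: two candidate regions, each with features at two scales.
region_a = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]]]
region_b = [[[0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
query = [1.0, 0.0]  # stand-in for an encoded text query such as "all objects"

scores = late_fusion_scores([region_a, region_b], query)
print(scores)  # region_a aligns with the query, region_b does not
```

In this toy setup, a region would be kept as a detection if its score exceeds some threshold; the threshold and the scoring rule are illustrative choices, not details taken from the MAVL paper.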
Papers Using This Method
MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation (2025-05-24)
Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets (2025-05-21)
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection (2022-07-07)
Class-agnostic Object Detection with Multi-modal Transformer (2021-11-22)