
MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen

2020-04-24 · ICLR 2021 · Question Answering · Object Counting · Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models, which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the residual bottleneck design, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically, and only needs a single forward pass during inference. Nevertheless, MoVie shows strong performance for counting: 1) advancing the state of the art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) helping secure first place in the 2020 VQA Challenge when integrated as a module for 'number'-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
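The core idea the abstract describes — modulated convolutions that fuse the query and the image locally inside a residual bottleneck — can be illustrated with a minimal sketch. This is an assumed FiLM-style NumPy illustration, not the authors' implementation: a query embedding is projected to per-channel scale and shift parameters that modulate the image feature map, followed by a nonlinearity and a residual connection (all variable names here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def modulated_bottleneck(feat, query, w_gamma, w_beta):
    """Sketch of query-modulated convolution in a residual bottleneck.

    feat:    (C, H, W) image feature map
    query:   (D,) query embedding (e.g. pooled question features)
    w_gamma: (D, C) projection predicting per-channel scales
    w_beta:  (D, C) projection predicting per-channel shifts
    """
    gamma = query @ w_gamma                      # (C,) per-channel scale
    beta = query @ w_beta                        # (C,) per-channel shift
    # Modulate the feature map locally: every spatial location of a
    # channel is scaled/shifted by the query-predicted parameters.
    modulated = gamma[:, None, None] * feat + beta[:, None, None]
    activated = np.maximum(modulated, 0.0)       # ReLU
    return feat + activated                      # residual connection

C, H, W, D = 8, 4, 4, 16
feat = rng.standard_normal((C, H, W))
query = rng.standard_normal(D)
w_gamma = rng.standard_normal((D, C)) * 0.1
w_beta = rng.standard_normal((D, C)) * 0.1

out = modulated_bottleneck(feat, query, w_gamma, w_beta)
print(out.shape)  # (8, 4, 4)
```

Because the fusion happens channel-wise on the feature map, the block only needs a single forward pass at inference, which is the efficiency property the abstract highlights.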

Results

Task            | Dataset          | Metric   | Value | Model
----------------|------------------|----------|-------|--------------
Object Counting | HowMany-QA       | Accuracy | 64    | MoVie-ResNeXt
Object Counting | HowMany-QA       | RMSE     | 2.3   | MoVie-ResNeXt
Object Counting | HowMany-QA       | Accuracy | 61.2  | MoVie
Object Counting | HowMany-QA       | RMSE     | 2.36  | MoVie
Object Counting | TallyQA-Complex  | Accuracy | 56.8  | MoVie-ResNeXt
Object Counting | TallyQA-Complex  | RMSE     | 1.43  | MoVie-ResNeXt
Object Counting | TallyQA-Complex  | Accuracy | 54.1  | MoVie
Object Counting | TallyQA-Complex  | RMSE     | 1.52  | MoVie
Object Counting | TallyQA-Simple   | Accuracy | 74.9  | MoVie-ResNeXt
Object Counting | TallyQA-Simple   | RMSE     | 1     | MoVie-ResNeXt
Object Counting | TallyQA-Simple   | Accuracy | 70.8  | MoVie
Object Counting | TallyQA-Simple   | RMSE     | 1.09  | MoVie
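The two metrics in the table follow the standard definitions for counting benchmarks: accuracy is the fraction of exact count matches, and RMSE is the root mean squared error over predicted counts. A small stdlib-only sketch (the count arrays are toy data, not benchmark outputs):

```python
import math

def counting_metrics(pred, gt):
    """Accuracy = fraction of exact matches; RMSE = sqrt of mean squared error."""
    n = len(gt)
    accuracy = sum(p == g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    return accuracy, rmse

# Toy example: one prediction off by 1 out of four questions.
acc, rmse = counting_metrics([3, 5, 2, 7], [3, 4, 2, 7])
print(acc, rmse)  # 0.75 0.5
```

Note the two metrics are complementary: accuracy rewards exact counts only, while RMSE penalizes predictions in proportion to how far off they are, which is why a model can rank differently on the two columns.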

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)