Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo

Published: 2023-12-27 · Tasks: Zero-Shot Counting, Object Counting
Links: Paper · PDF · Code (official)

Abstract

Zero-Shot Object Counting (ZSOC) aims to count instances of arbitrary referred classes in a query image without human-annotated exemplars. To address ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars, then counting. However, this sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, the Visual-Language Baseline (VLBase), is proposed, which exploits the implicit association between the semantic and patch embeddings of CLIP. VLBase is then extended to the Visual-Language Counter (VLCounter) by incorporating three modules that tailor it for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, a Learnable Affine Transformation (LAT) translates the semantic-patch similarity map into a form suited to the counting task. Lastly, layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connections (SaSC) to preserve generalization to unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end VLCounter framework.
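The core quantity the abstract builds on is the semantic-patch similarity map: a cosine similarity between the CLIP text embedding of the target class and each image-patch embedding, which LAT then rescales for counting. A minimal illustrative sketch of that computation is below; the function and class names (`similarity_map`, `LearnableAffine`) and the fixed scale/shift values are assumptions for illustration, not the authors' actual implementation, which learns these parameters end-to-end.

```python
import numpy as np

def similarity_map(patch_emb, text_emb):
    """Cosine similarity between one text embedding and every patch embedding."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # shape: (num_patches,), values in [-1, 1]

class LearnableAffine:
    """Stand-in for LAT: an affine rescaling (scale * sim + shift) of the
    similarity map. In VLCounter these parameters would be learned; here
    they are fixed constants purely for illustration."""
    def __init__(self, scale=1.0, shift=0.0):
        self.scale, self.shift = scale, shift

    def __call__(self, sim):
        return self.scale * sim + self.shift

# Toy inputs: 14x14 = 196 ViT patch embeddings and one CLIP text embedding,
# both of dimension 512 (typical CLIP sizes; random values for the sketch).
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 512))
text = rng.normal(size=512)

sim = similarity_map(patches, text)
lat = LearnableAffine(scale=2.0, shift=0.5)
out = lat(sim)
print(out.shape)  # (196,)
```

The rescaled map `out` would then be decoded into a density map whose sum gives the predicted count; SPT and SaSC operate inside the encoder and decoder, respectively, and are omitted here.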

Results

Task             Dataset  Metric  Value  Model
Object Counting  CARPK    MAE     6.46   VLCounter
Object Counting  CARPK    RMSE    8.68   VLCounter

Related Papers

Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework (2025-07-11)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models (2025-06-03)
Improving Contrastive Learning for Referring Expression Counting (2025-05-28)
Expanding Zero-Shot Object Counting with Rich Prompts (2025-05-21)
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition (2025-05-21)
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning (2025-05-17)
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? (2025-05-17)
Learning What NOT to Count (2025-04-16)