TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Vision Transformer Off-the-Shelf: A Surprising Baseline fo...

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu

2023-05-08Object Counting
PaperPDFCode(official)

Abstract

Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of query image and exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate the loss of the scale and the order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available.

Results

TaskDatasetMetricValueModel
Object CountingFSC147MAE(test)9.13CACViT
Object CountingFSC147MAE(val)10.63CACViT
Object CountingFSC147RMSE(test)48.96CACViT
Object CountingFSC147RMSE(val)37.95CACViT

Related Papers

Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework2025-07-11OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models2025-06-03Improving Contrastive Learning for Referring Expression Counting2025-05-28InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition2025-05-21Expanding Zero-Shot Object Counting with Rich Prompts2025-05-21VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning2025-05-17Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?2025-05-17Learning What NOT to Count2025-04-16