Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model that counts objects from arbitrary semantic categories, given an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) we introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between patches and the given "exemplars", via the attention mechanism; (2) we adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) we propose a simple, scalable pipeline for synthesising training images with a large number of instances, or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) we conduct thorough ablation studies on the large-scale counting benchmark FSC-147, and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings.
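The attention-based similarity described above can be illustrated as cross-attention between image-patch tokens (queries) and exemplar tokens (keys/values). The following is a minimal NumPy sketch; all shapes, names, and values are illustrative, not the authors' implementation:

```python
import numpy as np

def cross_attention(patches, exemplars):
    """Attend image-patch queries over exemplar keys/values.

    patches:   (N, d) image-patch tokens (queries)
    exemplars: (M, d) exemplar tokens (keys and values)
    returns:   (N, d) patch tokens enriched with exemplar information
    """
    d = patches.shape[-1]
    scores = patches @ exemplars.T / np.sqrt(d)            # (N, M) scaled similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over exemplars
    return weights @ exemplars                             # (N, d) weighted combination

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 8))    # 16 patch tokens, dim 8
exemplars = rng.standard_normal((3, 8))   # 3 "exemplar" tokens
out = cross_attention(patches, exemplars)
print(out.shape)  # (16, 8)
```

In the zero-shot setting, where no exemplars are given, the model must rely on self-attention among image patches alone; the same mechanism applies with `patches` serving as both queries and keys/values.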
| Task | Dataset | Setting | Metric | Value | Model |
|---|---|---|---|---|---|
| Object Counting | FSC147 | few-shot | MAE (val) | 13.13 | CounTR |
| Object Counting | FSC147 | few-shot | MAE (test) | 11.95 | CounTR |
| Object Counting | FSC147 | few-shot | RMSE (val) | 49.83 | CounTR |
| Object Counting | FSC147 | few-shot | RMSE (test) | 91.23 | CounTR |
| Object Counting | FSC147 | zero-shot | MAE (val) | 18.07 | CounTR |
| Object Counting | FSC147 | zero-shot | MAE (test) | 14.71 | CounTR |
| Object Counting | FSC147 | zero-shot | RMSE (val) | 71.84 | CounTR |
| Object Counting | FSC147 | zero-shot | RMSE (test) | 106.87 | CounTR |
| Object Counting | CARPK | – | MAE | 5.75 | CounTR |
| Object Counting | CARPK | – | RMSE | 7.45 | CounTR |
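For reference, the MAE and RMSE metrics reported above are standard per-image counting errors; a minimal sketch (the predicted and ground-truth counts below are illustrative):

```python
import math

def mae(pred, gt):
    # Mean Absolute Error: average |predicted count - ground-truth count| per image
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    # Root Mean Squared Error: penalises large per-image count errors more heavily
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

pred = [12, 48, 103]   # illustrative predicted counts
gt   = [10, 50, 100]   # illustrative ground-truth counts
print(mae(pred, gt))   # ≈ 2.33
print(rmse(pred, gt))  # ≈ 2.38
```

Because RMSE squares the errors, a few images with very large counts (common in FSC-147) can dominate it, which is why the RMSE values in the table are much larger than the corresponding MAE values.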