Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

2022-04-19 · CVPR 2022
Tasks: 3D Human Pose Estimation · 2D Human Pose Estimation · Pose Estimation · Clustering
Links: Paper · PDF · Code (official)

Abstract

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git
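The paper's central mechanism — merging grid tokens into clusters of flexible shape and size, weighted by importance — can be illustrated with a toy sketch. This is not the paper's exact algorithm (TCFormer's progressive clustering is more sophisticated than plain k-means, and its token importance scores are learned); the simplified stand-in below uses k-means assignment with importance-weighted merging, and all function names are illustrative:

```python
import numpy as np

def merge_tokens(tokens, scores, num_clusters, iters=10):
    """Toy sketch of token merging by clustering: group the N input
    tokens into num_clusters clusters and merge each cluster into a
    single token via importance-weighted averaging.

    tokens:  (N, C) token feature vectors
    scores:  (N,)   per-token importance weights (assumed given here;
                    TCFormer learns them)
    returns: (num_clusters, C) merged token features
    """
    rng = np.random.default_rng(0)
    # initialize cluster centers from randomly chosen tokens
    centers = tokens[rng.choice(len(tokens), num_clusters, replace=False)]
    for _ in range(iters):
        # assign every token to its nearest center
        dists = np.linalg.norm(tokens[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        # recompute each center as the importance-weighted mean of its tokens
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():
                w = scores[mask] / scores[mask].sum()
                centers[k] = (w[:, None] * tokens[mask]).sum(axis=0)
    return centers

# usage: merge 196 tokens (a 14x14 grid) down to 49 merged tokens
tokens = np.random.default_rng(1).normal(size=(196, 64))
scores = np.ones(196)          # uniform importance for the demo
merged = merge_tokens(tokens, scores, num_clusters=49)
print(merged.shape)            # (49, 64)
```

Because cluster membership is driven by feature similarity rather than grid position, merged tokens are free to take irregular shapes — many small clusters on the human body, a few large ones on the background — which is the intuition the abstract describes.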

Results

Task                     | Dataset        | Metric   | Value | Model
3D Human Pose Estimation | 3DPW           | MPJPE    | 80.6  | TCFormer
3D Human Pose Estimation | 3DPW           | PA-MPJPE | 49.3  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | WB AP    | 64.2  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Body AP  | 71.8  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Face AP  | 79.0  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Foot AP  | 74.4  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Hand AP  | 61.4  | TCFormer
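The 3DPW numbers are MPJPE and PA-MPJPE, conventionally reported in millimetres. As a reference for what these metrics measure, here is a minimal NumPy sketch (function names are mine; the Procrustes alignment uses the standard Kabsch/SVD solution):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joints, each of shape (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align pred to gt (optimal
    translation, rotation, and scale) before measuring error, so only
    the non-rigid pose error remains."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g            # center both point sets
    # optimal rotation from the SVD of the cross-covariance (Kabsch)
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (P ** 2).sum()         # optimal uniform scale
    aligned = scale * P @ R.T + mu_g
    return mpjpe(aligned, gt)

# usage: a prediction that is a scaled, shifted copy of the ground
# truth has large MPJPE but near-zero PA-MPJPE
gt = np.random.default_rng(2).normal(size=(17, 3))
pred = 1.5 * gt + np.array([0.0, 0.0, 100.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

PA-MPJPE is always at most MPJPE on the same prediction, which is why the 3DPW table shows 49.3 against 80.6: alignment removes global rotation, translation, and scale error before scoring.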

Related Papers

- Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)