Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

2022-04-19 · CVPR 2022
Tasks: 3D Human Pose Estimation · 2D Human Pose Estimation · Pose Estimation · Clustering
Links: Paper · PDF · Code (official)

Abstract

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git
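The paper's central mechanism — merging grid tokens into clusters of flexible shape and size, weighted by importance — can be illustrated with a toy sketch. This is not the paper's exact algorithm (TCFormer's progressive clustering is more sophisticated than plain k-means, and its token importance scores are learned); the simplified stand-in below uses k-means assignment with importance-weighted merging, and all function names are illustrative:

```python
import numpy as np

def merge_tokens(tokens, scores, num_clusters, iters=10):
    """Toy sketch of token merging by clustering: group the N input
    tokens into num_clusters clusters and merge each cluster into a
    single token via importance-weighted averaging.

    tokens:  (N, C) token feature vectors
    scores:  (N,)   per-token importance weights (assumed given here;
                    TCFormer learns them)
    returns: (num_clusters, C) merged token features
    """
    rng = np.random.default_rng(0)
    # initialize cluster centers from randomly chosen tokens
    centers = tokens[rng.choice(len(tokens), num_clusters, replace=False)]
    for _ in range(iters):
        # assign every token to its nearest center
        dists = np.linalg.norm(tokens[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        # recompute each center as the importance-weighted mean of its tokens
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():
                w = scores[mask] / scores[mask].sum()
                centers[k] = (w[:, None] * tokens[mask]).sum(axis=0)
    return centers

# usage: merge 196 tokens (a 14x14 grid) down to 49 merged tokens
tokens = np.random.default_rng(1).normal(size=(196, 64))
scores = np.ones(196)          # uniform importance for the demo
merged = merge_tokens(tokens, scores, num_clusters=49)
print(merged.shape)            # (49, 64)
```

Because cluster membership is driven by feature similarity rather than grid position, merged tokens are free to take irregular shapes — many small clusters on the human body, a few large ones on the background — which is the intuition the abstract describes.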

Results

Task                     | Dataset        | Metric   | Value | Model
3D Human Pose Estimation | 3DPW           | MPJPE    | 80.6  | TCFormer
3D Human Pose Estimation | 3DPW           | PA-MPJPE | 49.3  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | WB AP    | 64.2  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Body AP  | 71.8  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Face AP  | 79.0  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Foot AP  | 74.4  | TCFormer
2D Human Pose Estimation | COCO-WholeBody | Hand AP  | 61.4  | TCFormer
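The 3DPW numbers are MPJPE and PA-MPJPE, conventionally reported in millimetres. As a reference for what these metrics measure, here is a minimal NumPy sketch (function names are mine; the Procrustes alignment uses the standard Kabsch/SVD solution):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joints, each of shape (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align pred to gt (optimal
    translation, rotation, and scale) before measuring error, so only
    the non-rigid pose error remains."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g            # center both point sets
    # optimal rotation from the SVD of the cross-covariance (Kabsch)
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (P ** 2).sum()         # optimal uniform scale
    aligned = scale * P @ R.T + mu_g
    return mpjpe(aligned, gt)

# usage: a prediction that is a scaled, shifted copy of the ground
# truth has large MPJPE but near-zero PA-MPJPE
gt = np.random.default_rng(2).normal(size=(17, 3))
pred = 1.5 * gt + np.array([0.0, 0.0, 100.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

PA-MPJPE is always at most MPJPE on the same prediction, which is why the 3DPW table shows 49.3 against 80.6: alignment removes global rotation, translation, and scale error before scoring.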

Related Papers

- Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)
- SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation (2025-07-16)