X-Pose: Detecting Any Keypoints

Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang

2023-10-12

2D Human Pose Estimation, Multi-Person Pose Estimation, Keypoint Detection, Contrastive Learning, Animal Pose Estimation, 2D Pose Estimation

Abstract

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Trained on UniKPT, X-Pose effectively aligns text-to-keypoint and image-to-keypoint due to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK, and 7.0 AP compared to state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods, respectively, each in a fair setting. More importantly, the in-the-wild test demonstrates X-Pose's strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications. Our code and dataset are available at https://github.com/IDEA-Research/X-Pose.
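The abstract attributes the text-to-keypoint and image-to-keypoint alignment to cross-modality contrastive learning but does not spell out the loss. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between two sets of paired embeddings. It is not the authors' implementation: the function names, the temperature value, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(prompt_emb, keypoint_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over paired embeddings.

    prompt_emb[i] and keypoint_emb[i] are assumed to describe the same
    keypoint; every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is an assumed, commonly used default.
    """
    p = [l2_normalize(v) for v in prompt_emb]
    k = [l2_normalize(v) for v in keypoint_emb]
    # Cosine similarity between every prompt and every keypoint embedding.
    logits = [
        [sum(a * b for a, b in zip(pi, kj)) / temperature for kj in k]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the prompt-to-keypoint and keypoint-to-prompt directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy example: two prompts and two keypoint embeddings (hypothetical values).
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each keypoint matches its prompt
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = symmetric_contrastive_loss(prompts, aligned)
loss_swapped = symmetric_contrastive_loss(prompts, swapped)
```

In this toy setup the loss is close to zero for the aligned pairs and large for the swapped ones, which is the behaviour a contrastive objective relies on to pull matching prompt and keypoint embeddings together.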

Results

Task	Dataset	Metric	Value	Model
Pose Estimation	COCO (Common Objects in Context)	AP	76.8	UniPose
Pose Estimation	AP-10K	AP	79.2	UniPose
2D Pose Estimation	Vinegar Fly	Mean PCK@0.2	99.9	UniPose
2D Pose Estimation	300W	Mean PCK@0.2	99.4	UniPose
2D Pose Estimation	MacaquePose	AP	79.4	UniPose
2D Pose Estimation	Desert Locust	Mean PCK@0.2	99.9	UniPose
2D Pose Estimation	Animal Kingdom	Mean PCK@0.2	96.1	UniPose
2D Pose Estimation	Animal Kingdom	PCK@0.05	71.5	UniPose
3D	COCO (Common Objects in Context)	AP	76.8	UniPose
3D	AP-10K	AP	79.2	UniPose
Animal Pose Estimation	AP-10K	AP	79.2	UniPose
2D Human Pose Estimation	Human-Art	AP	75.9	UniPose
2D Classification	Vinegar Fly	Mean PCK@0.2	99.9	UniPose
2D Classification	300W	Mean PCK@0.2	99.4	UniPose
2D Classification	MacaquePose	AP	79.4	UniPose
2D Classification	Desert Locust	Mean PCK@0.2	99.9	UniPose
2D Classification	Animal Kingdom	Mean PCK@0.2	96.1	UniPose
2D Classification	Animal Kingdom	PCK@0.05	71.5	UniPose
Multi-Person Pose Estimation	COCO (Common Objects in Context)	AP	76.8	UniPose
1 Image, 2*2 Stitchi	COCO (Common Objects in Context)	AP	76.8	UniPose
1 Image, 2*2 Stitchi	AP-10K	AP	79.2	UniPose
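Several rows above report PCK at a threshold (PCK@0.2, PCK@0.05). As a reading aid, here is a minimal sketch of how PCK (Percentage of Correct Keypoints) is commonly computed: a predicted keypoint counts as correct when its distance to the ground truth is at most the threshold times a reference length. The reference length used here (a single bounding-box size per instance) is an assumption; benchmarks differ on this convention, and the function name and sample coordinates are hypothetical.

```python
from math import hypot


def pck(pred, gt, bbox_size, threshold=0.2, visible=None):
    """Fraction of visible keypoints predicted within threshold * bbox_size.

    pred, gt:   lists of (x, y) tuples in the same keypoint order.
    bbox_size:  reference length, assumed here to be the longer side of the
                instance bounding box (conventions vary by benchmark).
    visible:    optional list of booleans; unlabeled keypoints are skipped.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if visible is None:
        visible = [True] * len(gt)

    limit = threshold * bbox_size
    hits = 0
    counted = 0
    for (px, py), (gx, gy), is_visible in zip(pred, gt, visible):
        if not is_visible:
            continue
        counted += 1
        if hypot(px - gx, py - gy) <= limit:
            hits += 1
    return hits / counted if counted else 0.0


# Hypothetical instance with three keypoints inside a 100-pixel bounding box.
gt_kpts = [(10, 10), (50, 50), (90, 20)]
pred_kpts = [(16, 18), (50, 80), (91, 20)]  # errors of 10, 30, and 1 pixels

print(round(pck(pred_kpts, gt_kpts, bbox_size=100, threshold=0.2), 3))   # 0.667
print(round(pck(pred_kpts, gt_kpts, bbox_size=100, threshold=0.05), 3))  # 0.333
```

The stricter 0.05 threshold rejects the 10-pixel error that the 0.2 threshold accepts, which is why PCK@0.05 scores in the table sit well below the corresponding PCK@0.2 scores.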
