Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, Lei Zhang

2023-02-03 · Tasks: Human Detection, Regression, 2D Human Pose Estimation, Pose Estimation, Multi-Person Pose Estimation, Keypoint Detection

Abstract

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies contextual learning between human-level (global) and keypoint-level (local) information. Unlike previous one-stage methods, ED-Pose reconsiders this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It provides a good initialization for the subsequent keypoint detection, making the training process converge quickly. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem, learning both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. Overall, ED-Pose is conceptually simple, requiring no post-processing or dense heatmap supervision, and it demonstrates both effectiveness and efficiency compared with two-stage and one-stage methods. Notably, explicit box detection boosts pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and achieves state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.
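The abstract's central idea is to treat each keypoint as a small box so the detector can learn local context around it, supervised with a plain L1 regression loss rather than dense heatmaps. The sketch below illustrates that framing in minimal form; the box scale `s` and the function names are illustrative assumptions, not the paper's official implementation.

```python
# Minimal sketch of the "keypoint as box" view from ED-Pose, assuming
# normalized image coordinates. Box side length `s` and these helper
# names are assumptions for illustration, not the official code.

def keypoint_to_box(x, y, s=0.05):
    """Expand a keypoint (x, y) into an (x1, y1, x2, y2) box of side `s`,
    so a decoder can attend to local context around the keypoint."""
    half = s / 2.0
    return (x - half, y - half, x + half, y + half)

def l1_regression_loss(pred, target):
    """Plain L1 loss averaged over matched coordinates, the kind of
    regression supervision the abstract contrasts with heatmaps."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

if __name__ == "__main__":
    # A keypoint at image center becomes a small centered box.
    print(keypoint_to_box(0.5, 0.5, s=0.1))
    # L1 distance between predicted and ground-truth coordinates.
    print(l1_regression_loss([0.5, 0.5], [0.4, 0.6]))
```

In the actual model, both the human boxes and these keypoint boxes are predicted and refined jointly by the two decoders, which is what removes the need for post-processing.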

Results

Task | Dataset | Metric | Value | Model
Pose Estimation | CrowdPose | AP Easy | 83 | ED-Pose (Swin-L)
Pose Estimation | CrowdPose | AP Hard | 68.3 | ED-Pose (Swin-L)
Pose Estimation | CrowdPose | AP Medium | 77.3 | ED-Pose (Swin-L)
Pose Estimation | CrowdPose | mAP @0.5:0.95 | 76.6 | ED-Pose (Swin-L)
3D | CrowdPose | AP Easy | 83 | ED-Pose (Swin-L)
3D | CrowdPose | AP Hard | 68.3 | ED-Pose (Swin-L)
3D | CrowdPose | AP Medium | 77.3 | ED-Pose (Swin-L)
3D | CrowdPose | mAP @0.5:0.95 | 76.6 | ED-Pose (Swin-L)
2D Human Pose Estimation | Human-Art | AP | 0.723 | ED-Pose (R50)
Multi-Person Pose Estimation | CrowdPose | AP Easy | 83 | ED-Pose (Swin-L)
Multi-Person Pose Estimation | CrowdPose | AP Hard | 68.3 | ED-Pose (Swin-L)
Multi-Person Pose Estimation | CrowdPose | AP Medium | 77.3 | ED-Pose (Swin-L)
Multi-Person Pose Estimation | CrowdPose | mAP @0.5:0.95 | 76.6 | ED-Pose (Swin-L)
1 Image, 2*2 Stitchi | CrowdPose | AP Easy | 83 | ED-Pose (Swin-L)
1 Image, 2*2 Stitchi | CrowdPose | AP Hard | 68.3 | ED-Pose (Swin-L)
1 Image, 2*2 Stitchi | CrowdPose | AP Medium | 77.3 | ED-Pose (Swin-L)
1 Image, 2*2 Stitchi | CrowdPose | mAP @0.5:0.95 | 76.6 | ED-Pose (Swin-L)

Related Papers

Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression (2025-07-20)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
Neural Network-Guided Symbolic Regression for Interpretable Descriptor Discovery in Perovskite Catalysts (2025-07-16)
Imbalanced Regression Pipeline Recommendation (2025-07-16)