Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Wiktor Mucha, Martin Kampel

2024-04-14 · Skeleton Based Action Recognition · Video Understanding · Action Recognition · Hand Pose Estimation

Paper · PDF · Code (official)

Abstract

Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research on 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses capable of capturing a single RGB image. Our study aims to fill this gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation and EffHandEgoNet, tailored to an egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture built from 2D hand and object poses, which combines EffHandEgoNet with a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust input for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
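The abstract describes a pipeline in which per-frame 2D hand and object keypoints are fed to a transformer-style temporal model that outputs an action label. The sketch below is a hypothetical NumPy illustration of that general idea, not the authors' code: the weight matrices, dimensions, and single attention layer are placeholder assumptions standing in for a trained network.

```python
# Hypothetical sketch (NOT the paper's implementation): flattened 2D keypoints
# per frame -> self-attention over time -> pooled action logits.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classify_pose_sequence(poses, d_model=32, n_actions=36):
    """poses: (T, K*2) array, T frames of K flattened 2D keypoints."""
    T, D = poses.shape
    # Random weights stand in for trained parameters.
    W_embed = rng.standard_normal((D, d_model)) / np.sqrt(D)
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                     for _ in range(3))
    W_out = rng.standard_normal((d_model, n_actions)) / np.sqrt(d_model)

    x = poses @ W_embed                          # (T, d_model) frame tokens
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model))   # (T, T) temporal attention
    x = attn @ v                                 # attend across frames
    return x.mean(axis=0) @ W_out                # pool over time -> (n_actions,)

# 16 frames, 42 keypoints (two hands + object corners) with (x, y) each.
logits = classify_pose_sequence(rng.standard_normal((16, 84)))
print(logits.shape)  # (36,)
```

The keypoint count (42) and action count (36) are illustrative choices; the actual model, training, and pose-estimation front end (EffHandEgoNet) are described in the paper.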

Results

Task | Dataset | Metric | Value | Model
Video | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Video | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Temporal Action Localization | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Temporal Action Localization | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Zero-Shot Learning | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Zero-Shot Learning | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 91.32 | EffHandEgoNet
Activity Recognition | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Activity Recognition | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Action Localization | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Action Localization | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Action Detection | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Action Detection | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
3D Action Recognition | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
3D Action Recognition | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet
Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 91.32 | EffHandEgoNet
Action Recognition | First-Person Hand Action Benchmark | 1:1 Accuracy | 94.43 | EffHandEgoNet
Action Recognition | H2O (2 Hands and Objects) | Accuracy | 91.32 | EffHandEgoNet

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)