Yilin Wen, Hao Pan, Lei Yang, Jia Pan, Taku Komura, Wenping Wang
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Noticing that hand pose estimation and action recognition differ in temporal granularity yet are semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O, and extensive ablation studies verify our design choices.
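The cascaded two-encoder hierarchy can be sketched in PyTorch as below. This is a minimal illustration of the design described in the abstract, not the released HTT implementation: module names, feature dimensions, the short-window length, class counts, and the mean-pooling action readout are all assumptions, and the CNN backbone and positional encodings are omitted for brevity.

```python
# Minimal sketch of the cascaded hierarchy: a short-window encoder for per-frame
# hand pose, followed by a long-span encoder for action recognition.
# All names, dimensions, and window sizes are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalTemporalTransformer(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45,
                 num_objects=26, short_window=16, depth=2, heads=8):
        super().__init__()
        self.short_window = short_window
        # Encoder 1: attends only within a short temporal window (pose cue).
        self.pose_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                       batch_first=True), num_layers=depth)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)   # per-frame 3D joints
        self.object_head = nn.Linear(feat_dim, num_objects)    # per-frame object label
        # Encoder 2: attends over the full clip (action cue).
        self.action_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                       batch_first=True), num_layers=depth)
        self.action_head = nn.Linear(feat_dim, num_actions)    # one label per clip

    def forward(self, frame_feats):  # (B, T, feat_dim), e.g. CNN backbone features
        B, T, C = frame_feats.shape
        w = self.short_window
        assert T % w == 0, "pad the clip so T is a multiple of short_window"
        # Split the clip into non-overlapping short windows so the first encoder
        # only exploits short-term temporal context.
        short = self.pose_encoder(frame_feats.reshape(B * T // w, w, C))
        pose = self.pose_head(short).reshape(B, T, -1, 3)       # (B, T, J, 3)
        obj_logits = self.object_head(short).reshape(B, T, -1)  # (B, T, num_objects)
        # Cascade: the second encoder consumes the pose-refined frame tokens and
        # pools them over the whole clip to classify the action.
        long = self.action_encoder(short.reshape(B, T, C))
        action_logits = self.action_head(long.mean(dim=1))      # (B, num_actions)
        return pose, obj_logits, action_logits

# Usage: a 64-frame clip of 512-d backbone features.
feats = torch.randn(2, 64, 512)
pose, obj, act = HierarchicalTemporalTransformer()(feats)
print(pose.shape, obj.shape, act.shape)  # (2,64,21,3) (2,64,26) (2,45)
```

Running the sketch yields per-frame 3D joints and object logits but a single action prediction per clip, mirroring the two temporal granularities the hierarchy is built around.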
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 86.36 | HTT |