Oluwaleke Yusuf, Maki Habib, Mohamed Moustafa
Hand Gesture Recognition (HGR) enables intuitive human-computer interaction in a variety of real-world contexts. However, existing frameworks often struggle to meet the real-time requirements essential for practical HGR applications. This study introduces a robust, skeleton-based framework for dynamic HGR that reduces the recognition of dynamic hand gestures to a static image classification task, lowering both hardware and computational demands. The framework uses a data-level fusion technique to encode 3D skeleton data from dynamic gestures into static RGB spatiotemporal images, and incorporates a specialized end-to-end Ensemble Tuner (e2eET) Multi-Stream CNN architecture that exploits the semantic connections between data representations while minimizing computational cost. Evaluated on five benchmark datasets (SHREC'17, DHG-14/28, FPHA, LMDHG, and CNR), the framework achieved performance competitive with the state of the art. Its suitability for real-time HGR was further demonstrated through deployment on standard consumer PC hardware, with low latency and minimal resource usage in real-world settings. This successful deployment underscores the framework's potential for real-time applications in fields such as virtual/augmented reality, ambient intelligence, and assistive technologies, providing a scalable and efficient solution for dynamic gesture recognition.
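The core idea of the data-level fusion step can be illustrated with a minimal sketch: rasterize a sequence of 3D hand-joint positions into a single RGB image in which pixel color encodes temporal progression. The function name `encode_gesture_image`, the 64×64 resolution, and the blue-to-red color mapping are illustrative assumptions, not the paper's exact encoding.

```python
import numpy as np

def encode_gesture_image(skeleton_seq, size=64):
    """Rasterize a dynamic gesture (frames x joints x 3) into a static
    RGB image: joints are plotted in the x-y plane, and pixel color
    encodes temporal order (blue = early, red = late). Illustrative
    sketch only, not the paper's actual encoding."""
    frames = skeleton_seq.shape[0]
    img = np.zeros((size, size, 3), dtype=np.float32)
    xy = skeleton_seq[..., :2]  # orthographic projection onto x-y plane
    # normalize coordinates to [0, 1] over the whole sequence
    mins = xy.reshape(-1, 2).min(axis=0)
    maxs = xy.reshape(-1, 2).max(axis=0)
    xy = (xy - mins) / np.maximum(maxs - mins, 1e-8)
    for t in range(frames):
        w = t / max(frames - 1, 1)           # temporal weight in [0, 1]
        color = np.array([w, 0.2, 1.0 - w])  # blue (early) -> red (late)
        for jx, jy in xy[t]:
            r = min(int(jy * (size - 1)), size - 1)
            c = min(int(jx * (size - 1)), size - 1)
            img[r, c] = color
    return img

# example: a synthetic 20-frame, 22-joint gesture
seq = np.random.rand(20, 22, 3)
print(encode_gesture_image(seq).shape)  # (64, 64, 3)
```

Images produced this way can be fed to an ordinary image-classification CNN; rendering the same sequence from multiple viewpoints would yield the multiple streams the e2eET architecture ensembles over.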
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Gesture Recognition | SHREC 2017 | 14 Gestures Accuracy | 97.86 | e2eET |
| Gesture Recognition | SHREC 2017 | 28 Gestures Accuracy | 95.36 | e2eET |
| Gesture Recognition | DHG-14 | Accuracy | 95.83 | e2eET |
| Gesture Recognition | DHG-28 | Accuracy | 92.38 | e2eET |
| Action Recognition | First-Person Hand Action Benchmark (FPHA) | 1:1 Accuracy | 91.83 | e2eET |
| Action Recognition | SBU / SBU-Refine | Accuracy | 93.96 | e2eET |