Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong, Han-Seok Seo, Son Lam Phung, Xin Li, Khoa Luu

2022-03-19 · CVPR 2022 · Tasks: Action Classification, Gesture Recognition, Action Recognition, Action Recognition In Videos, Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Human action recognition has recently become one of the most popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions of video action recognition with competitive results. However, these methods suffer from fundamental limitations in robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition. The method takes a simple but novel Transformer-based perspective to understand the correct order of action sequences. The contributions of this work are therefore three-fold. First, we introduce the problem of ordered temporal learning to action recognition. Second, a new Directed Attention mechanism is introduced to understand and attend to human actions in the right order. Third, we introduce conditional dependency in action sequence modeling that covers both orders and classes. The proposed approach consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and Something-Something-V2.
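The abstract describes the Directed Attention idea only at a high level. As a rough, illustrative sketch of what an order-aware ("directed") attention layer over frame tokens might look like, the PyTorch module below adds a learnable bias indexed by the signed temporal offset between frames, so attention from earlier to later frames can be weighted differently from the reverse direction. The class name, the signed-offset bias, and all shapes are assumptions for illustration; this is not the authors' DirecFormer implementation.

import torch
import torch.nn as nn

class DirectedAttentionSketch(nn.Module):
    """Illustrative single-head attention over frame tokens with a learnable
    directional bias on temporal order. Hypothetical simplification, not the
    paper's exact DirecFormer layer."""

    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable bias indexed by the signed temporal offset (i - j),
        # so "past -> future" attention can be weighted differently from
        # "future -> past" (the directed / order-aware idea).
        self.dir_bias = nn.Parameter(torch.zeros(2 * num_frames - 1))
        self.num_frames = num_frames
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) -- one token per frame for simplicity;
        # assumes the sequence length equals num_frames.
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (b, t, t)

        # Map signed offsets (i - j) in [-(t-1), t-1] to bias indices 0..2t-2.
        idx = torch.arange(t, device=x.device)
        offsets = idx[:, None] - idx[None, :] + (self.num_frames - 1)
        attn = attn + self.dir_bias[offsets]

        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)

# Usage: 8 frame tokens of width 64
layer = DirectedAttentionSketch(dim=64, num_frames=8)
frames = torch.randn(2, 8, 64)
print(layer(frames).shape)  # torch.Size([2, 8, 64])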

Results

Task                 | Dataset                      | Metric         | Value | Model
Video                | Kinetics-400                 | Acc@1          | 82.75 | DirecFormer
Video                | Kinetics-400                 | Acc@5          | 94.86 | DirecFormer
Activity Recognition | Jester (Gesture Recognition) | Val            | 98.15 | DirecFormer
Activity Recognition | Something-Something V2       | Top-1 Accuracy | 64.94 | DirecFormer
Activity Recognition | Something-Something V2       | Top-5 Accuracy | 87.9  | DirecFormer
Action Recognition   | Jester (Gesture Recognition) | Val            | 98.15 | DirecFormer
Action Recognition   | Something-Something V2       | Top-1 Accuracy | 64.94 | DirecFormer
Action Recognition   | Something-Something V2       | Top-5 Accuracy | 87.9  | DirecFormer

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation (2025-09-04)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions (2025-07-06)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? (2025-06-25)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)