Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Tensor Representations for Action Recognition

Piotr Koniusz, Lei Wang, Anoop Cherian

Published: 2020-12-28
Tasks: Skeleton Based Action Recognition · Action Recognition · Action Recognition In Videos
Links: Paper · PDF · Code

Abstract

Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations: (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK(+), that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce a linearization of these kernels that leads to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which has long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order r, built from Z* dimensional features, coupled with EPN indeed detects if at least one higher-order occurrence is `projected' into one of its binom(Z*,r) subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such `detectors'.
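To make the core mechanism concrete, here is a minimal sketch of Eigenvalue Power Normalization for the simplest (order-2) case: features are aggregated into an autocorrelation matrix, and the eigenvalues of that matrix are raised to a power gamma, flattening the spectrum so that the descriptor responds to whether a subspace is occupied rather than how often. The function names, the toy features, and the restriction to order 2 are illustrative assumptions on our part; the paper itself works with tensors of general order r.

```python
import numpy as np

def order2_descriptor(features):
    """Aggregate per-frame feature vectors (T x d) into an order-2
    tensor descriptor: an averaged outer-product (autocorrelation)
    matrix. (Illustrative helper, not the paper's exact pipeline.)"""
    X = np.asarray(features, dtype=float)
    return X.T @ X / X.shape[0]

def eigenvalue_power_normalization(M, gamma=0.5):
    """Apply EPN to a symmetric positive semi-definite matrix M:
    eigen-decompose, raise eigenvalues to the power gamma, and
    reconstruct. With gamma < 1 large eigenvalues are damped and
    small ones boosted, acting as a soft subspace-occurrence
    detector rather than an occurrence counter."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)  # guard tiny negative values
    return eigvecs @ np.diag(eigvals ** gamma) @ eigvecs.T

# Toy usage with random per-frame features from a hypothetical sequence.
rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 8))  # 50 frames, 8-dim features
desc = eigenvalue_power_normalization(order2_descriptor(feats), gamma=0.5)
```

Note that `gamma=1.0` leaves the matrix unchanged, so the normalization strength can be tuned continuously between raw correlation counting and pure subspace detection.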

Results

Task | Dataset | Metric | Value | Model
Video | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Video | Florence 3D | Accuracy | 95.23 | SCK+DCK
Video | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Video | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Video | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Video | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Temporal Action Localization | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Temporal Action Localization | Florence 3D | Accuracy | 95.23 | SCK+DCK
Temporal Action Localization | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Temporal Action Localization | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Zero-Shot Learning | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Zero-Shot Learning | Florence 3D | Accuracy | 95.23 | SCK+DCK
Zero-Shot Learning | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Zero-Shot Learning | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 86.11 | SCK⊕(I3D)+IDT
Activity Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Activity Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK
Activity Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Activity Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Activity Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Activity Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Action Localization | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Action Localization | Florence 3D | Accuracy | 95.23 | SCK+DCK
Action Localization | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Action Localization | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Action Localization | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Action Localization | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Action Detection | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Action Detection | Florence 3D | Accuracy | 95.23 | SCK+DCK
Action Detection | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Action Detection | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Action Detection | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Action Detection | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
3D Action Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
3D Action Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK
3D Action Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
3D Action Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK
3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
3D Action Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 86.11 | SCK⊕(I3D)+IDT
Action Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕
Action Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK
Action Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕
Action Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK
Action Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕
Action Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)