Piotr Koniusz, Lei Wang, Anoop Cherian
Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) sequence compatibility kernel (SCK) and (ii) dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK(+), that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce a linearization of these kernels that leads to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order r, built from Z*-dimensional features and coupled with EPN, indeed detects whether at least one higher-order occurrence is 'projected' into one of its binom(Z*,r) subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such 'detectors'.
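To make the last claim more concrete, here is a minimal sketch of how an order-3 tensor could be built from Z*-dimensional features and then power-normalized in its spectral domain. It rests on assumptions that the abstract does not state: the tensor is taken as an average of third-order outer products, and the power is applied elementwise to the core of a higher-order SVD (HOSVD). The function names (`order3_tensor`, `hosvd_symmetric`, `epn`) and the exponent `gamma` are hypothetical and chosen only for illustration.

```python
import numpy as np
from math import comb


def order3_tensor(features):
    """Average third-order outer product of the rows of `features` (N x Z)."""
    n = features.shape[0]
    return np.einsum("ni,nj,nk->ijk", features, features, features) / n


def hosvd_symmetric(tensor):
    """HOSVD of a super-symmetric order-3 tensor.

    All mode unfoldings share the same left singular vectors,
    so a single factor matrix `u` is enough.
    """
    z = tensor.shape[0]
    unfolded = tensor.reshape(z, -1)  # mode-1 unfolding, shape (Z, Z*Z)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    # Core tensor: project every mode onto the singular vectors.
    core = np.einsum("ijk,ia,jb,kc->abc", tensor, u, u, u)
    return core, u


def epn(tensor, gamma=0.5):
    """Assumed form of EPN: signed power on the HOSVD core, then reconstruct."""
    core, u = hosvd_symmetric(tensor)
    core = np.sign(core) * np.abs(core) ** gamma
    return np.einsum("abc,ia,jb,kc->ijk", core, u, u, u)


# Illustrative usage with random features (Z = 5, order r = 3).
rng = np.random.default_rng(0)
feats = rng.standard_normal((20, 5))
x = order3_tensor(feats)
x_hat = epn(x, gamma=0.5)

# Number of r-dimensional subspaces ("detectors") named in the abstract.
num_detectors = comb(feats.shape[1], 3)
```

With `gamma=1` the sketch returns the input tensor unchanged, while smaller values flatten the spectrum so that weak and strong higher-order occurrences contribute more evenly.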
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Video | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Video | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Video | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Video | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Video | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Temporal Action Localization | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Temporal Action Localization | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Temporal Action Localization | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Temporal Action Localization | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Zero-Shot Learning | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Zero-Shot Learning | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Zero-Shot Learning | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Zero-Shot Learning | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Zero-Shot Learning | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 86.11 | SCK⊕(I3D)+IDT |
| Activity Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Activity Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Activity Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Activity Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Action Localization | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Action Localization | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Action Localization | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Action Localization | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Action Localization | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Action Localization | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Action Detection | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Action Detection | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Action Detection | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Action Detection | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Action Detection | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Action Detection | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| 3D Action Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| 3D Action Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| 3D Action Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| 3D Action Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| 3D Action Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| 3D Action Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 86.11 | SCK⊕(I3D)+IDT |
| Action Recognition | Florence 3D | Accuracy | 97.45 | SCK⊕+DCK⊕ |
| Action Recognition | Florence 3D | Accuracy | 95.23 | SCK+DCK |
| Action Recognition | UT-Kinect | Accuracy | 99.2 | SCK⊕+DCK⊕ |
| Action Recognition | UT-Kinect | Accuracy | 98.2 | SCK+DCK |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 91.56 | SCK⊕ |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 94.75 | SCK⊕ |