Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors

Lei Wang, Piotr Koniusz

Published: 2020-01-14

Tasks: Action Classification · Optical Flow Estimation · Scene Recognition · Hallucination · Egocentric Activity Recognition · Action Recognition

Abstract

In this paper, we build on a concept of self-supervision by taking RGB frames as input to learn to predict both action concepts and auxiliary descriptors, e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, simultaneously fed into classification layers, and then hallucinated at the testing stage to aid the network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. The other descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As the numbers of coefficients in the mean, covariance, coskewness and cokurtosis grow linearly, quadratically, cubically and quartically w.r.t. the dimension of the feature vectors, we describe the covariance matrix by its leading n' eigenvectors (the so-called subspace) and we capture skewness/kurtosis rather than the costly coskewness/cokurtosis. We obtain state-of-the-art results on five popular datasets, including Charades and EPIC-Kitchens.
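The moment/subspace pooling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `X` is a hypothetical matrix of per-frame intermediate descriptors, and the function returns the mean, the leading n' eigenvectors of the covariance (the "subspace"), and per-dimension skewness/kurtosis in place of the cubically/quartically growing coskewness/cokurtosis tensors.

```python
import numpy as np

def moment_subspace_descriptor(X, n_prime=3, eps=1e-8):
    """Sketch of moment/subspace pooling over intermediate descriptors.

    X: (T, d) array, one d-dimensional descriptor per frame (hypothetical input).
    Returns: mean, subspace (d x n'), per-dim skewness, per-dim excess kurtosis.
    """
    mu = X.mean(axis=0)                        # 1st moment: d coefficients (linear in d)
    Xc = X - mu
    cov = Xc.T @ Xc / max(len(X) - 1, 1)       # 2nd moment: d^2 coefficients (quadratic in d)

    # Summarize the covariance by its leading n' eigenvectors (the subspace),
    # storing d*n' numbers instead of the full d*d matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: ascending eigenvalues
    subspace = eigvecs[:, -n_prime:]

    # Per-dimension skewness/kurtosis: d coefficients each, instead of the
    # d^3 coskewness and d^4 cokurtosis tensors.
    Z = Xc / (Xc.std(axis=0) + eps)
    skew = (Z ** 3).mean(axis=0)
    kurt = (Z ** 4).mean(axis=0) - 3.0         # excess kurtosis

    return mu, subspace, skew, kurt
```

For a video with T frames and d-dimensional descriptors, the pooled representation thus costs O(d·n' + 3d) coefficients rather than O(d^4).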

Results

| Task                 | Dataset          | Metric                       | Value | Model                                  |
|----------------------|------------------|------------------------------|-------|----------------------------------------|
| Action Classification| Charades         | mAP                          | 62.29 | DEEP-HAL with ODF+SDF (AssembleNet++)  |
| Action Classification| Charades         | mAP                          | 50.16 | DEEP-HAL with ODF+SDF (I3D)            |
| Scene Recognition    | YUP++            | Accuracy (%)                 | 94.4  | DEEP-HAL with ODF+SDF (I3D)            |
| Action Recognition   | HMDB-51          | Average accuracy (3 splits)  | 87.56 | DEEP-HAL with ODF+SDF (I3D)            |
| Activity Recognition | EPIC-KITCHENS-55 | Actions Top-1 (S1)           | 35.8  | DEEP-HAL with ODF+SDF (AssembleNet++)  |
| Activity Recognition | EPIC-KITCHENS-55 | Actions Top-1 (S2)           | 27.3  | DEEP-HAL with ODF+SDF (AssembleNet++)  |

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
UQLM: A Python Package for Uncertainty Quantification in Large Language Models (2025-07-08)
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (2025-07-07)