
Deep set conditioned latent representations for action recognition

Akash Singh, Tom De Schepper, Kevin Mets, Peter Hellinckx, Jose Oramas, Steven Latre

Published: 2022-12-21 · International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 2022
Tasks: Composite action recognition · Action Recognition · Atomic action recognition · Temporal Action Localization

Abstract

In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases, and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. The two streams learn to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
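The "unordered set-based latent representations" the abstract credits follow the general relational-network pattern: every pair of elements in a set is processed by a shared network and the outputs are summed, making the result permutation-invariant. Below is a minimal NumPy sketch of that pattern; it is an illustration of the idea, not the SCI3D implementation, and all names, shapes, and the single-layer g-network are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w, b):
    # Shared single-hidden-layer MLP with ReLU, applied row-wise.
    return np.maximum(x @ w + b, 0.0)

def relational_pool(objects, w_g, b_g):
    """Relational-network-style pooling over an unordered set of
    object feature vectors (shape: n_objects x d).

    Every ordered pair (o_i, o_j) is concatenated, passed through the
    shared network g, and the outputs are summed. Summation makes the
    result invariant to the ordering of the input set."""
    n, d = objects.shape
    pair_feats = np.stack([
        np.concatenate([objects[i], objects[j]])
        for i in range(n) for j in range(n)
    ])                                        # (n*n, 2d)
    return mlp(pair_feats, w_g, b_g).sum(axis=0)  # (hidden,)

d, hidden = 4, 8
objects = rng.normal(size=(3, d))             # toy "object" features
w_g = rng.normal(size=(2 * d, hidden))
b_g = np.zeros(hidden)

out = relational_pool(objects, w_g, b_g)
# Permuting the set leaves the pooled representation unchanged.
perm = relational_pool(objects[[2, 0, 1]], w_g, b_g)
assert np.allclose(out, perm)
```

The sum over all pairs is what gives the representation its set character: the downstream classifier sees the same vector no matter how the detected objects are ordered.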

Results

Task                      | Dataset | Metric      | Value | Model
--------------------------|---------|-------------|-------|--------------------
Activity Recognition      | CATER   | Average-mAP | 96.77 | SCI3D
Activity Recognition      | CATER   | Average-mAP | 95.28 | R3D-NL
Activity Recognition      | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Activity Recognition      | CATER   | Average-mAP | 63.85 | FasterRCNN
Activity Recognition      | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Activity Recognition      | CATER   | Average-mAP | 66.71 | SCI3D
Activity Recognition      | CATER   | Average-mAP | 52.19 | R3D-NL
Activity Recognition      | CATER   | Average-mAP | 25.45 | FasterRCNN
Action Recognition        | CATER   | Average-mAP | 96.77 | SCI3D
Action Recognition        | CATER   | Average-mAP | 95.28 | R3D-NL
Action Recognition        | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Action Recognition        | CATER   | Average-mAP | 63.85 | FasterRCNN
Action Recognition        | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Action Recognition        | CATER   | Average-mAP | 66.71 | SCI3D
Action Recognition        | CATER   | Average-mAP | 52.19 | R3D-NL
Action Recognition        | CATER   | Average-mAP | 25.45 | FasterRCNN
Atomic action recognition | CATER   | Average-mAP | 96.77 | SCI3D
Atomic action recognition | CATER   | Average-mAP | 95.28 | R3D-NL
Atomic action recognition | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Atomic action recognition | CATER   | Average-mAP | 63.85 | FasterRCNN
Atomic action recognition | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Atomic action recognition | CATER   | Average-mAP | 66.71 | SCI3D
Atomic action recognition | CATER   | Average-mAP | 52.19 | R3D-NL
Atomic action recognition | CATER   | Average-mAP | 25.45 | FasterRCNN
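All values above are Average-mAP scores. For reference, per-class average precision in multi-label recognition is commonly computed as the mean of precision@k over the ranks at which positive examples appear, and mAP averages that over classes. A minimal NumPy sketch of this common formulation (function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean of precision@k over the ranks k at
    which a positive example appears."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)            # rank by descending score
    labels = labels[order]
    hits = np.cumsum(labels)               # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    precision_at_k = hits / ranks
    return float((precision_at_k * labels).sum() / labels.sum())

def mean_ap(label_matrix, score_matrix):
    """Mean over classes of per-class AP (rows: videos, cols: classes)."""
    n_classes = label_matrix.shape[1]
    return float(np.mean([
        average_precision(label_matrix[:, c], score_matrix[:, c])
        for c in range(n_classes)
    ]))

# Two positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Note that CATER's evaluation additionally localises actions in time, so its Average-mAP averages over temporal overlap thresholds as well; the sketch covers only the classification-style mAP at its core.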

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)