
Deep set conditioned latent representations for action recognition

Akash Singh, Tom De Schepper, Kevin Mets, Peter Hellinckx, Jose Oramas, Steven Latre

Published: 2022-12-21 · International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 2022
Tasks: Composite action recognition · Action Recognition · Atomic action recognition · Temporal Action Localization

Abstract

In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases, and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. The two streams learn to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
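The "unordered set-based latent representations" the abstract credits follow the general relational-network pattern: every pair of elements in a set is processed by a shared network and the outputs are summed, making the result permutation-invariant. Below is a minimal NumPy sketch of that pattern; it is an illustration of the idea, not the SCI3D implementation, and all names, shapes, and the single-layer g-network are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w, b):
    # Shared single-hidden-layer MLP with ReLU, applied row-wise.
    return np.maximum(x @ w + b, 0.0)

def relational_pool(objects, w_g, b_g):
    """Relational-network-style pooling over an unordered set of
    object feature vectors (shape: n_objects x d).

    Every ordered pair (o_i, o_j) is concatenated, passed through the
    shared network g, and the outputs are summed. Summation makes the
    result invariant to the ordering of the input set."""
    n, d = objects.shape
    pair_feats = np.stack([
        np.concatenate([objects[i], objects[j]])
        for i in range(n) for j in range(n)
    ])                                        # (n*n, 2d)
    return mlp(pair_feats, w_g, b_g).sum(axis=0)  # (hidden,)

d, hidden = 4, 8
objects = rng.normal(size=(3, d))             # toy "object" features
w_g = rng.normal(size=(2 * d, hidden))
b_g = np.zeros(hidden)

out = relational_pool(objects, w_g, b_g)
# Permuting the set leaves the pooled representation unchanged.
perm = relational_pool(objects[[2, 0, 1]], w_g, b_g)
assert np.allclose(out, perm)
```

The sum over all pairs is what gives the representation its set character: the downstream classifier sees the same vector no matter how the detected objects are ordered.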

Results

Task                      | Dataset | Metric      | Value | Model
--------------------------|---------|-------------|-------|--------------------
Activity Recognition      | CATER   | Average-mAP | 96.77 | SCI3D
Activity Recognition      | CATER   | Average-mAP | 95.28 | R3D-NL
Activity Recognition      | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Activity Recognition      | CATER   | Average-mAP | 63.85 | FasterRCNN
Activity Recognition      | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Activity Recognition      | CATER   | Average-mAP | 66.71 | SCI3D
Activity Recognition      | CATER   | Average-mAP | 52.19 | R3D-NL
Activity Recognition      | CATER   | Average-mAP | 25.45 | FasterRCNN
Action Recognition        | CATER   | Average-mAP | 96.77 | SCI3D
Action Recognition        | CATER   | Average-mAP | 95.28 | R3D-NL
Action Recognition        | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Action Recognition        | CATER   | Average-mAP | 63.85 | FasterRCNN
Action Recognition        | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Action Recognition        | CATER   | Average-mAP | 66.71 | SCI3D
Action Recognition        | CATER   | Average-mAP | 52.19 | R3D-NL
Action Recognition        | CATER   | Average-mAP | 25.45 | FasterRCNN
Atomic action recognition | CATER   | Average-mAP | 96.77 | SCI3D
Atomic action recognition | CATER   | Average-mAP | 95.28 | R3D-NL
Atomic action recognition | CATER   | Average-mAP | 91.82 | Single stream SCI3D
Atomic action recognition | CATER   | Average-mAP | 63.85 | FasterRCNN
Atomic action recognition | CATER   | Average-mAP | 69.76 | Single stream SCI3D
Atomic action recognition | CATER   | Average-mAP | 66.71 | SCI3D
Atomic action recognition | CATER   | Average-mAP | 52.19 | R3D-NL
Atomic action recognition | CATER   | Average-mAP | 25.45 | FasterRCNN
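All values above are Average-mAP scores. For reference, per-class average precision in multi-label recognition is commonly computed as the mean of precision@k over the ranks at which positive examples appear, and mAP averages that over classes. A minimal NumPy sketch of this common formulation (function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean of precision@k over the ranks k at
    which a positive example appears."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)            # rank by descending score
    labels = labels[order]
    hits = np.cumsum(labels)               # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    precision_at_k = hits / ranks
    return float((precision_at_k * labels).sum() / labels.sum())

def mean_ap(label_matrix, score_matrix):
    """Mean over classes of per-class AP (rows: videos, cols: classes)."""
    n_classes = label_matrix.shape[1]
    return float(np.mean([
        average_precision(label_matrix[:, c], score_matrix[:, c])
        for c in range(n_classes)
    ]))

# Two positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Note that CATER's evaluation additionally localises actions in time, so its Average-mAP averages over temporal overlap thresholds as well; the sketch covers only the classification-style mAP at its core.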

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)