Hierarchical Feature Aggregation Networks for Video Action Recognition

Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

2019-05-29 Action Recognition, Temporal Action Localization

Abstract

Most action recognition methods are based on either a) late aggregation of frame-level CNN features using average pooling, max pooling, or an RNN, among others, or b) spatio-temporal aggregation via 3D convolutions. The first assumes independence among frame features up to a certain level of abstraction and then performs higher-level aggregation, while the second extracts spatio-temporal features from grouped frames as early fusion. In this paper we explore the space between these two by letting adjacent feature branches interact as they develop into the higher-level representation. The interaction happens between feature differencing and averaging at each level of the hierarchy, and it has a convolutional structure that learns to select the appropriate mode locally, in contrast to previous works that impose one of the modes globally (e.g. feature differencing) as a design choice. We further constrain this interaction to be conservative, e.g. a local feature subtraction in one branch is compensated by an addition in another, such that the total feature flow is preserved. We evaluate our proposal on a number of existing models, namely TSN, TRN and ECO, to show its flexibility and effectiveness in improving action recognition performance.
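The conservative interaction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name conservative_interaction, the tanh gate, and the 0.5 scaling are assumptions chosen for illustration, and the gate logits are passed in directly rather than produced by a learned convolution. The sketch only shows how a local gate could move between an averaging mode and a differencing mode while keeping the sum of the two branches unchanged.

```python
import math

def conservative_interaction(x_a, x_b, gate_logits):
    """Hypothetical sketch: exchange features between two adjacent branches.

    Whatever is subtracted from one branch is added to the other, so the
    element-wise sum x_a + x_b is preserved (the "conservative" constraint).
    """
    out_a, out_b = [], []
    for a, b, z in zip(x_a, x_b, gate_logits):
        g = math.tanh(z)          # assumed local gate in (-1, 1)
        flow = 0.5 * g * (a - b)  # amount moved from branch a to branch b
        out_a.append(a - flow)    # g near +1: both outputs approach the average
        out_b.append(b + flow)    # g near -1: the difference a - b is amplified
    return out_a, out_b

# Toy features for two adjacent branches (values are made up).
x_a = [1.0, 4.0, -2.0]
x_b = [3.0, 0.0, 2.0]
# In a real model these logits would come from a learned layer; here they are fixed.
gate_logits = [10.0, -10.0, 0.0]

y_a, y_b = conservative_interaction(x_a, x_b, gate_logits)
```

With a strongly positive logit the two outputs move toward their mean, with a strongly negative logit their difference grows, and with a zero logit the features pass through unchanged. In every case the per-element sum of the two branches stays the same.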

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	HMDB-51	Average accuracy of 3 splits	71.13	HF-ECOLite (ImageNet+Kinetics pretrain)
Activity Recognition	Something-Something V1	Top 1 Accuracy	41.97	HF-TSN (ImageNet pretraining)
Action Recognition	HMDB-51	Average accuracy of 3 splits	71.13	HF-ECOLite (ImageNet+Kinetics pretrain)
Action Recognition	Something-Something V1	Top 1 Accuracy	41.97	HF-TSN (ImageNet pretraining)
