Gate-Shift Networks for Video Action Recognition

Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

2019-12-01 | CVPR 2020 | Action Recognition

Abstract

Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space. In practice, however, because of the large number of parameters and computations involved, they may underperform when sufficiently large datasets are not available to train them at scale. In this paper we introduce spatial gating in the spatio-temporal decomposition of 3D kernels. We implement this concept with the Gate-Shift Module (GSM). GSM is lightweight and turns a 2D-CNN into a highly efficient spatio-temporal feature extractor. With GSM plugged in, a 2D-CNN learns to adaptively route features through time and combine them, with almost no additional parameters or computational overhead. We perform an extensive evaluation of the proposed module to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity.
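The gate-and-shift idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `gate_shift`, the scalar `weight`, and the use of a tanh over the channel mean as the gate are all assumptions made for illustration, standing in for the learned gating layer. The sketch only shows the routing pattern: a spatial gate decides, per position, how much of each feature is shifted to a neighbouring frame and how much stays in place.

```python
import math


def gate_shift(frames, weight=1.0):
    """Toy gate-and-shift over frames[t][c][p].

    frames: T frames, each with C channels (C even), each channel a flat
    list of P spatial positions. `weight` is a hypothetical scalar that
    stands in for a learned gating layer (assumption, not from the paper).
    """
    T, C, P = len(frames), len(frames[0]), len(frames[0][0])
    half = C // 2

    gated = [[[0.0] * P for _ in range(C)] for _ in range(T)]
    residual = [[[0.0] * P for _ in range(C)] for _ in range(T)]
    for t in range(T):
        for p in range(P):
            # Spatial gate: one value per frame and position (assumed form).
            mean = sum(frames[t][c][p] for c in range(C)) / C
            g = math.tanh(weight * mean)
            for c in range(C):
                x = frames[t][c][p]
                gated[t][c][p] = g * x         # part routed through time
                residual[t][c][p] = x - g * x  # part kept in place

    out = [[[0.0] * P for _ in range(C)] for _ in range(T)]
    for t in range(T):
        for c in range(C):
            # First half of the channels receives gated features from the
            # previous frame, second half from the next frame.
            src = t - 1 if c < half else t + 1
            for p in range(P):
                shifted = gated[src][c][p] if 0 <= src < T else 0.0
                out[t][c][p] = residual[t][c][p] + shifted
    return out
```

With a closed gate (`weight=0.0`) the sketch returns its input unchanged, and with a saturated gate it reduces to a plain temporal shift of the two channel groups; intermediate gate values mix the two behaviours per spatial position. The sketch adds no convolution weights of its own, but it says nothing about the parameter count of the actual module.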

Results

Task	Dataset	Metric	Value (%)	Model
Activity Recognition	Something-Something V1	Top 1 Accuracy	55.16	GSM Ensemble InceptionV3 (ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	51.68	GSM InceptionV3 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	55.16	GSM Ensemble InceptionV3 (ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	51.68	GSM InceptionV3 (16 frames, ImageNet pretrained)
