Learning spatio-temporal representations with temporal squeeze pooling

Guoxi Huang, Adrian G. Bors

2020-02-11Representation Learning Video Classification General Classification Action Recognition Classification

Abstract

In this paper, we propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which can extract the essential movement information from a long sequence of video frames and map it into a set of few images, named Squeezed Images. By embedding the Temporal Squeeze pooling as a layer into off-the-shelf Convolution Neural Networks (CNN), we design a new video classification model, named Temporal Squeeze Network (TeSNet). The resulting Squeezed Images contain the essential movement information from the video frames, corresponding to the optimization of the video classification task. We evaluate our architecture on two video classification benchmarks, and the results achieved are compared to the state-of-the-art.

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	HMDB-51	Average accuracy of 3 splits	71.5	TesNet (ImageNet pretrained)
Activity Recognition	UCF101	3-fold Accuracy	95.2	TesNet (ImageNet pretrained)
Action Recognition	HMDB-51	Average accuracy of 3 splits	71.5	TesNet (ImageNet pretrained)
Action Recognition	UCF101	3-fold Accuracy	95.2	TesNet (ImageNet pretrained)

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20 Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17 Boosting Team Modeling through Tempo-Relational Representation Learning2025-07-17 A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains2025-07-17 Adversarial attacks to image classification systems using evolutionary algorithms2025-07-17 Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16 Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?2025-07-16 Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16