TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Convolutional Two-Stream Network Fusion for Video Action R...

Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

2016-04-22CVPR 2016 6Action RecognitionAction Recognition In VideosTemporal Action LocalizationVocal Bursts Valence Prediction
PaperPDFCodeCode(official)

Abstract

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

Results

TaskDatasetMetricValueModel
Activity RecognitionHMDB-51Average accuracy of 3 splits65.4S:VGG-16, T:VGG-16 (ImageNet pretrained)
Activity RecognitionUCF1013-fold Accuracy92.5S:VGG-16, T:VGG-16 (ImageNet pretrain)
Action RecognitionHMDB-51Average accuracy of 3 splits65.4S:VGG-16, T:VGG-16 (ImageNet pretrained)
Action RecognitionUCF1013-fold Accuracy92.5S:VGG-16, T:VGG-16 (ImageNet pretrain)

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains2025-07-17DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition2025-07-16Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment2025-07-01EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception2025-06-26Feature Hallucination for Self-supervised Action Recognition2025-06-25CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition2025-06-25Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition2025-06-23Adapting Vision-Language Models for Evaluating World Models2025-06-22