Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, Thomas Brox

2017-04-03 · ICCV 2017

Tasks: Action Classification, Action Localization, Skeleton Based Action Recognition, Spatio-Temporal Action Localization, General Classification, Action Recognition, Temporal Action Localization

Abstract

General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
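The chained integration described above can be illustrated with a toy sketch. This is a hypothetical simplification, not the authors' implementation: real streams are deep CNNs trained jointly, and the fusion weights are learned rather than fixed. It only shows the core idea of the Markov chain model, where each stage refines the class posterior of the previous stage by adding one more cue.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream(x, w):
    # One toy "stream": a linear map followed by a softmax over action classes.
    # (In the paper each stream is a deep network; this is a stand-in.)
    z = x @ w
    e = np.exp(z - z.max())
    return e / e.sum()

n_classes = 5
pose, flow, rgb = rng.normal(size=(3, 8))            # toy per-clip features
w_pose, w_flow, w_rgb = rng.normal(size=(3, 8, n_classes))

# Chained (Markov-style) integration: cues are added successively, and each
# stage mixes its own prediction with the posterior of the previous stage.
# The 0.5/0.5 mixing weights are an arbitrary illustrative choice.
p = stream(pose, w_pose)                   # stage 1: pose only
p = 0.5 * p + 0.5 * stream(flow, w_flow)   # stage 2: add motion
p = 0.5 * p + 0.5 * stream(rgb, w_rgb)     # stage 3: add appearance

assert np.isclose(p.sum(), 1.0)            # still a valid class distribution
```

Because every stage outputs a full class distribution, the chain can be truncated early (e.g. pose only) and still produce a usable prediction, which matches the paper's separate pose-only and RGB+pose results.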

Results

| Dataset               | Metric                          | Value | Model                      |
|-----------------------|---------------------------------|-------|----------------------------|
| JHMDB (2D poses only) | Average accuracy over 3 splits  | 56.8  | Chained                    |
| J-HMDB                | Accuracy (RGB + pose)           | 76.1  | Chained (RGB + Flow + Pose)|
| J-HMDB                | Accuracy (pose)                 | 56.8  | Chained (RGB + Flow + Pose)|

The same three results are listed on the page under multiple task tags (Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition, Action Recognition); the underlying numbers are identical.

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)