Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

PERF-Net: Pose Empowered RGB-Flow Net

Yinxiao Li, Zhichao Lu, Xuehan Xiong, Jonathan Huang

Published: 2020-09-28
Tasks: Action Classification, Action Recognition, Temporal Action Localization

Abstract

In recent years, many works in the video action recognition literature have shown that two-stream models (combining spatial and temporal input streams) are necessary for achieving state-of-the-art performance. In this paper we show the benefits of including yet another stream based on human pose estimated from each frame -- specifically by rendering pose on input RGB frames. At first blush, this additional stream may seem redundant given that human pose is fully determined by RGB pixel values -- however we show (perhaps surprisingly) that this simple and flexible addition can provide complementary gains. Using this insight, we then propose a new model, PERF-Net (short for Pose Empowered RGB-Flow Net), which combines this new pose stream with the standard RGB and flow-based input streams via distillation techniques. We show that our model outperforms the state of the art by a large margin on a number of human action recognition datasets while not requiring flow or pose to be explicitly computed at inference time. The proposed pose stream is also part of the winning solution of the ActivityNet Kinetics Challenge 2020.
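The abstract says the pose and flow streams are combined with the RGB stream "via distillation techniques" but does not state the loss. As a rough illustration only, the sketch below assumes a common multi-teacher setup: an RGB student is trained with a classification loss plus one feature-matching (mean squared error) term per teacher stream, so that only the student is needed at inference time. The function names, the MSE choice, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_distillation_loss(student_logits, student_features, label,
                                    teacher_features, weight=1.0):
    """Classification loss plus one feature-matching term per teacher.

    teacher_features maps a stream name (e.g. "flow", "pose") to the
    feature vector produced by that (frozen) teacher for the same clip.
    This is an assumed formulation, not the paper's stated loss.
    """
    loss = cross_entropy(student_logits, label)
    for feats in teacher_features.values():
        loss += weight * mse(student_features, feats)
    return loss


# Toy example: an RGB student with hypothetical flow and pose teachers.
student_logits = [2.0, 0.5, -1.0]
student_features = [0.2, 0.4, 0.6]
teachers = {"flow": [0.1, 0.4, 0.7], "pose": [0.2, 0.5, 0.6]}
loss = multi_teacher_distillation_loss(student_logits, student_features, 0, teachers)
print(f"total loss: {loss:.4f}")
```

Under this assumed formulation the teachers are used only during training; at inference the student consumes RGB frames alone, which matches the abstract's claim that flow and pose need not be computed explicitly at test time.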

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-600 | Top-1 Accuracy | 82 | PERF-Net (distilled ResNet50-G) |
| Video | Kinetics-600 | Top-5 Accuracy | 95.7 | PERF-Net (distilled ResNet50-G) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 83.2 | PERF-Net (distilled S3D-G) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 98.6 | PERF-Net (multi-distilled S3D) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 83.2 | PERF-Net (distilled S3D-G) |
| Action Recognition | UCF101 | 3-fold Accuracy | 98.6 | PERF-Net (multi-distilled S3D) |

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)