Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Can Zhang, Yuexian Zou, Guang Chen, Lei Gan

2020-08-08 · Optical Flow Estimation · Video Understanding · Action Recognition

Paper · PDF · Code (official) · Code

Abstract

Efficiently modeling dynamic motion information in videos is crucial for the action recognition task. Most state-of-the-art methods rely heavily on dense optical flow as a motion representation. Although combining optical flow with RGB frames as input can achieve excellent recognition performance, optical flow extraction is very time-consuming, which undoubtedly counts against real-time action recognition. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. Our motivation lies in the observation that small displacements of motion boundaries are the most critical ingredients for distinguishing actions, so we design a novel motion cue called Persistence of Appearance (PA). In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries. It is also more efficient, accumulating only pixel-wise differences in feature space instead of exhaustively searching all possible motion vectors patch-wise. Our PA is over 1000x faster (8196 fps vs. 8 fps) than conventional optical flow in terms of motion modeling speed. To further aggregate the short-term dynamics in PA into long-term dynamics, we also devise a global temporal fusion strategy called Various-timescale Aggregation Pooling (VAP) that can adaptively model long-range temporal relationships across various timescales. We finally incorporate the proposed PA and VAP into a unified framework called the Persistent Appearance Network (PAN) with strong temporal modeling ability. Extensive experiments on six challenging action recognition benchmarks verify that our PAN outperforms recent state-of-the-art methods at low FLOPs. Codes and models are available at: https://github.com/zhang-can/PAN-PyTorch.
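The core efficiency claim of the abstract is that PA replaces patch-wise motion-vector search with pixel-wise differences accumulated in feature space. A minimal sketch of that idea is below; the function name, numpy implementation, and toy inputs are illustrative assumptions, not the official PAN-PyTorch code.

```python
import numpy as np

def persistence_of_appearance(feat_t, feat_t1):
    """Sketch of a PA-style motion cue (illustrative, not the official code).

    feat_t, feat_t1: (C, H, W) feature maps of two consecutive frames.
    Returns an (H, W) map of pixel-wise difference magnitudes accumulated
    across channels -- no exhaustive patch-wise motion-vector search, so the
    cost is a single pass over the feature tensors.
    """
    diff = feat_t1 - feat_t                  # (C, H, W) pixel-wise differences
    return np.sqrt((diff ** 2).sum(axis=0))  # accumulate over channels

# Toy example: a bright blob shifts one pixel to the right between frames,
# so the PA map responds exactly at the old and new blob locations.
f0 = np.zeros((3, 4, 4)); f0[:, 1, 1] = 1.0
f1 = np.zeros((3, 4, 4)); f1[:, 1, 2] = 1.0
pa = persistence_of_appearance(f0, f1)
```

Because the response is driven by where pixel values change, it naturally concentrates at motion boundaries, which is the property the abstract highlights.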

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | Jester (Gesture Recognition) | Val | 97.4 | PAN ResNet101 (RGB only, no Flow)
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 55.3 | PAN ResNet101 (RGB only, no Flow)
Activity Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | PAN ResNet101 (RGB only, no Flow)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.5 | PAN ResNet101 (RGB only, no Flow)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.6 | PAN ResNet101 (RGB only, no Flow)
Action Recognition | Jester (Gesture Recognition) | Val | 97.4 | PAN ResNet101 (RGB only, no Flow)
Action Recognition | Something-Something V1 | Top-1 Accuracy | 55.3 | PAN ResNet101 (RGB only, no Flow)
Action Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | PAN ResNet101 (RGB only, no Flow)
Action Recognition | Something-Something V2 | Top-1 Accuracy | 66.5 | PAN ResNet101 (RGB only, no Flow)
Action Recognition | Something-Something V2 | Top-5 Accuracy | 90.6 | PAN ResNet101 (RGB only, no Flow)

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)