Learning Video Representations from Correspondence Proposals

Xingyu Liu, Joon-Young Lee, Hailin Jin

2019-05-20 | CVPR 2019 | Action Recognition, Action Recognition In Videos

Abstract

Correspondences between frames encode rich information about dynamic content in videos. However, it is challenging to capture and learn them effectively because of their irregular structure and complex dynamics. In this paper, we propose a novel neural network that learns video representations by aggregating information from potential correspondences. This network, named CPNet, can learn evolving 2D fields with temporal consistency. In particular, it can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input. We provide extensive ablation experiments to validate our model. CPNet shows stronger performance than existing methods on Kinetics and achieves state-of-the-art performance on Something-Something and Jester. We provide an analysis of the behavior of our model and show its robustness to errors in proposals.
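
The abstract does not spell out how information from potential correspondences is aggregated, so the following is only a rough NumPy sketch under stated assumptions: proposals are taken to be the k nearest neighbors in feature space that lie in a different frame, each pair is described by both feature vectors plus their (t, h, w) displacement, and the pairs are reduced by a max over k. The function names, the single linear layer, and the `weight` parameter are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def correspondence_proposals(feats, k):
    """Return (T*H*W, k) indices of the k nearest feature vectors
    located in a different frame. feats has shape (T, H, W, C).
    Assumption: 'proposals' are feature-space nearest neighbors."""
    T, H, W, C = feats.shape
    flat = feats.reshape(T * H * W, C)
    # Pairwise squared Euclidean distances between all positions.
    sq = (flat ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T
    # Frame index of every flattened position; same-frame pairs are masked out.
    frame = np.repeat(np.arange(T), H * W)
    dist[frame[:, None] == frame[None, :]] = np.inf
    # Indices of the k smallest distances in each row.
    return np.argsort(dist, axis=1)[:, :k]

def aggregate_proposals(feats, idx, weight):
    """Hypothetical aggregation: linear layer + ReLU on
    [source feature, proposal feature, displacement], then max over k."""
    T, H, W, C = feats.shape
    flat = feats.reshape(-1, C)
    k = idx.shape[1]
    # (t, h, w) coordinate of every flattened position.
    grid = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 3).astype(feats.dtype)
    src = np.repeat(flat[:, None, :], k, axis=1)      # (N, k, C)
    dst = flat[idx]                                   # (N, k, C)
    disp = coords[idx] - coords[:, None, :]           # (N, k, 3)
    pair = np.concatenate([src, dst, disp], axis=-1)  # (N, k, 2C + 3)
    hidden = np.maximum(pair @ weight, 0.0)           # (N, k, C_out)
    return hidden.max(axis=1).reshape(T, H, W, -1)

# Toy usage with random features: 3 frames of 2x2 positions, 4 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 2, 2, 4))
idx = correspondence_proposals(feats, k=2)
out = aggregate_proposals(feats, idx, rng.standard_normal((2 * 4 + 3, 5)))
```

Because the max is taken over all k proposals, a single wrong proposal does not have to dominate the output, which is one plausible reading of the robustness claim in the abstract. The sketch requires at least two frames and k no larger than the number of positions outside the source frame.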

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	Jester (Gesture Recognition)	Val	96.7	CPNet Res34, 5 CP
Activity Recognition	Something-Something V2	Top-1 Accuracy	57.65	CPNet Res34, 5 CP
Activity Recognition	Something-Something V2	Top-5 Accuracy	83.95	CPNet Res34, 5 CP
Action Recognition	Jester (Gesture Recognition)	Val	96.7	CPNet Res34, 5 CP
Action Recognition	Something-Something V2	Top-1 Accuracy	57.65	CPNet Res34, 5 CP
Action Recognition	Something-Something V2	Top-5 Accuracy	83.95	CPNet Res34, 5 CP
Action Recognition In Videos	Jester (Gesture Recognition)	Val	96.7	CPNet Res34, 5 CP
Action Recognition In Videos	Something-Something V2	Top-1 Accuracy	57.65	CPNet Res34, 5 CP
Action Recognition In Videos	Something-Something V2	Top-5 Accuracy	83.95	CPNet Res34, 5 CP
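
For readers unfamiliar with the Top-1 and Top-5 metrics in the table, the sketch below shows how top-k accuracy is commonly computed from per-class scores. It is a generic illustration, not the evaluation code used for the reported numbers; the function name and the toy scores are made up for the example.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k
    highest-scoring classes. scores: (N, num_classes), labels: (N,)."""
    # Indices of the k largest scores in each row.
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy example with three samples and three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-k accuracy never decreases as k grows, which is why the Top-5 figure in the table is higher than the Top-1 figure.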
