Learning Video Representations from Correspondence Proposals

Xingyu Liu, Joon-Young Lee, Hailin Jin

2019-05-20 | CVPR 2019 | Action Recognition, Action Recognition In Videos

Abstract

Correspondences between frames encode rich information about dynamic content in videos. However, they are challenging to capture and learn effectively because of their irregular structure and complex dynamics. In this paper, we propose a novel neural network that learns video representations by aggregating information from potential correspondences. This network, named CPNet, can learn evolving 2D fields with temporal consistency. In particular, it can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input. We provide extensive ablation experiments to validate our model. CPNet shows stronger performance than existing methods on Kinetics and achieves state-of-the-art performance on Something-Something and Jester. We analyze the behavior of our model and show its robustness to errors in proposals.
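
The abstract does not spell out how potential correspondences are found or aggregated. As a rough illustration only, the sketch below assumes that proposals are the k nearest neighbors in feature space drawn from other frames, and that aggregation is a max over the proposed neighbors. The function names `propose_correspondences` and `aggregate_proposals`, the tensor layout, and the max-pooling step are all hypothetical choices made for this example, not the authors' implementation.

```python
import numpy as np

def propose_correspondences(features, k=1):
    """Hypothetical sketch: for every position, propose the k most similar
    positions (by squared Euclidean feature distance) from OTHER frames.

    features: array of shape (T, N, C) = frames, positions per frame, channels.
    Returns an int array of shape (T*N, k) indexing the flattened positions.
    Assumes (T - 1) * N >= k so that enough cross-frame candidates exist.
    """
    T, N, C = features.shape
    flat = features.reshape(T * N, C).astype(np.float64)

    # Pairwise squared distances: ||a||^2 + ||b||^2 - 2 a.b
    sq = (flat ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (flat @ flat.T)

    # Exclude candidates from the same frame so proposals link different frames.
    frame_id = np.repeat(np.arange(T), N)
    dist[frame_id[:, None] == frame_id[None, :]] = np.inf

    return np.argsort(dist, axis=1)[:, :k]

def aggregate_proposals(features, idx):
    """Assumed aggregation: element-wise max over each position's proposals."""
    flat = features.reshape(-1, features.shape[-1])
    return flat[idx].max(axis=1)  # shape (T*N, C)

# Toy usage: two frames holding the same three feature vectors in a different order.
frames = np.array([
    [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]],
    [[0.0, 10.0], [0.0, 0.0], [10.0, 0.0]],
])
proposals = propose_correspondences(frames, k=1)
pooled = aggregate_proposals(frames, proposals)
```

Using k greater than 1 means some proposals will be wrong matches; taking a max over proposals is one simple way an aggregation step could tolerate such errors, which is the kind of robustness the abstract mentions.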

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | Jester (Gesture Recognition) | Val | 96.7 | CPNet Res34, 5 CP |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 57.65 | CPNet Res34, 5 CP |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 83.95 | CPNet Res34, 5 CP |
| Action Recognition | Jester (Gesture Recognition) | Val | 96.7 | CPNet Res34, 5 CP |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 57.65 | CPNet Res34, 5 CP |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 83.95 | CPNet Res34, 5 CP |
| Action Recognition In Videos | Jester (Gesture Recognition) | Val | 96.7 | CPNet Res34, 5 CP |
| Action Recognition In Videos | Something-Something V2 | Top-1 Accuracy | 57.65 | CPNet Res34, 5 CP |
| Action Recognition In Videos | Something-Something V2 | Top-5 Accuracy | 83.95 | CPNet Res34, 5 CP |
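
For readers unfamiliar with the Top-1 and Top-5 accuracy metrics reported for Something-Something V2, here is a minimal sketch of how such numbers are typically computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions; this is not the evaluation code behind the reported values.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: (num_samples, num_classes) array of class scores or logits.
    labels: (num_samples,) array of integer class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the top k is irrelevant).
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy usage with three samples and three classes.
toy_scores = [[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Top-5 accuracy is always at least as high as Top-1 on the same predictions, since every Top-1 hit is also a Top-5 hit.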

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)