
Unsupervised Learning of Object Structure and Dynamics from Videos

Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

2019-06-19 | NeurIPS 2019 12 | Video Prediction | Continuous Control | Object Tracking | Action Recognition

Abstract

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
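The abstract does not spell out how keypoints are obtained or rendered, so the following is only a minimal sketch of one common way a keypoint-based image representation can work. It assumes each keypoint is read off a non-negative 2D heatmap as an intensity-weighted mean coordinate, and is turned back into a spatial map as a Gaussian blob. The function names, the normalized [-1, 1] coordinate range, and the `sigma` default are all illustrative assumptions, not details taken from the paper.

```python
import math


def heatmap_to_keypoint(heatmap):
    """Return (x, y, scale) for one non-negative 2D heatmap (list of rows).

    x and y are intensity-weighted mean coordinates in [-1, 1];
    scale is the mean activation. Assumes height and width are >= 2.
    Hypothetical helper for illustration only.
    """
    height, width = len(heatmap), len(heatmap[0])
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        return 0.0, 0.0, 0.0
    x = y = 0.0
    for i, row in enumerate(heatmap):
        for j, value in enumerate(row):
            # Map pixel indices to normalized coordinates in [-1, 1].
            xs = -1.0 + 2.0 * j / (width - 1)
            ys = -1.0 + 2.0 * i / (height - 1)
            x += value * xs
            y += value * ys
    return x / total, y / total, total / (height * width)


def keypoint_to_gaussian_map(x, y, scale, height, width, sigma=0.3):
    """Render one keypoint as a Gaussian blob on a height x width grid."""
    rows = []
    for i in range(height):
        ys = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            xs = -1.0 + 2.0 * j / (width - 1)
            d2 = (xs - x) ** 2 + (ys - y) ** 2
            row.append(scale * math.exp(-d2 / (2.0 * sigma ** 2)))
        rows.append(row)
    return rows
```

In a setup like this, a dynamics model would operate on the low-dimensional `(x, y, scale)` triples rather than on pixels, and a decoder would combine the rendered maps with features from a reference frame to reconstruct the image.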

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | KTH | Cond | 10 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | FVD | 395 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | LPIPS | 0.124 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | PSNR | 24.29 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | Params (M) | 2.3 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | Pred | 40 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | SSIM | 0.766 | Struct-VRNN (from Grid-keypoints) |
| Video | KTH | Train | 10 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | Cond | 10 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | FVD | 395 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | LPIPS | 0.124 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | PSNR | 24.29 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | Params (M) | 2.3 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | Pred | 40 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | SSIM | 0.766 | Struct-VRNN (from Grid-keypoints) |
| Video Prediction | KTH | Train | 10 | Struct-VRNN (from Grid-keypoints) |
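For readers unfamiliar with the PSNR metric listed in the results, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The page does not say how the reported value was computed (for example per frame or averaged over a sequence), so the function name, the flat-list input format, and the `max_value=1.0` default are assumptions for illustration.

```python
import math


def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Pixels are assumed to lie in [0, max_value]. Returns math.inf when the
    two inputs are identical (zero mean squared error).
    """
    if len(reference) != len(prediction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the predicted frame is closer to the ground truth in a mean-squared-error sense; it says nothing on its own about perceptual quality, which is why metrics such as SSIM, LPIPS, and FVD are reported alongside it.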

Related Papers

Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) (2025-07-17)
MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association (2025-07-16)
HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking (2025-07-07)
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions (2025-07-01)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)