
Recurrent Network Models for Human Dynamics

Katerina Fragkiadaki, Sergey Levine, Panna Felsen, Jitendra Malik

2015-08-02 · ICCV 2015
Tasks: Human Pose Forecasting · Representation Learning · Optical Flow Estimation · Human Dynamics
Paper · PDF

Abstract

We propose the Encoder-Recurrent-Decoder (ERD) model for recognition and prediction of human body pose in videos and motion capture. The ERD model is a recurrent neural network that incorporates nonlinear encoder and decoder networks before and after recurrent layers. We test instantiations of ERD architectures on the tasks of motion capture (mocap) generation, body pose labeling, and body pose forecasting in videos. Our model handles mocap training data across multiple subjects and activity domains, and synthesizes novel motions while avoiding drift over long periods of time. For human pose labeling, ERD outperforms a per-frame body part detector by resolving left-right body part confusions. For video pose forecasting, ERD predicts body joint displacements across a temporal horizon of 400 ms and outperforms a first-order motion model based on optical flow. ERDs extend previous Long Short-Term Memory (LSTM) models in the literature to jointly learn representations and their dynamics. Our experiments show that such representation learning is crucial for both labeling and prediction in space-time. We find this to be a distinguishing feature of the spatio-temporal visual domain in comparison to 1D text, speech, or handwriting, where straightforward hard-coded representations have shown excellent results when directly combined with recurrent units.
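
The architecture the abstract describes is straightforward to sketch. Below is a minimal PyTorch sketch, assuming two-layer feed-forward encoder/decoder networks, ReLU nonlinearities, a two-layer LSTM, and a next-step prediction objective; these are illustrative choices, not the paper's exact configuration.

import torch
import torch.nn as nn

class ERD(nn.Module):
    """Minimal Encoder-Recurrent-Decoder sketch (illustrative sizes)."""

    def __init__(self, pose_dim=54, enc_hidden=500, rnn_hidden=1000):
        super().__init__()
        # Nonlinear encoder maps each input pose vector to a feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, enc_hidden), nn.ReLU(),
            nn.Linear(enc_hidden, enc_hidden), nn.ReLU(),
        )
        # Recurrent layers model the temporal dynamics of the learned features.
        self.rnn = nn.LSTM(enc_hidden, rnn_hidden, num_layers=2, batch_first=True)
        # Nonlinear decoder maps recurrent states back to pose space.
        self.decoder = nn.Sequential(
            nn.Linear(rnn_hidden, enc_hidden), nn.ReLU(),
            nn.Linear(enc_hidden, pose_dim),
        )

    def forward(self, poses, state=None):
        # poses: (batch, time, pose_dim) sequence of body-pose vectors.
        feats = self.encoder(poses)
        out, state = self.rnn(feats, state)
        # Predict the pose at the next time step for every input frame.
        return self.decoder(out), state

# Usage: given frames t = 1..T, predict frames t = 2..T+1.
model = ERD()
x = torch.randn(8, 100, 54)   # a batch of mocap sequences (angles per frame)
pred, _ = model(x)            # (8, 100, 54) next-step pose predictions

Training the encoder/decoder jointly with the recurrent layers is the point of the design: the representation and its dynamics are learned together rather than hard-coding the input features, which the abstract identifies as crucial for labeling and forecasting.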

Results

Task                   | Dataset   | Metric                 | Value | Model
Pose Estimation        | Human3.6M | MAR, walking, 1,000 ms | 2.38  | ERD
Pose Estimation        | Human3.6M | MAR, walking, 400 ms   | 1.78  | ERD
3D                     | Human3.6M | MAR, walking, 1,000 ms | 2.38  | ERD
3D                     | Human3.6M | MAR, walking, 400 ms   | 1.78  | ERD
1 Image, 2*2 Stitching | Human3.6M | MAR, walking, 1,000 ms | 2.38  | ERD
1 Image, 2*2 Stitching | Human3.6M | MAR, walking, 400 ms   | 1.78  | ERD
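
The metric column reads "MAR" at two forecast horizons (400 ms and 1,000 ms). Assuming MAR abbreviates a mean angle error, i.e. the average distance between predicted and ground-truth joint-angle vectors at the stated horizon (an assumption about the benchmark's abbreviation, not a definition taken from the paper), a minimal sketch:

import torch

def mean_angle_error(pred, target):
    # pred, target: (batch, joint_angles) vectors at one forecast horizon,
    # e.g. the pose 400 ms ahead. The angle parameterization and averaging
    # protocol here are assumptions; the benchmark's exact MAR protocol
    # may differ.
    return torch.linalg.norm(pred - target, dim=-1).mean()

err = mean_angle_error(torch.randn(64, 54), torch.randn(64, 54))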

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)