Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-supervised Learning of Motion Capture

Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, Katerina Fragkiadaki

2017-12-04 · NeurIPS 2017

Tasks: 3D Human Pose Estimation · Weakly-supervised 3D Human Pose Estimation · Optical Flow Estimation · Self-Supervised Learning · 3D Human Reconstruction

Links: Paper · PDF · Code

Abstract

Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g., person segmentation, optical flow, keypoint detections). Optimization models are susceptible to local minima; this has been the bottleneck forcing the use of clean, green-screen-like backgrounds at capture time, manual initialization, or a switch to multiple cameras as the input source. In this work, we propose a learning-based motion capture model for single-camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. The model is trained end-to-end using a combination of strong supervision from synthetic data and self-supervision from differentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and (c) human-background segmentation. Empirically, our model combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the model parameters in the right regime, ensuring good pose and surface initialization at test time without manual effort, while self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data and offers a much tighter fit than a pretrained fixed model. We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.
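The skeletal-keypoint term of the self-supervision described above penalizes the mismatch between predicted 3D joints, re-projected into the image, and 2D keypoint detections. Below is a minimal numpy sketch of such a reprojection loss; the pinhole-camera parameters and function names are illustrative assumptions, not the paper's actual implementation (which back-propagates this kind of loss through a differentiable renderer to update network weights).

```python
import numpy as np

def project_points(joints_3d, focal=1000.0, center=(128.0, 128.0)):
    # Assumed pinhole projection: u = f * X / Z + cx, v = f * Y / Z + cy.
    # joints_3d: (J, 3) array of 3D joint positions in camera coordinates.
    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = focal * X / Z + center[0]
    v = focal * Y / Z + center[1]
    return np.stack([u, v], axis=1)  # (J, 2) pixel coordinates

def keypoint_reprojection_loss(joints_3d, keypoints_2d):
    # Mean squared distance between projected joints and 2D detections.
    proj = project_points(joints_3d)
    return float(np.mean(np.sum((proj - keypoints_2d) ** 2, axis=1)))
```

In the self-supervised setting, a loss like this (summed with the mesh-motion and segmentation terms) would be minimized with respect to the network weights that produced `joints_3d`, rather than with respect to the joints themselves.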

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Reconstruction | SURREAL | MPVPE | 74.5 | self-supervised mocap
3D Human Pose Estimation | SURREAL | MPJPE | 64.4 | self-supervised mocap
3D Human Pose Estimation | Human3.6M | Average MPJPE (mm) | 98.4 | self-supervised mocap
Pose Estimation | SURREAL | MPJPE | 64.4 | self-supervised mocap
Pose Estimation | Human3.6M | Average MPJPE (mm) | 98.4 | self-supervised mocap
3D | SURREAL | MPJPE | 64.4 | self-supervised mocap
3D | Human3.6M | Average MPJPE (mm) | 98.4 | self-supervised mocap
1 Image, 2*2 Stitchi | SURREAL | MPJPE | 64.4 | self-supervised mocap
1 Image, 2*2 Stitchi | Human3.6M | Average MPJPE (mm) | 98.4 | self-supervised mocap
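The MPJPE values in the table are mean per-joint position errors: the Euclidean distance between each predicted and ground-truth 3D joint, averaged over joints (and frames), in millimeters. A minimal numpy sketch of the metric, assuming inputs are `(J, 3)` joint arrays already in a common coordinate frame:

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error: average Euclidean distance between
    # corresponding joints, in the same units as the inputs (mm here).
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```

For example, if every predicted joint is offset from the ground truth by 3 mm in x and 4 mm in y, each joint's error is 5 mm and the MPJPE is 5.0.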

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (2025-07-07)
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model (2025-07-01)