Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Capturing and Inferring Dense Full-Body Human-Scene Contact

Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, Michael J. Black

Published 2022-06-20 · CVPR 2022
Tasks: Markerless Motion Capture · Dense Contact Estimation · Monocular 3D Human Pose Estimation · Contact Detection · 4K
Links: Paper · PDF · Code (official)

Abstract

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for "Real scenes, Interaction, Contact and Humans." RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.
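The abstract notes that RICH provides accurate vertex-level contact labels on the body. A common way to derive such labels is to threshold each body vertex's distance to the nearest scene point; the sketch below illustrates that idea only. The 5 cm threshold and the brute-force nearest-neighbor search are illustrative assumptions, not details taken from the RICH labeling pipeline.

```python
import math

def contact_labels(body_verts, scene_points, threshold=0.05):
    """Label each body vertex 1 if within `threshold` meters of any scene point.

    Assumption: distance-thresholded proximity stands in for the (more
    involved) procedure used to annotate RICH.
    """
    labels = []
    for v in body_verts:
        # Brute-force nearest scene point; a k-d tree would be used at scale.
        d = min(math.dist(v, s) for s in scene_points)
        labels.append(1 if d <= threshold else 0)
    return labels

# Toy data: 3 "vertices" above a ground plane sampled at 2 points.
body = [(0.0, 0.0, 0.01), (0.0, 0.0, 0.5), (1.0, 0.0, 0.03)]
ground = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(contact_labels(body, ground))  # [1, 0, 1]
```

In practice this runs over every vertex of the body mesh against a dense scene scan, so an accelerated nearest-neighbor structure replaces the inner loop.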

Results

Task                           Dataset  Metric     Value  Model
Human Interaction Recognition  MOW      F1-Score   0.112  BSTRO
Human Interaction Recognition  MOW      Precision  0.204  BSTRO
Human Interaction Recognition  MOW      Recall     0.126  BSTRO
Contact Detection              BEHAVE   Precision  0.615  BSTRO
Contact Detection              BEHAVE   Recall     0.527  BSTRO
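The metrics in the table are precision, recall, and F1 over binary contact labels. The sketch below shows how these are computed for one set of predictions; how the reported numbers are aggregated across frames or vertices (per-frame averaging vs. pooled counts) is not stated here, so this is illustrative only.

```python
def contact_prf(pred, gt):
    """Precision, recall, F1 for two equal-length binary label lists."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)        # predicted contact,真contact -> true positive
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)    # predicted contact, no contact
    fn = sum(1 for p, g in zip(pred, gt) if g and not p)    # missed contact
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 6 "vertices" with predicted vs. ground-truth contact.
pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 1, 1, 0]
p, r, f1 = contact_prf(pred, gt)
# p == r == f1 == 2/3 on this toy example
```

Note that an F1 averaged per frame generally differs from the F1 of pooled precision and recall, which is why the MOW row's F1 need not equal 2PR/(P+R) of the listed values.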

Related Papers

Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
4KAgent: Agentic Any Image to 4K Super-Resolution (2025-07-09)
AUTOMATIC ROOM LIGHT CONTROLLER MANAGEMENT SYSTEM (2025-06-25)
Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images (2025-06-24)
Fast Neural Inverse Kinematics on Human Body Motions (2025-06-22)
PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation (2025-06-17)
MAMMA: Markerless & Automatic Multi-Person Motion Action Capture (2025-06-16)
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions (2025-06-16)