Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments

Ege Özsoy, Chantal Pellegrini, Tobias Czempiel, Felix Tristram, Kun Yuan, David Bani-Harouni, Ulrich Eck, Benjamin Busam, Matthias Keicher, Nassir Navab

2025-03-04 | CVPR 2025
Tasks: Scene Graph Generation, Video Panoptic Segmentation, 2D Panoptic Segmentation, Graph Generation, Language Modelling

Abstract

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding and open the path towards multimodal scene analysis in complex, high-stakes environments. Our code and data are available at https://github.com/egeozsoy/MM-OR.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Scene Parsing | 4D-OR | F1 | 0.901 | MM2SG |
| Scene Parsing | MM-OR | Macro F1 | 0.529 | MM2SG |
| Semantic Segmentation | 4D-OR | VPQ | 69.8 | MM-OR-VPQ4 |
| Semantic Segmentation | 4D-OR | VPQ | 69.2 | MM-OR-VPQ8 |
| Semantic Segmentation | MM-OR | VPQ | 67 | MM-OR-VPQ4 |
| Semantic Segmentation | MM-OR | VPQ | 66.4 | MM-OR-VPQ8 |
| 2D Semantic Segmentation | 4D-OR | F1 | 0.901 | MM2SG |
| 2D Semantic Segmentation | MM-OR | Macro F1 | 0.529 | MM2SG |
| Scene Graph Generation | 4D-OR | F1 | 0.901 | MM2SG |
| Scene Graph Generation | MM-OR | Macro F1 | 0.529 | MM2SG |
| 10-shot image generation | 4D-OR | VPQ | 69.8 | MM-OR-VPQ4 |
| 10-shot image generation | 4D-OR | VPQ | 69.2 | MM-OR-VPQ8 |
| 10-shot image generation | MM-OR | VPQ | 67 | MM-OR-VPQ4 |
| 10-shot image generation | MM-OR | VPQ | 66.4 | MM-OR-VPQ8 |
| Panoptic Segmentation | 4D-OR | VPQ | 69.8 | MM-OR-VPQ4 |
| Panoptic Segmentation | 4D-OR | VPQ | 69.2 | MM-OR-VPQ8 |
| Panoptic Segmentation | MM-OR | VPQ | 67 | MM-OR-VPQ4 |
| Panoptic Segmentation | MM-OR | VPQ | 66.4 | MM-OR-VPQ8 |
| 2D Panoptic Segmentation | MM-OR | VPQ | 67.5 | MM-OR |
| 2D Panoptic Segmentation | 4D-OR | VPQ | 71.8 | MM-OR |
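
The scene graph rows above report F1 and Macro F1. As a rough illustration of what a macro-averaged F1 measures, the sketch below computes per-class F1 and then averages the classes with equal weight, so rare classes count as much as frequent ones. This is a generic definition, not the MM-OR evaluation script: the exact protocol used for the reported numbers is not stated on this page, and the label names (`holding`, `cutting`) and function names are hypothetical.

```python
from collections import defaultdict


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (0.0 for empty input)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical predicate labels, for illustration only.
y_true = ["holding", "holding", "holding", "cutting"]
y_pred = ["holding", "holding", "cutting", "cutting"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case `holding` has F1 = 0.8 and `cutting` has F1 = 2/3, so the macro average is their plain mean, regardless of how often each label occurs.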

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)