Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dynamic Scene Understanding from Vision-Language Representations

Shahaf Pruss, Morris Alper, Hadar Averbuch-Elor

2025-01-20 · Human-Object Interaction Detection · Grounded Situation Recognition · Scene Understanding · Human-Human Interaction Recognition · Human Interaction Recognition

Paper · PDF

Abstract

Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
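One of the two strategies the abstract describes is concatenating frozen V&L representations to the input of an existing task model. The sketch below illustrates that idea only in outline; the encoder, feature dimensions, and function names are hypothetical stand-ins (the paper's actual frozen V&L model and task heads are not specified here), with the frozen encoder stubbed as a fixed random projection so the example is self-contained.

```python
import numpy as np

# Hypothetical frozen V&L encoder: in the paper's setting this would be a
# large pretrained vision-language model whose weights are never updated.
# Here it is stubbed with a fixed random projection purely for illustration.
rng = np.random.default_rng(0)
FROZEN_W = rng.standard_normal((768, 512))  # frozen weights, never trained

def frozen_vl_features(image_feats: np.ndarray) -> np.ndarray:
    """Return a frozen V&L embedding for a 768-dim image feature vector."""
    return image_feats @ FROZEN_W  # no parameter updates along this pathway

def augmented_input(task_feats: np.ndarray, image_feats: np.ndarray) -> np.ndarray:
    """Concatenate the existing task model's features with frozen V&L
    features -- the 'directly concatenating representations to the input
    of existing models' strategy described in the abstract."""
    return np.concatenate([task_feats, frozen_vl_features(image_feats)])

# Stand-ins: a base detector's 256-dim features and a 768-dim image embedding.
task_feats = rng.standard_normal(256)
image_feats = rng.standard_normal(768)

x = augmented_input(task_feats, image_feats)
print(x.shape)  # (768,) = 256 task dims + 512 frozen V&L dims
```

Because only the small task head downstream of the concatenation is trained while the V&L encoder stays frozen, the number of trainable parameters remains minimal relative to end-to-end approaches, matching the abstract's claim.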

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Human-Object Interaction Detection | HICO-DET | mAP | 46.49 | Ours (PViC+) |
| Situation Recognition | imSitu | Top-1 Verb | 58.88 | Ours |
| Situation Recognition | SWiG | Top-1 Verb | 58.88 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 41.28 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-1 Verb & Value | 51.1 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 58.23 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb | 58.88 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 41.28 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Value | 51.1 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 58.23 | Ours (CoFormer+) |

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation (2025-07-15)
Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)