Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dynamic Scene Understanding from Vision-Language Representations

Shahaf Pruss, Morris Alper, Hadar Averbuch-Elor

2025-01-20 · Human-Object Interaction Detection · Grounded Situation Recognition · Scene Understanding · Human-Human Interaction Recognition · Human Interaction Recognition

Paper · PDF

Abstract

Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
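One of the two strategies the abstract describes is concatenating frozen V&L representations to the input of an existing task model. The sketch below illustrates that idea only in outline; the encoder, feature dimensions, and function names are hypothetical stand-ins (the paper's actual frozen V&L model and task heads are not specified here), with the frozen encoder stubbed as a fixed random projection so the example is self-contained.

```python
import numpy as np

# Hypothetical frozen V&L encoder: in the paper's setting this would be a
# large pretrained vision-language model whose weights are never updated.
# Here it is stubbed with a fixed random projection purely for illustration.
rng = np.random.default_rng(0)
FROZEN_W = rng.standard_normal((768, 512))  # frozen weights, never trained

def frozen_vl_features(image_feats: np.ndarray) -> np.ndarray:
    """Return a frozen V&L embedding for a 768-dim image feature vector."""
    return image_feats @ FROZEN_W  # no parameter updates along this pathway

def augmented_input(task_feats: np.ndarray, image_feats: np.ndarray) -> np.ndarray:
    """Concatenate the existing task model's features with frozen V&L
    features -- the 'directly concatenating representations to the input
    of existing models' strategy described in the abstract."""
    return np.concatenate([task_feats, frozen_vl_features(image_feats)])

# Stand-ins: a base detector's 256-dim features and a 768-dim image embedding.
task_feats = rng.standard_normal(256)
image_feats = rng.standard_normal(768)

x = augmented_input(task_feats, image_feats)
print(x.shape)  # (768,) = 256 task dims + 512 frozen V&L dims
```

Because only the small task head downstream of the concatenation is trained while the V&L encoder stays frozen, the number of trainable parameters remains minimal relative to end-to-end approaches, matching the abstract's claim.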

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Human-Object Interaction Detection | HICO-DET | mAP | 46.49 | Ours (PViC+) |
| Situation Recognition | imSitu | Top-1 Verb | 58.88 | Ours |
| Situation Recognition | SWiG | Top-1 Verb | 58.88 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 41.28 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-1 Verb & Value | 51.1 | Ours (CoFormer+) |
| Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 58.23 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb | 58.88 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 41.28 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Value | 51.1 | Ours (CoFormer+) |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 58.23 | Ours (CoFormer+) |

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation (2025-07-15)
Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)