Collaborative Transformers for Grounded Situation Recognition

Junhyeong Cho, Youngseok Yoon, Suha Kwak

2022-03-30, CVPR 2022

Tasks: Visual Grounding, Image Classification, Grounded Situation Recognition, Scene Understanding, Visual Reasoning, Object Detection

Abstract

Grounded situation recognition is the task of predicting the main activity, entities playing certain roles within the activity, and bounding-box groundings of the entities in the given image. To effectively deal with this challenging task, we introduce a novel approach where the two processes for activity classification and entity estimation are interactive and complementary. To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation. Glance transformer predicts the main activity with the help of Gaze transformer that analyzes entities and their relations, while Gaze transformer estimates the grounded entities by focusing only on the entities relevant to the activity predicted by Glance transformer. Our CoFormer achieves the state of the art in all evaluation metrics on the SWiG dataset. Training code and model weights are available at https://github.com/jhcho99/CoFormer.
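
The data flow described in the abstract can be sketched roughly as follows. This is a toy illustration, not the CoFormer implementation: the names `recognize_situation`, `glance`, `gaze_context`, `gaze_entity`, `GroundedEntity`, and the `FRAMES` lookup are all hypothetical, and the stand-in callables replace what are learned transformer modules in the actual model. The point is only the ordering: the activity is predicted using entity context, and entities are then estimated only for the roles tied to that activity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class GroundedEntity:
    role: str
    noun: str
    box: Optional[Box]  # None when the entity has no grounding


# Hypothetical verb -> roles lookup; real frames come from the dataset.
FRAMES: Dict[str, List[str]] = {
    "riding": ["agent", "vehicle", "place"],
    "eating": ["agent", "food", "place"],
}


def recognize_situation(
    image: Any,
    glance: Callable[[Any, Any], str],
    gaze_context: Callable[[Any], Any],
    gaze_entity: Callable[[Any, str, str], GroundedEntity],
    frames: Dict[str, List[str]] = FRAMES,
) -> Tuple[str, List[GroundedEntity]]:
    # Step 1: the Gaze side summarizes entities and their relations.
    context = gaze_context(image)
    # Step 2: the Glance side predicts the activity using that context.
    verb = glance(image, context)
    # Step 3: the Gaze side estimates entities only for the roles
    # that belong to the predicted activity.
    entities = [gaze_entity(image, verb, role) for role in frames[verb]]
    return verb, entities


# Toy stand-ins for the learned modules (assumptions for illustration).
verb, entities = recognize_situation(
    image="img.jpg",
    glance=lambda img, ctx: "riding",
    gaze_context=lambda img: {"hint": "person on horse"},
    gaze_entity=lambda img, v, r: GroundedEntity(r, f"{r}-noun", None),
)
```

Only the three roles of the predicted verb are queried here, which mirrors the abstract's statement that entity estimation focuses on entities relevant to the predicted activity.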

Results

Task	Dataset	Metric	Value	Model
Situation Recognition	imSitu	Top-1 Verb	44.66	CoFormer
Situation Recognition	imSitu	Top-1 Verb & Value	35.98	CoFormer
Situation Recognition	imSitu	Top-5 Verbs	73.31	CoFormer
Situation Recognition	imSitu	Top-5 Verbs & Value	57.76	CoFormer
Situation Recognition	SWiG	Top-1 Verb	44.66	CoFormer
Situation Recognition	SWiG	Top-1 Verb & Grounded-Value	29.05	CoFormer
Situation Recognition	SWiG	Top-1 Verb & Value	35.98	CoFormer
Situation Recognition	SWiG	Top-5 Verbs	73.31	CoFormer
Situation Recognition	SWiG	Top-5 Verbs & Grounded-Value	46.25	CoFormer
Situation Recognition	SWiG	Top-5 Verbs & Value	57.76	CoFormer
Grounded Situation Recognition	SWiG	Top-1 Verb	44.66	CoFormer
Grounded Situation Recognition	SWiG	Top-1 Verb & Grounded-Value	29.05	CoFormer
Grounded Situation Recognition	SWiG	Top-1 Verb & Value	35.98	CoFormer
Grounded Situation Recognition	SWiG	Top-5 Verbs	73.31	CoFormer
Grounded Situation Recognition	SWiG	Top-5 Verbs & Grounded-Value	46.25	CoFormer
Grounded Situation Recognition	SWiG	Top-5 Verbs & Value	57.76	CoFormer
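
The Grounded-Value rows add a localization requirement on top of the Value rows. The page does not define the criterion, so the sketch below rests on assumptions: that a grounding counts as correct when the predicted box overlaps the ground-truth box with IoU of at least 0.5, that an entity without a ground-truth box requires a "no box" prediction, and that verb and noun must both already be correct. The function names `iou` and `grounded_value_correct` are hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_value_correct(
    verb_ok: bool,
    noun_ok: bool,
    pred_box: Optional[Box],
    gt_box: Optional[Box],
    threshold: float = 0.5,  # assumed threshold, not stated on this page
) -> bool:
    # A grounded value only counts when verb and noun are already right.
    if not (verb_ok and noun_ok):
        return False
    # Assumed handling of ungrounded entities: box existence must match.
    if gt_box is None:
        return pred_box is None
    if pred_box is None:
        return False
    return iou(pred_box, gt_box) >= threshold
```

Under these assumptions, a Grounded-Value score can never exceed the matching Value score, which is consistent with the table (29.05 vs. 35.98 for Top-1, 46.25 vs. 57.76 for Top-5).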
