Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

Debaditya Roy, Dhruv Verma, Basura Fernando

2023-07-02 · IEEE WACV 2024 · Grounded Situation Recognition
Paper · PDF · Code (official)

Abstract

Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. We therefore leverage the CLIP foundation model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task using CLIP image and text embedding features, even outperforming the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer using CLIP visual tokens that models the relation between textual roles and visual entities. Our cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy on the imSitu dataset. Similarly, ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available.
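The core mechanism the abstract describes — textual role embeddings attending over CLIP visual tokens — can be sketched as single-head scaled dot-product cross-attention. This is a minimal illustrative sketch, not the paper's implementation: the function name, shapes, and dimensions are assumptions (CLIP ViT-B/32-style 512-d embeddings, 50 visual tokens), and the actual ClipSitu XTF uses multi-head attention inside a full Transformer.

```python
import numpy as np

def cross_attention(role_queries, visual_tokens):
    """Single-head scaled dot-product cross-attention (illustrative).

    role_queries : (num_roles, d) text-side role embeddings (queries)
    visual_tokens: (num_tokens, d) CLIP visual tokens (keys and values)
    Returns (num_roles, d) role representations contextualized by the image.
    """
    d = role_queries.shape[-1]
    # Similarity of each role query to each visual token, scaled by sqrt(d)
    scores = role_queries @ visual_tokens.T / np.sqrt(d)
    # Row-wise softmax over visual tokens (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each role representation is an attention-weighted sum of visual tokens
    return weights @ visual_tokens

# Hypothetical shapes: 6 semantic roles, 50 CLIP visual tokens of dim 512
rng = np.random.default_rng(0)
roles = rng.standard_normal((6, 512))
tokens = rng.standard_normal((50, 512))
out = cross_attention(roles, tokens)
print(out.shape)  # (6, 512)
```

Each output row mixes visual-token values according to how strongly that role's text embedding matches each image region, which is how the model can ground a role like "agent" or "tool" in a specific visual entity.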

Results

| Task                           | Dataset | Metric                      | Value | Model    |
|--------------------------------|---------|-----------------------------|-------|----------|
| Situation Recognition          | imSitu  | Top-1 Verb                  | 47.23 | ClipSitu |
| Situation Recognition          | imSitu  | Top-1 Verb & Value          | 29.73 | ClipSitu |
| Situation Recognition          | imSitu  | Top-5 Verbs                 | 85.69 | ClipSitu |
| Situation Recognition          | imSitu  | Top-5 Verbs & Value         | 68.42 | ClipSitu |
| Situation Recognition          | SWiG    | Top-1 Verb                  | 58.19 | ClipSitu |
| Situation Recognition          | SWiG    | Top-1 Verb & Grounded-Value | 40.01 | ClipSitu |
| Situation Recognition          | SWiG    | Top-1 Verb & Value          | 47.23 | ClipSitu |
| Situation Recognition          | SWiG    | Top-5 Verbs                 | 85.69 | ClipSitu |
| Situation Recognition          | SWiG    | Top-5 Verbs & Grounded-Value| 49.78 | ClipSitu |
| Situation Recognition          | SWiG    | Top-5 Verbs & Value         | 68.42 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-1 Verb                  | 58.19 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-1 Verb & Grounded-Value | 40.01 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-1 Verb & Value          | 47.23 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-5 Verbs                 | 85.69 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-5 Verbs & Grounded-Value| 49.78 | ClipSitu |
| Grounded Situation Recognition | SWiG    | Top-5 Verbs & Value         | 68.42 | ClipSitu |

Related Papers

- Dynamic Scene Understanding from Vision-Language Representations (2025-01-20)
- Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer (2024-04-24)
- Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments (2023-07-15)
- GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement (2022-08-18)
- Collaborative Transformers for Grounded Situation Recognition (2022-03-30)
- Rethinking the Two-Stage Framework for Grounded Situation Recognition (2021-12-10)
- Grounded Situation Recognition with Transformers (2021-11-19)
- Attention-Based Context Aware Reasoning for Situation Recognition (2020-06-01)