Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Grounded Situation Recognition

Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi

Date: 2020-03-26
Venue: ECCV 2020 8
Tasks: Grounded Situation Recognition, Retrieval, Image Retrieval

Abstract

We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task, we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic-aware image retrieval. Code and data are available at https://prior.allenai.org/projects/gsr.
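
As an illustration of the structured output the abstract describes (a primary activity, role-labelled entities, and a bounding box per entity), the following is a minimal sketch. The class names, the `(x1, y1, x2, y2)` box convention, and the example frame are assumptions made for illustration; they are not taken from the SWiG release or the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    noun: str                  # entity class that fills the role
    box: Optional[Box] = None  # None when the entity has no grounding


@dataclass
class GroundedSituation:
    verb: str                  # primary activity in the image
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose entity has a bounding box."""
        return sorted(r for r, e in self.roles.items() if e.box is not None)


# Hypothetical frame, invented for illustration only.
example = GroundedSituation(
    verb="cutting",
    roles={
        "agent": GroundedEntity("person", (40.0, 20.0, 210.0, 380.0)),
        "tool": GroundedEntity("knife", (150.0, 200.0, 230.0, 260.0)),
        "place": GroundedEntity("kitchen"),  # role filled, but not grounded
    },
)

print(example.verb, example.grounded_roles())
```

Keeping the box optional reflects that a role can be filled by an entity class without a grounding being available for it.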

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Situation Recognition | imSitu | Top-1 Verb | 39.94 | JSL |
| Situation Recognition | imSitu | Top-1 Verb & Value | 31.44 | JSL |
| Situation Recognition | imSitu | Top-5 Verbs | 67.6 | JSL |
| Situation Recognition | imSitu | Top-5 Verbs & Value | 51.88 | JSL |
| Situation Recognition | imSitu | Top-1 Verb | 39.36 | ISL |
| Situation Recognition | imSitu | Top-1 Verb & Value | 30.09 | ISL |
| Situation Recognition | imSitu | Top-5 Verbs | 65.51 | ISL |
| Situation Recognition | imSitu | Top-5 Verbs & Value | 50.16 | ISL |
| Situation Recognition | SWiG | Top-1 Verb | 39.94 | JSL |
| Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 24.86 | JSL |
| Situation Recognition | SWiG | Top-1 Verb & Value | 31.44 | JSL |
| Situation Recognition | SWiG | Top-5 Verbs | 67.6 | JSL |
| Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 40.6 | JSL |
| Situation Recognition | SWiG | Top-5 Verbs & Value | 51.88 | JSL |
| Situation Recognition | SWiG | Top-1 Verb | 39.36 | ISL |
| Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 22.73 | ISL |
| Situation Recognition | SWiG | Top-1 Verb & Value | 30.09 | ISL |
| Situation Recognition | SWiG | Top-5 Verbs | 65.51 | ISL |
| Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 36.6 | ISL |
| Situation Recognition | SWiG | Top-5 Verbs & Value | 50.16 | ISL |
| Grounded Situation Recognition | SWiG | Top-1 Verb | 39.94 | JSL |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 24.86 | JSL |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Value | 31.44 | JSL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs | 67.6 | JSL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 40.6 | JSL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Value | 51.88 | JSL |
| Grounded Situation Recognition | SWiG | Top-1 Verb | 39.36 | ISL |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Grounded-Value | 22.73 | ISL |
| Grounded Situation Recognition | SWiG | Top-1 Verb & Value | 30.09 | ISL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs | 65.51 | ISL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Grounded-Value | 36.6 | ISL |
| Grounded Situation Recognition | SWiG | Top-5 Verbs & Value | 50.16 | ISL |
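
The abstract reports relative gains of joint over independent training on the grounding metrics. As a rough check, the sketch below computes that relative gain for the two grounded metrics listed above. It assumes that JSL is the Joint Situation Localizer and that ISL is the independently trained baseline; the listing does not expand ISL, so that reading is an assumption. The function name `relative_gain` is hypothetical. Only a subset of the paper's grounding metric suite appears in this listing, so the sketch covers just those two rows.

```python
def relative_gain(joint: float, independent: float) -> float:
    """Relative improvement of the joint score over the independent one, in percent."""
    return 100.0 * (joint - independent) / independent


# (JSL, ISL) values copied from the SWiG rows of the results listing.
swig_grounded = {
    "Top-1 Verb & Grounded-Value": (24.86, 22.73),
    "Top-5 Verbs & Grounded-Value": (40.6, 36.6),
}

gains = {
    metric: round(relative_gain(jsl, isl), 1)
    for metric, (jsl, isl) in swig_grounded.items()
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain}% relative")
```

Both values come out at roughly 9% and 11%, which falls inside the 8% to 32% range stated in the abstract.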
