GroundCap

Modalities: Images, Texts
Introduced: 2025-02-19

GroundCap is a novel grounded image captioning dataset derived from MovieNet, containing 52,350 movie frames with detailed grounded captions. The dataset uniquely features an ID-based system that maintains object identity throughout captions, enables tracking of object interactions, and grounds not only objects but also actions and locations in the scene.

Data Instances

Each sample in the dataset contains:

  • An image (movie frame)
  • Object detections, each with:
    • A unique object ID
    • A class label
    • A confidence score
    • Bounding box coordinates
  • A grounded caption with three types of grounding tags:
    • <gdo> for grounding objects (e.g., "the person", "a car")
    • <gda> for grounding actions (e.g., "running", "sitting")
    • <gdl> for grounding locations (e.g., "on the bridge", "in the kitchen")
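To make the structure concrete, here is a minimal sketch of what one sample might look like when loaded into Python. The field names follow this card's Data Fields section; the concrete values (frame, boxes, scores, caption text) are invented for illustration.

```python
# Illustrative sketch of a single GroundCap sample.
# Field names follow the dataset card; all values below are made up.
sample = {
    "id": "example-0",
    "image": None,  # the movie frame (an image object in the real dataset)
    "detections": [
        # Object IDs start at 0 for each class, so there can be both a
        # person-0 and a bridge-0 in the same frame.
        {"id": 0, "label": "person", "score": 0.97, "box": [120, 40, 80, 210]},
        {"id": 1, "label": "person", "score": 0.91, "box": [260, 55, 75, 200]},
        {"id": 0, "label": "bridge", "score": 0.88, "box": [0, 180, 640, 140]},
    ],
    "caption": '<gdo class="person" person-0 person-1>Two people</gdo> stand '
               '<gdl class="bridge" bridge-0>on the bridge</gdl>.',
    "human_annotated": False,  # automatically generated, not human-refined
}
```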

Data Fields

  • id: Unique identifier for each caption
  • image: The movie frame being captioned
  • detections: List of detected objects containing:
    • id: Object's unique identifier (integer starting at 0 for each class)
    • label: Object class label
    • score: Detection confidence score
    • box: Bounding box coordinates (x, y, w, h)
  • caption: Grounded caption text with HTML tags
    • <gdo> tags ground object references to detections using {class}-{id} as attribute (e.g., <gdo class="person" person-0>the man</gdo>)
    • <gda> tags ground actions to objects using {class}-{id} as attribute (e.g., <gda class="run" person-0>running</gda>)
    • <gdl> tags ground locations to objects using {class}-{id} as attribute (e.g., <gdl class="couch" couch-0>on the couch</gdl>)
  • human_annotated: Boolean indicating whether the caption was automatically generated (False) or human-refined (True)

Multiple objects can be referenced in a single tag. For instance, <gdo class="person" person-0 person-1>the two people</gdo> refers to the two detections with class label "person" and IDs 0 and 1.
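Because the grounding tags carry bare object references (e.g., person-0) rather than standard key="value" attributes, a plain regular expression is a simple way to pull them out. The sketch below is a hypothetical helper, not part of any official GroundCap tooling; it assumes tags always follow the class="…" ref ref … pattern shown above.

```python
import re

# Matches <gdo|gda|gdl class="...">...</...> tags whose extra attributes are
# bare object references such as person-0 or couch-1.
TAG_RE = re.compile(r'<(gdo|gda|gdl)\s+class="([^"]+)"\s+([^>]+)>(.*?)</\1>')

def parse_grounding_tags(caption):
    """Return (tag, class, [object refs], grounded text) tuples from a caption."""
    return [
        (tag, cls, refs.split(), text)
        for tag, cls, refs, text in TAG_RE.findall(caption)
    ]

caption = (
    '<gdo class="person" person-0 person-1>The two people</gdo> are '
    '<gda class="run" person-0 person-1>running</gda> '
    '<gdl class="bridge" bridge-0>on the bridge</gdl>.'
)
for tag, cls, refs, text in parse_grounding_tags(caption):
    print(tag, cls, refs, text)
```

Resolving a reference like person-0 back to its bounding box is then a lookup into the sample's detections list by class label and per-class ID.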