GroundCap

Modalities: Images, Texts
Introduced: 2025-02-19

GroundCap is a novel grounded image captioning dataset derived from MovieNet, containing 52,350 movie frames with detailed grounded captions. The dataset uniquely features an ID-based system that maintains object identity throughout captions, enables tracking of object interactions, and grounds not only objects but also actions and locations in the scene.

Data Instances

Each sample in the dataset contains:

  • An image (movie frame)
  • Object detections, each with:
    • A unique object ID
    • A class label
    • A confidence score
    • Bounding box coordinates
  • A grounded caption with three types of grounding tags:
    • <gdo> for grounding objects (e.g., "the person", "a car")
    • <gda> for grounding actions (e.g., "running", "sitting")
    • <gdl> for grounding locations (e.g., "on the bridge", "in the kitchen")
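To make the structure concrete, here is a minimal sketch of what one sample might look like when loaded into Python. The field names follow this card's Data Fields section; the concrete values (frame, boxes, scores, caption text) are invented for illustration.

```python
# Illustrative sketch of a single GroundCap sample.
# Field names follow the dataset card; all values below are made up.
sample = {
    "id": "example-0",
    "image": None,  # the movie frame (an image object in the real dataset)
    "detections": [
        # Object IDs start at 0 for each class, so there can be both a
        # person-0 and a bridge-0 in the same frame.
        {"id": 0, "label": "person", "score": 0.97, "box": [120, 40, 80, 210]},
        {"id": 1, "label": "person", "score": 0.91, "box": [260, 55, 75, 200]},
        {"id": 0, "label": "bridge", "score": 0.88, "box": [0, 180, 640, 140]},
    ],
    "caption": '<gdo class="person" person-0 person-1>Two people</gdo> stand '
               '<gdl class="bridge" bridge-0>on the bridge</gdl>.',
    "human_annotated": False,  # automatically generated, not human-refined
}
```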

Data Fields

  • id: Unique identifier for each caption
  • image: The movie frame being captioned
  • detections: List of detected objects containing:
    • id: Object's unique identifier (integer starting at 0 for each class)
    • label: Object class label
    • score: Detection confidence score
    • box: Bounding box coordinates (x, y, w, h)
  • caption: Grounded caption text with HTML tags
    • <gdo> tags ground object references to detections using {class}-{id} as attribute (e.g., <gdo class="person" person-0>the man</gdo>)
    • <gda> tags ground actions to objects using {class}-{id} as attribute (e.g., <gda class="run" person-0>running</gda>)
    • <gdl> tags ground locations to objects using {class}-{id} as attribute (e.g., <gdl class="couch" couch-0>on the couch</gdl>)
  • human_annotated: Boolean indicating whether the caption was automatically generated (False) or human-refined (True)

Multiple objects can be referenced in a single tag. For instance, <gdo class="person" person-0 person-1>the two people</gdo> refers to the two detections with class label "person" and IDs 0 and 1.
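Because the grounding tags carry bare object references (e.g., person-0) rather than standard key="value" attributes, a plain regular expression is a simple way to pull them out. The sketch below is a hypothetical helper, not part of any official GroundCap tooling; it assumes tags always follow the class="…" ref ref … pattern shown above.

```python
import re

# Matches <gdo|gda|gdl class="...">...</...> tags whose extra attributes are
# bare object references such as person-0 or couch-1.
TAG_RE = re.compile(r'<(gdo|gda|gdl)\s+class="([^"]+)"\s+([^>]+)>(.*?)</\1>')

def parse_grounding_tags(caption):
    """Return (tag, class, [object refs], grounded text) tuples from a caption."""
    return [
        (tag, cls, refs.split(), text)
        for tag, cls, refs, text in TAG_RE.findall(caption)
    ]

caption = (
    '<gdo class="person" person-0 person-1>The two people</gdo> are '
    '<gda class="run" person-0 person-1>running</gda> '
    '<gdl class="bridge" bridge-0>on the bridge</gdl>.'
)
for tag, cls, refs, text in parse_grounding_tags(caption):
    print(tag, cls, refs, text)
```

Resolving a reference like person-0 back to its bounding box is then a lookup into the sample's detections list by class label and per-class ID.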