CV-Bench

Cambrian Vision-Centric Benchmark

Modalities: Images, Texts | License: Apache-2.0 | Introduced: 2024-06-24

The Cambrian Vision-Centric Benchmark (CV-Bench) is designed to address the limitations of existing vision-centric benchmarks by providing a comprehensive evaluation framework for multimodal large language models (MLLMs). With 2,638 manually inspected examples, CV-Bench is substantially larger than other vision-centric MLLM benchmarks: it has 3.5 times as many examples as RealWorldQA and 8.8 times as many as MMVP.

Motivation and Content Summary:

CV-Bench repurposes standard vision benchmarks such as ADE20K, COCO, and Omni3D to assess models on classic vision tasks within a multimodal context. Natural language questions are formulated from the rich ground-truth annotations of these benchmarks to probe models' fundamental 2D and 3D understanding.

Potential Use Cases:

  • Evaluating the spatial relationship and object counting capabilities of models (2D understanding).
  • Assessing the depth order and relative distance understanding of models (3D understanding).
  • Benchmarking the performance of multimodal models in both vision-specific and cross-modal tasks.
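For the use cases above, results are usually easier to interpret when reported per task. The following is a minimal sketch of such a breakdown, not an official evaluation script: the record keys (`task`, `prediction`, `answer`) and the case-insensitive exact-match rule are assumptions made for illustration.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for records with 'task', 'prediction', 'answer' keys.

    The key names and the exact-match comparison are hypothetical choices,
    not part of the benchmark definition.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        # Compare answers after trimming whitespace and ignoring case.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only; these are not real benchmark results.
records = [
    {"task": "Depth Order", "prediction": "A", "answer": "A"},
    {"task": "Depth Order", "prediction": "B", "answer": "A"},
    {"task": "Object Count", "prediction": "c", "answer": "C"},
]
print(per_task_accuracy(records))  # {'Depth Order': 0.5, 'Object Count': 1.0}
```

Grouping by task keeps a weakness in one skill (for example, depth ordering) from being hidden by strength in another.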

Dataset Characteristics:

  • 2D Understanding Tasks:

    • Spatial Relationship: Determine the relative position of an object with respect to the anchor object, considering left-right or top-bottom relationships.
    • Object Count: Determine the number of instances present in the image.
  • 3D Understanding Tasks:

    • Depth Order: Determine which of the two distinct objects is closer to the camera.
    • Relative Distance: Determine which of the two distinct objects is closer to the anchor object.

| Type | Task                 | Description                                                                 | Sources      | # Samples |
|------|----------------------|-----------------------------------------------------------------------------|--------------|-----------|
| 2D   | Spatial Relationship | Determine the relative position of an object w.r.t. the anchor object.      | ADE20K, COCO | 650       |
| 2D   | Object Count         | Determine the number of instances present in the image.                     | ADE20K, COCO | 788       |
| 3D   | Depth Order          | Determine which of the two distinct objects is closer to the camera.        | Omni3D       | 600       |
| 3D   | Relative Distance    | Determine which of the two distinct objects is closer to the anchor object. | Omni3D       | 600       |
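As a quick sanity check, the per-task sample counts in the table can be tallied to confirm that they add up to the 2,638 examples quoted in the overview. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# Sample counts copied from the table above, keyed by (type, task).
SAMPLES = {
    ("2D", "Spatial Relationship"): 650,
    ("2D", "Object Count"): 788,
    ("3D", "Depth Order"): 600,
    ("3D", "Relative Distance"): 600,
}

# Sum the counts for each type (2D vs. 3D).
totals_by_type = {}
for (kind, _task), count in SAMPLES.items():
    totals_by_type[kind] = totals_by_type.get(kind, 0) + count

total = sum(SAMPLES.values())

print(totals_by_type)  # {'2D': 1438, '3D': 1200}
print(total)           # 2638
```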

Curation Process:

Questions for each task are programmatically constructed and then manually inspected to ensure clarity and accuracy. Any unclear, ambiguous, or erroneous questions are removed to maintain the benchmark's reliability.
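To make "programmatically constructed" concrete, here is a hedged sketch of how a left/right spatial-relationship question could be derived from two annotated bounding boxes. It is not the authors' pipeline: the `Box` class, the `spatial_question` helper, the question wording, and the overlap filter are all hypothetical, and image coordinates are assumed to grow rightward along x.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self):
        return (self.x_min + self.x_max) / 2


def spatial_question(target, anchor):
    """Build a left/right question, or return None if the layout is ambiguous."""
    # Skip pairs whose horizontal extents overlap: the answer would be unclear.
    if target.x_min < anchor.x_max and anchor.x_min < target.x_max:
        return None
    answer = "left" if target.center_x < anchor.center_x else "right"
    question = f"Is the {target.label} to the left or right of the {anchor.label}?"
    return {"question": question, "choices": ["left", "right"], "answer": answer}


dog = Box("dog", 10, 50, 60, 120)
chair = Box("chair", 200, 40, 280, 150)
print(spatial_question(dog, chair)["answer"])  # left
```

An automatic filter like the overlap check can only discard the most obvious ambiguities, which is why a manual inspection pass of the kind described above is still needed.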