CV-Bench
Cambrian Vision-Centric Benchmark
The Cambrian Vision-Centric Benchmark (CV-Bench) is designed to address the limitations of existing vision-centric benchmarks by providing a comprehensive evaluation framework for multimodal large language models (MLLMs). With 2,638 manually inspected examples, CV-Bench is considerably larger than other vision-centric MLLM benchmarks: it has 3.5 times as many examples as RealWorldQA and 8.8 times as many as MMVP.
Motivation and Content Summary:
CV-Bench repurposes standard vision benchmarks such as ADE20K, COCO, and Omni3D to assess models on classic vision tasks within a multimodal context. Leveraging the rich ground truth annotations from these benchmarks, natural language questions are formulated to probe the fundamental 2D and 3D understanding of models.
Potential Use Cases:
- Evaluating the spatial relationship and object counting capabilities of models (2D understanding).
- Assessing the depth order and relative distance understanding of models (3D understanding).
- Benchmarking the performance of multimodal models in both vision-specific and cross-modal tasks.
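As a rough illustration of the first two use cases, the sketch below scores model predictions separately for each task. The record fields (`task`, `prediction`, `answer`) and the toy data are assumptions made for this example, not the benchmark's actual schema or an official scoring script.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for records with 'task', 'prediction', 'answer' keys.

    The field names are hypothetical; adapt them to the real data layout.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy predictions, invented for illustration only.
records = [
    {"task": "Depth Order", "prediction": "chair", "answer": "chair"},
    {"task": "Depth Order", "prediction": "lamp", "answer": "table"},
    {"task": "Object Count", "prediction": "3", "answer": "3"},
]

accuracy = per_task_accuracy(records)
print(accuracy)
```

Reporting accuracy per task rather than as a single pooled number keeps 2D and 3D weaknesses visible separately.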
Dataset Characteristics:
- 2D Understanding Tasks:
  - Spatial Relationship: Determine the relative position of an object with respect to the anchor object, considering left-right or top-bottom relationships.
  - Object Count: Determine the number of instances present in the image.
- 3D Understanding Tasks:
  - Depth Order: Determine which of the two distinct objects is closer to the camera.
  - Relative Distance: Determine which of the two distinct objects is closer to the anchor object.
| Type | Task | Description | Sources | # Samples |
|------|------|-------------|---------|-----------|
| 2D | Spatial Relationship | Determine the relative position of an object w.r.t. the anchor object. | ADE20K, COCO | 650 |
| 2D | Object Count | Determine the number of instances present in the image. | ADE20K, COCO | 788 |
| 3D | Depth Order | Determine which of the two distinct objects is closer to the camera. | Omni3D | 600 |
| 3D | Relative Distance | Determine which of the two distinct objects is closer to the anchor object. | Omni3D | 600 |
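The per-task counts in the table can be tallied to confirm that they add up to the 2,638 examples quoted in the introduction. The short sketch below does this; the tuple layout and the helper name `samples_by_type` are choices made for this example only.

```python
# (type, task, sources, sample count) rows copied from the table above.
TASKS = [
    ("2D", "Spatial Relationship", ("ADE20K", "COCO"), 650),
    ("2D", "Object Count", ("ADE20K", "COCO"), 788),
    ("3D", "Depth Order", ("Omni3D",), 600),
    ("3D", "Relative Distance", ("Omni3D",), 600),
]


def samples_by_type(tasks):
    """Sum the sample counts for each task type (2D or 3D)."""
    totals = {}
    for kind, _task, _sources, count in tasks:
        totals[kind] = totals.get(kind, 0) + count
    return totals


totals = samples_by_type(TASKS)
total_samples = sum(totals.values())
print(totals)         # {'2D': 1438, '3D': 1200}
print(total_samples)  # 2638
```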
Curation Process:
Questions for each task are programmatically constructed and then manually inspected to ensure clarity and accuracy. Any unclear, ambiguous, or erroneous questions are removed to maintain the benchmark's reliability.
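To make the idea of programmatic construction concrete, here is a minimal sketch of how a left-right Spatial Relationship question might be derived from bounding-box annotations. The `Box` class, the `spatial_question` helper, the question wording, and the `min_gap` threshold are all hypothetical; the source does not describe the actual generation code, and the automatic skip shown here is not a substitute for the manual inspection step.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (x grows to the right)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2


def spatial_question(target: Box, anchor: Box, min_gap: float = 10.0):
    """Build a left/right question about target relative to anchor.

    Returns None when the horizontal centers are closer than min_gap pixels,
    since such a pair would likely be ambiguous (threshold is an assumption).
    """
    gap = target.center_x - anchor.center_x
    if abs(gap) < min_gap:
        return None
    return {
        "question": f"Is the {target.label} to the left or right of the {anchor.label}?",
        "choices": ["left", "right"],
        "answer": "right" if gap > 0 else "left",
    }


# Toy annotations, invented for illustration only.
chair = Box("chair", 20, 50, 80, 120)
table = Box("table", 200, 40, 320, 130)
lamp = Box("lamp", 45, 0, 60, 30)

print(spatial_question(chair, table))  # answer is "left"
print(spatial_question(lamp, chair))   # None: centers are only 2.5 px apart
```

A top-bottom variant would follow the same pattern using the vertical centers, and questions that pass such automatic checks would still go through manual review.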