
The Cambrian Vision-Centric Benchmark (CV-Bench) addresses the limitations of existing vision-centric benchmarks by providing a comprehensive evaluation framework for multimodal large language models (MLLMs). With 2,638 manually inspected examples, CV-Bench is substantially larger than other vision-centric MLLM benchmarks: it has 3.5 times as many examples as RealWorldQA and 8.8 times as many as MMVP.

Motivation and Content Summary:

CV-Bench repurposes standard vision benchmarks such as ADE20K, COCO, and Omni3D to assess models on classic vision tasks within a multimodal context. Leveraging the rich ground truth annotations from these benchmarks, natural language questions are formulated to probe the fundamental 2D and 3D understanding of models.

Potential Use Cases:

- Evaluating the spatial relationship and object counting capabilities of models (2D understanding).
- Assessing the depth order and relative distance understanding of models (3D understanding).
- Benchmarking the performance of multimodal models in both vision-specific and cross-modal tasks.

Dataset Characteristics:

2D Understanding Tasks:
- Spatial Relationship: Determine the relative position of an object with respect to the anchor object, considering left-right or top-bottom relationships.
- Object Count: Determine the number of instances present in the image.

3D Understanding Tasks:
- Depth Order: Determine which of the two distinct objects is closer to the camera.
- Relative Distance: Determine which of the two distinct objects is closer to the anchor object.
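
The four task definitions above can be read as simple functions of the ground-truth annotations. The following is a minimal sketch under stated assumptions, not the benchmark's actual code: the `Box2D` class, the function names, the use of box centers for 2D position, and the use of scalar depths and 3D points for the Omni3D-style tasks are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Box2D:
    """Hypothetical 2D annotation: a category label plus a pixel bounding box."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def spatial_relationship(target, anchor, axis="horizontal"):
    """Position of `target` relative to `anchor`, compared by box centers."""
    tx, ty = target.center
    ax, ay = anchor.center
    if axis == "horizontal":
        return "left" if tx < ax else "right"
    # Image coordinates: y grows downward, so a smaller y is higher up.
    return "above" if ty < ay else "below"


def object_count(boxes, label):
    """Number of annotated instances carrying the given category label."""
    return sum(1 for box in boxes if box.label == label)


def depth_order(name_a, depth_a, name_b, depth_b):
    """Name of the object closer to the camera (smaller depth). Ties go to b."""
    return name_a if depth_a < depth_b else name_b


def relative_distance(anchor_xyz, obj_a, obj_b):
    """Name of the object closer to the anchor; each object is (name, xyz). Ties go to b."""
    (name_a, xyz_a), (name_b, xyz_b) = obj_a, obj_b
    dist_a = math.dist(anchor_xyz, xyz_a)
    dist_b = math.dist(anchor_xyz, xyz_b)
    return name_a if dist_a < dist_b else name_b
```

Ties are resolved arbitrarily in this sketch; a real pipeline would more likely discard such cases as ambiguous.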

| Type | Task                 | Description                                                                 | Sources      | # Samples |
|------|----------------------|-----------------------------------------------------------------------------|--------------|-----------|
| 2D   | Spatial Relationship | Determine the relative position of an object w.r.t. the anchor object.      | ADE20K, COCO | 650       |
| 2D   | Object Count         | Determine the number of instances present in the image.                     | ADE20K, COCO | 788       |
| 3D   | Depth Order          | Determine which of the two distinct objects is closer to the camera.        | Omni3D       | 600       |
| 3D   | Relative Distance    | Determine which of the two distinct objects is closer to the anchor object. | Omni3D       | 600       |
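
As a quick consistency check, the per-task sample counts can be tallied against the 2,638 total stated in the introduction. The counts below are copied from the table; the dictionary layout and the function name are illustrative only.

```python
# Per-task sample counts, keyed by (type, task), as listed in the table.
SAMPLES = {
    ("2D", "Spatial Relationship"): 650,
    ("2D", "Object Count"): 788,
    ("3D", "Depth Order"): 600,
    ("3D", "Relative Distance"): 600,
}


def totals_by_type(samples):
    """Sum the sample counts separately for each task type (2D, 3D)."""
    totals = {}
    for (task_type, _task), count in samples.items():
        totals[task_type] = totals.get(task_type, 0) + count
    return totals


by_type = totals_by_type(SAMPLES)
total = sum(SAMPLES.values())
print(by_type)  # {'2D': 1438, '3D': 1200}
print(total)    # 2638
```

The split is therefore 1,438 2D examples and 1,200 3D examples.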

Curation Process:

Questions for each task are programmatically constructed and then manually inspected to ensure clarity and accuracy. Any unclear, ambiguous, or erroneous questions are removed to maintain the benchmark's reliability.
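
The construct-then-filter shape of this process can be sketched as follows. This is an illustration under assumptions, not the benchmark's pipeline: the question template, the `Candidate` class, and the `min_margin` heuristic are hypothetical, and the automatic check only stands in for the kind of ambiguity a human reviewer would catch during manual inspection.

```python
from dataclasses import dataclass

# Hypothetical template for a depth-order question.
TEMPLATE = "Which object is closer to the camera, the {a} or the {b}?"


@dataclass
class Candidate:
    question: str
    choices: tuple
    answer: str
    margin: float  # absolute gap between the two depths being compared


def build_depth_question(a, depth_a, b, depth_b):
    """Construct one candidate question from two annotated object depths."""
    return Candidate(
        question=TEMPLATE.format(a=a, b=b),
        choices=(a, b),
        answer=a if depth_a < depth_b else b,
        margin=abs(depth_a - depth_b),
    )


def keep(candidate, min_margin=0.5):
    """Drop candidates that look ambiguous (assumed heuristic, not the real review)."""
    distinct_names = candidate.choices[0] != candidate.choices[1]
    return distinct_names and candidate.margin >= min_margin


candidates = [
    build_depth_question("lamp", 1.5, "sofa", 3.0),
    build_depth_question("chair", 2.0, "table", 2.1),  # near tie in depth
    build_depth_question("chair", 1.0, "chair", 4.0),  # same name: unclear referent
]
kept = [c for c in candidates if keep(c)]
```

Only the first candidate survives here; the other two illustrate the unclear or ambiguous questions that the curation step is meant to remove.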