Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OVIC Datasets

Open Vocabulary Image Classification Datasets

Images · CC BY-NC-SA 4.0 · Introduced 2024-07-15

Due to the free-form nature of the open vocabulary image classification task, special annotations are required for image sets used for evaluation purposes. Three such image datasets are presented here:

  • World: 272 images, the vast majority of which are originally sourced (have never appeared on the internet), collected in 10 countries by 12 people, with an active focus on covering as wide and varied a set of concepts as possible, including unusual, deceptive, and/or indirect representations of objects,
  • Wiki: 1000 Wikipedia lead images sampled from a scraped pool of 18K,
  • Val3K: 3000 images from the ImageNet-1K validation set, sampled uniformly across the classes.
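Uniform sampling across classes, as used for Val3K, can be sketched as follows. This is an illustrative sketch only (the function name and the `(image_id, class_id)` input format are assumptions, not the actual selection code, which lives in the NOVIC repository):

```python
import random
from collections import defaultdict

def sample_uniform_per_class(samples, per_class, seed=0):
    """Select an equal number of images from each class.

    samples: iterable of (image_id, class_id) pairs
    per_class: number of images to keep per class
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    by_class = defaultdict(list)
    for image_id, class_id in samples:
        by_class[class_id].append(image_id)
    selected = []
    for class_id in sorted(by_class):  # deterministic class order
        selected.extend(rng.sample(by_class[class_id], per_class))
    return selected

# For Val3K: 1000 ImageNet-1K classes x 3 images per class = 3000 images
```

With the 50 validation images available per ImageNet-1K class, keeping 3 per class yields the 3000-image subset described above.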

It is not in general possible to exhaustively annotate ground truth classification labels for open vocabulary image sets, as this would require annotations for every possible correct object noun in the English language for every visible entity in every part of every image. It is, however, possible to annotate the thousands of predictions that have been made across the image sets by the open vocabulary models trained thus far.

All three image datasets presented here have been individually annotated by both human and multimodal LLM annotators for the object nouns predicted by trained models. The annotations specify whether each classification is correct, close, or incorrect, and, for the human annotations, whether it relates to a primary or secondary element of the image. The suffixes -H and -L are customarily used to specify which annotations are being referred to, e.g. Wiki-H is the Wiki dataset with the corresponding human annotations. Together, the three datasets contain a total of 17.4K human and 112K LLM class annotations.
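The annotation scheme above can be illustrated with a small sketch. The record layout and field names here are hypothetical (the actual file format is defined in the NOVIC repository); the sketch only shows how a per-prediction verdict table would be used to grade model outputs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ClassAnnotation:
    """One annotated model prediction (hypothetical layout)."""
    image_id: str
    noun: str               # predicted object noun
    verdict: str            # 'correct', 'close', or 'incorrect'
    primary: Optional[bool] # primary vs secondary image element (human annotations only)

def grade_predictions(predictions, annotations):
    """Look up each (image_id, noun) prediction in the annotation
    table and tally the verdicts."""
    table = {(a.image_id, a.noun): a.verdict for a in annotations}
    counts = {'correct': 0, 'close': 0, 'incorrect': 0, 'unannotated': 0}
    for image_id, noun in predictions:
        counts[table.get((image_id, noun), 'unannotated')] += 1
    return counts
```

Predictions not found in the annotation table are counted as unannotated; in practice these are the predictions that require a further annotation pass, as supported by the NOVIC annotation-update tools.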

The data is directly available at the following links:

  • World dataset
  • Wiki dataset
  • Val3K dataset

Refer to the NOVIC code for an example of how the datasets can be used, as well as tools for updating the class annotations for newer model predictions.

Related Benchmarks

  • OVIC Datasets (Val3K) / Zero-Shot Image Classification / Prediction Score (mean of 3)
  • OVIC Datasets (Val3K) / Zero-Shot Image Classification / Top 1 Accuracy (mean of 3)
  • OVIC Datasets (Wiki-H) / Zero-Shot Image Classification / Overall Score
  • OVIC Datasets (Wiki-H) / Zero-Shot Image Classification / Prediction Score
  • OVIC Datasets (Wiki-H) / Zero-Shot Image Classification / Prediction Score (mean of 3)
  • OVIC Datasets (Wiki-H) / Zero-Shot Image Classification / Top 1 Accuracy
  • OVIC Datasets (Wiki-L) / Zero-Shot Image Classification / Prediction Score (mean of 3)
  • OVIC Datasets (World-H) / Zero-Shot Image Classification / Overall Score
  • OVIC Datasets (World-H) / Zero-Shot Image Classification / Prediction Score
  • OVIC Datasets (World-H) / Zero-Shot Image Classification / Prediction Score (mean of 3)
  • OVIC Datasets (World-H) / Zero-Shot Image Classification / Top 1 Accuracy

Statistics

  • Papers: 1
  • Benchmarks: 0

Links

Homepage

Tasks

  • Open Vocabulary Image Classification
  • Open Vocabulary Object Detection
  • Zero-Shot Image Classification