
The Simple Shapes Dataset consists of 32x32 pixel images of shapes with multiple attributes (size, location, rotation, color). Each image is paired with its ground-truth attributes and a natural-language description of the image in English.

The dataset comprises a training set of 500,000 samples, plus a validation set and a test set of 1,000 samples each.

It also includes precomputed 12-dimensional visual features (from a VAE) and precomputed BERT features of the text descriptions.
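To make the contents of one sample concrete, here is a minimal Python sketch of a record holding the pieces described above. It is not the dataset's actual file layout or loading API: the class `ShapeSample`, its field names, the helper `make_placeholder_sample`, the RGB pixel representation, and the default BERT feature length of 768 (the hidden size of BERT-base) are all assumptions made for illustration. Only the 32x32 image size, the four attribute names, the 12-dimensional VAE features, and the split sizes come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 500_000, "val": 1_000, "test": 1_000}

IMAGE_SIZE = 32  # images are 32x32 pixels
VAE_DIM = 12     # length of the precomputed VAE feature vector
ATTRIBUTE_NAMES = ("size", "location", "rotation", "color")


@dataclass
class ShapeSample:
    """Hypothetical in-memory record; names and types are illustrative only."""
    image: List[List[Tuple[int, int, int]]]  # rows of pixels (RGB is assumed)
    attributes: Dict[str, object]            # ground-truth attributes
    caption: str                             # English description of the image
    vae_features: List[float]                # 12 values
    bert_features: List[float]               # length depends on the BERT variant

    def validate(self) -> None:
        """Raise ValueError if the record contradicts the description."""
        rows_ok = len(self.image) == IMAGE_SIZE
        if not rows_ok or any(len(row) != IMAGE_SIZE for row in self.image):
            raise ValueError("image must be 32x32 pixels")
        missing = [n for n in ATTRIBUTE_NAMES if n not in self.attributes]
        if missing:
            raise ValueError(f"missing attributes: {missing}")
        if len(self.vae_features) != VAE_DIM:
            raise ValueError("VAE features must be 12-dimensional")
        if not self.caption:
            raise ValueError("caption must be non-empty")


def make_placeholder_sample(bert_dim: int = 768) -> ShapeSample:
    """Build an all-zero placeholder record (no real data involved).

    bert_dim=768 is an assumption; the description does not say
    which BERT variant produced the text features.
    """
    black = (0, 0, 0)
    return ShapeSample(
        image=[[black] * IMAGE_SIZE for _ in range(IMAGE_SIZE)],
        attributes={name: None for name in ATTRIBUTE_NAMES},
        caption="placeholder caption",
        vae_features=[0.0] * VAE_DIM,
        bert_features=[0.0] * bert_dim,
    )
```

Calling `make_placeholder_sample().validate()` passes silently, while a record with, say, a 5-value `vae_features` list raises `ValueError`. A check like this could be run once after downloading the data to confirm that a custom loader yields records of the expected shape.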

Link to dataset: https://zenodo.org/record/8112838

Simple Shapes Dataset