
AV-MNIST is a simple audio-visual dataset artificially assembled from independent visual and audio datasets. The visual modality consists of 28 × 28 MNIST images with 75% of their energy removed by PCA. The audio modality consists of 112 × 112 spectrograms computed from audio samples: 25,102 spoken digits from the TIDIGITS database, augmented by adding randomly chosen noise samples from the ESC-50 dataset. The noisy audio samples are randomly paired with MNIST digits of the same label to obtain 55,000 pairs for training and 10,000 pairs for testing. For validation, we take 5,000 samples from the training set.
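The two construction steps described above (energy removal by PCA and label-consistent pairing) can be illustrated with a minimal sketch. This is not the dataset authors' code. The function names `remove_energy_pca` and `pair_by_label` are hypothetical. It assumes that "energy" means variance of the mean-centered pixels, and that audio samples are drawn with replacement when pairing.

```python
import numpy as np


def remove_energy_pca(images, removed_fraction=0.75):
    """Reconstruct images from the leading principal components only.

    Assumption: "energy" is the variance of the mean-centered pixels, so we
    keep the fewest components retaining at least (1 - removed_fraction) of it.
    """
    flat = images.reshape(len(images), -1).astype(np.float64)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Squared singular values are proportional to per-component variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, 1.0 - removed_fraction)) + 1
    basis = vt[:k]
    reconstructed = centered @ basis.T @ basis + mean
    return reconstructed.reshape(images.shape), k


def pair_by_label(image_labels, audio_labels, rng):
    """For each image, pick a random audio index that shares its label.

    Assumption: sampling is with replacement, so audio clips may be reused.
    """
    audio_by_label = {
        int(c): np.flatnonzero(audio_labels == c) for c in np.unique(audio_labels)
    }
    return np.array([rng.choice(audio_by_label[int(c)]) for c in image_labels])


# Toy stand-ins for MNIST images and audio labels (random data, not the real sets).
rng = np.random.default_rng(0)
toy_images = rng.random((100, 28, 28))
toy_image_labels = rng.integers(0, 10, size=100)
toy_audio_labels = np.repeat(np.arange(10), 5)

degraded, n_components = remove_energy_pca(toy_images)
audio_index = pair_by_label(toy_image_labels, toy_audio_labels, rng)
```

Here `audio_index[i]` is the index of the audio sample paired with image `i`. In the real dataset each index would point to a 112 × 112 spectrogram of a noise-augmented TIDIGITS recording.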

AV-MNIST