
We introduce IS3, a new synthetic test set for interactive sound source localization. Using diffusion models, we generate images that each contain multiple sounding objects, and any combination of sounding objects can appear in the same scene. The dataset also covers unusual scenes and combinations that are rarely found in nature, such as ‘a donkey playing a saxophone’ or ‘a sea lion on the snow’. Each image comes with segmentation maps and bounding boxes annotated with class categories. IS3 includes 3240 images with 2 sounding objects per image, giving 6480 unique audio-visual instances across 118 categories. The dataset can be used for the following tasks:

Sound Source Localization
Audio-Visual Segmentation
Semantic Segmentation
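
To illustrate how the instance count in the description follows from the image count, here is a minimal sketch in Python. The record layout (`Scene`, `SoundingObject`, the `bbox` tuple order, the example ID and coordinates) is hypothetical and chosen only for illustration; it is not the dataset's actual file format, and segmentation maps are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SoundingObject:
    # Hypothetical annotation record, not the official IS3 schema.
    category: str
    bbox: Tuple[int, int, int, int]  # assumed order: x_min, y_min, x_max, y_max


@dataclass
class Scene:
    image_id: str
    objects: List[SoundingObject]


def expand_instances(scenes: List[Scene]) -> List[Tuple[str, str, Tuple[int, int, int, int]]]:
    """Return one (image_id, category, bbox) tuple per sounding object."""
    return [(s.image_id, o.category, o.bbox) for s in scenes for o in s.objects]


# A made-up scene with two sounding objects, in the spirit of
# 'a donkey playing a saxophone'. Coordinates are placeholders.
scene = Scene(
    image_id="example_0001",
    objects=[
        SoundingObject("donkey", (10, 20, 200, 220)),
        SoundingObject("saxophone", (120, 90, 180, 210)),
    ],
)
instances = expand_instances([scene])

# Counts stated in the description: 3240 images, 2 objects per image.
num_images = 3240
objects_per_image = 2
total_instances = num_images * objects_per_image
```

Under these assumptions, each image contributes one audio-visual instance per sounding object, so 3240 images with 2 objects each give the 6480 instances reported above.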

IS3 (Interactive-Synthetic Sound Source) Dataset