SPEECH-COCO

Speech | CC BY 4.0

SPEECH-COCO contains 616,767 spoken captions (more than 600 hours of audio) paired with images from the MSCOCO data set. The captions were generated with text-to-speech (TTS) synthesis.

Source: SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set