Clotho
Audio | Texts | Other (Attribution) | Introduced 2019-01-01
Clotho is an audio captioning dataset of 4981 audio samples, each with five captions, for a total of 24 905 captions. Audio samples are 15 to 30 s long, and captions are 8 to 20 words long.
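The figures in the description can be checked with a few lines of Python. This is a minimal sketch, not part of the Clotho tooling: the constant and function names are hypothetical, and counting words by whitespace splitting is an assumption about how caption length is measured.

```python
# Figures quoted in the dataset description.
NUM_AUDIO_SAMPLES = 4981
CAPTIONS_PER_SAMPLE = 5
MIN_DURATION_S, MAX_DURATION_S = 15.0, 30.0
MIN_WORDS, MAX_WORDS = 8, 20


def total_captions() -> int:
    """Number of captions implied by the sample count and captions per sample."""
    return NUM_AUDIO_SAMPLES * CAPTIONS_PER_SAMPLE


def within_stated_ranges(duration_s: float, caption: str) -> bool:
    """Check one (audio duration, caption) pair against the stated ranges.

    Assumption: words are counted by splitting on whitespace.
    """
    n_words = len(caption.split())
    duration_ok = MIN_DURATION_S <= duration_s <= MAX_DURATION_S
    length_ok = MIN_WORDS <= n_words <= MAX_WORDS
    return duration_ok and length_ok
```

Here total_captions() reproduces the 24 905 figure (4981 x 5), and within_stated_ranges can serve as a quick filter when inspecting a local copy of the data.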
Source: https://zenodo.org/record/3490684
Image Source: https://arxiv.org/abs/1910.09387
Benchmarks
Audio captioning: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, Sentence-BERT
Text to Audio Retrieval: R@1, R@5, R@10, mAP@10
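For the retrieval benchmarks, R@k is commonly computed as the fraction of text queries whose matching audio clip appears among the top k retrieved items. The sketch below illustrates that definition only. The function name and the input layout (one ranked list of clip IDs per query, one target clip ID per query) are assumptions for illustration, not the evaluation code used by any benchmark listed here.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target clip is within the top-k results.

    ranked_ids[i] is the retrieval ranking (best first) for query i.
    target_ids[i] is the single relevant clip for query i (an assumption:
    each caption is paired with exactly one audio clip).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("one ranking is required per query")
    if not target_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)
```

With two queries whose rankings are ["a", "b", "c"] and ["c", "a", "b"] and targets "a" and "b", the function gives 0.5 for k = 1 and 1.0 for k = 3.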