Jaeyeon Kim, Minjeon Jeon, JaeYoon Jung, Sang Hoon Woo, Jinjoo Lee
In this work, we analyze and optimize the EnCLAP framework, a state-of-the-art automated audio captioning model. We investigate the impact of modifying the acoustic encoder components, explore pretraining on datasets of different scales, and study the effectiveness of a reranking scheme (see the sketch below). Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.
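The abstract mentions a reranking scheme without spelling out its mechanics. Below is a minimal, hypothetical sketch of one common approach: generate several candidate captions, then keep the one whose text embedding is most similar to the audio embedding under a CLAP-style model. The `embed_audio`/`embed_text` stubs and the candidate list are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical caption reranking sketch (not the paper's exact method):
# score each candidate caption by cosine similarity between a CLAP-style
# audio embedding and text embedding, and return the best-scoring one.
import numpy as np

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Stub: replace with a real CLAP audio encoder."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(512)

def embed_text(caption: str) -> np.ndarray:
    """Stub: replace with the matching CLAP text encoder."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.standard_normal(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(waveform: np.ndarray, candidates: list[str]) -> str:
    """Return the candidate caption most similar to the input audio."""
    audio_emb = embed_audio(waveform)
    scores = [cosine(audio_emb, embed_text(c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    waveform = np.zeros(16000)  # 1 s of dummy audio
    candidates = [
        "A dog barks while cars pass by",
        "Rain falls on a tin roof",
        "People are talking in a crowded room",
    ]
    print(rerank(waveform, candidates))
```

In practice the candidates would come from sampling or beam search over the captioning decoder, and the scoring model would be a pretrained CLAP checkpoint rather than the random stubs above.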
Audio captioning results on AudioCaps:

| Model | CIDEr | METEOR | SPICE | SPIDEr | FENSE |
|---|---|---|---|---|---|
| EnCLAP++-base | 0.815 | 0.257 | 0.188 | 0.501 | 0.661 |
| EnCLAP++-large | 0.823 | 0.269 | 0.197 | 0.510 | 0.665 |