Jaeyeon Kim, Minjeon Jeon, JaeYoon Jung, Sang Hoon Woo, Jinjoo Lee
In this work, we analyze and optimize the EnCLAP framework, a state-of-the-art automated audio captioning model. We investigate the impact of modifying the acoustic encoder components, explore pretraining on datasets of different scales, and study the effectiveness of a reranking scheme (see the sketch below). Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.
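The abstract mentions a reranking scheme without spelling out its mechanics. Below is a minimal, hypothetical sketch of one common approach: generate several candidate captions, then keep the one whose text embedding is most similar to the audio embedding under a CLAP-style model. The `embed_audio`/`embed_text` stubs and the candidate list are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical caption reranking sketch (not the paper's exact method):
# score each candidate caption by cosine similarity between a CLAP-style
# audio embedding and text embedding, and return the best-scoring one.
import numpy as np

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Stub: replace with a real CLAP audio encoder."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(512)

def embed_text(caption: str) -> np.ndarray:
    """Stub: replace with the matching CLAP text encoder."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.standard_normal(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(waveform: np.ndarray, candidates: list[str]) -> str:
    """Return the candidate caption most similar to the input audio."""
    audio_emb = embed_audio(waveform)
    scores = [cosine(audio_emb, embed_text(c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    waveform = np.zeros(16000)  # 1 s of dummy audio
    candidates = [
        "A dog barks while cars pass by",
        "Rain falls on a tin roof",
        "People are talking in a crowded room",
    ]
    print(rerank(waveform, candidates))
```

In practice the candidates would come from sampling or beam search over the captioning decoder, and the scoring model would be a pretrained CLAP checkpoint rather than the random stubs above.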
Audio captioning results on AudioCaps:

| Model | CIDEr | METEOR | SPICE | SPIDEr | FENSE |
|---|---|---|---|---|---|
| EnCLAP++-base | 0.815 | 0.257 | 0.188 | 0.501 | 0.661 |
| EnCLAP++-large | 0.823 | 0.269 | 0.197 | 0.510 | 0.665 |