M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto

2024-06-04

Audio Classification, Self-Supervised Learning, Transfer Learning, Linear evaluation

Abstract

Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in both ZS and transfer learning. To do so, we propose a new method, M2D-CLAP, which combines self-supervised learning Masked Modeling Duo (M2D) and CLAP. M2D learns an effective representation to model audio signals, and CLAP aligns the representation with text embedding. As a result, M2D-CLAP learns a versatile representation that allows for both ZS and transfer learning. Experiments show that M2D-CLAP performs well on linear evaluation, fine-tuning, and ZS classification with a GTZAN state-of-the-art of 75.17%, thus achieving a general-purpose audio-language representation.
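To make the zero-shot step concrete, the following is a minimal sketch of how a CLAP-style model can classify audio once audio and text embeddings share a space: each clip is assigned the label whose text embedding has the highest cosine similarity. The function names (`l2_normalize`, `zero_shot_classify`) and the toy 3-dimensional embeddings are illustrative assumptions, not the authors' code; in practice the embeddings would come from the trained audio and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_emb):
    """Assign each audio clip to the label with the most similar text embedding.

    audio_emb: (n_clips, dim) array of audio embeddings (hypothetical encoder output).
    text_emb:  (n_labels, dim) array of label-prompt embeddings (hypothetical encoder output).
    Returns (predicted label indices, cosine similarity matrix).
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    sim = a @ t.T  # shape: (n_clips, n_labels)
    return sim.argmax(axis=1), sim


# Toy stand-ins for encoder outputs: three label prompts, two audio clips.
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
audio_emb = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 0.8]])

pred, sim = zero_shot_classify(audio_emb, text_emb)
print(pred)
```

Because no classifier is trained, this procedure only works when the target task can be phrased as choosing among text labels, which is why the abstract notes that zero-shot inference does not cover tasks such as regression.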

Results

Task	Dataset	Metric	Value	Model
Audio Classification	ESC-50	Accuracy (5-fold)	97.4	M2D-CLAP/0.7
Audio Classification	ESC-50	Top-1 Accuracy	97.4	M2D-CLAP/0.7
Audio Classification	AudioSet	Test mAP	0.485	M2D-CLAP/0.7
Classification	ESC-50	Accuracy (5-fold)	97.4	M2D-CLAP/0.7
Classification	ESC-50	Top-1 Accuracy	97.4	M2D-CLAP/0.7
Classification	AudioSet	Test mAP	0.485	M2D-CLAP/0.7
