Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

2025-03-28

Tasks: Vocal Technique Classification · Text-to-Audio Retrieval · Representation Learning · Audio Classification · Text Retrieval · Audio-to-Text Retrieval · Self-Supervised Learning · Music Genre Classification · Audio Tagging · Audio Captioning · Music Auto-Tagging · Sentence Embeddings · Music Classification · Music Tagging · Instrument Recognition · Emotion Recognition · Singer Identification

Abstract

Contrastive language-audio pre-training (CLAP) has addressed audio-language tasks such as audio-text retrieval by aligning audio and text in a common feature space. While CLAP addresses general audio-language tasks, its audio features do not generalize well in audio tasks. In contrast, self-supervised learning (SSL) models learn general-purpose audio features that perform well in diverse audio tasks. We pursue representation learning that can be widely used in audio applications and hypothesize that a method that learns both general audio features and CLAP features should achieve our goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D2, a second-generation masked modeling duo (M2D) that combines an SSL M2D and CLAP. M2D2 learns two types of features using two modalities (audio and text) in a two-stage training process. It also utilizes advanced LLM-based sentence embeddings in CLAP training for powerful semantic supervision. In the first stage, M2D2 learns generalizable audio features from M2D and CLAP, where CLAP aligns the features with the fine LLM-based semantic embeddings. In the second stage, it learns CLAP features using the audio features learned from the LLM-based embeddings. Through these pre-training stages, M2D2 should enhance generalizability and performance in its audio and CLAP features. Experiments validated that M2D2 achieves effective general-purpose audio-language representation, highlighted with SOTA fine-tuning mAP of 49.0 for AudioSet, SOTA performance in music tasks, and top-level performance in audio-language tasks.
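The abstract's core training ingredient is CLAP-style contrastive alignment: audio and text embeddings are projected into a shared space and pulled together for matching pairs. A minimal sketch of such a symmetric InfoNCE objective is shown below; the embedding dimension, batch size, and temperature are illustrative assumptions, not the paper's actual hyperparameters, and this is not the authors' implementation.

```python
# Sketch of a CLAP-style symmetric contrastive (InfoNCE) loss.
# Hyperparameters (dim=512, temperature=0.07) are illustrative assumptions.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # matching pairs lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)   # audio -> text retrieval
    loss_t2a = F.cross_entropy(logits.T, targets) # text -> audio retrieval
    return (loss_a2t + loss_t2a) / 2

loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In M2D2's second stage, the text side would be the LLM-based sentence embeddings the abstract describes, rather than a jointly trained text encoder.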

Results

Task                 | Dataset       | Metric              | Value | Model
---------------------|---------------|---------------------|-------|---------
Music Auto-Tagging   | MagnaTagATune | PR-AUC              | 41.6  | M2D2 AS+
Music Auto-Tagging   | MagnaTagATune | ROC-AUC             | 91.8  | M2D2 AS+
Emotion Recognition  | Emomusic      | EmoA                | 77.4  | M2D-CLAP
Emotion Recognition  | Emomusic      | EmoV                | 61.9  | M2D-CLAP
Emotion Recognition  | Emomusic      | EmoA                | 76.7  | M2D2
Emotion Recognition  | Emomusic      | EmoV                | 59.3  | M2D2
Emotion Recognition  | Emomusic      | EmoA                | 76.1  | M2D
Emotion Recognition  | Emomusic      | EmoV                | 59.4  | M2D
Music Classification | VocalSet      | Accuracy            | 92.7  | M2D2 AS+
Music Classification | VocalSet      | Accuracy            | 91.8  | M2D2
Music Classification | VocalSet      | Accuracy            | 78.9  | M2D2 AS+
Music Classification | VocalSet      | Accuracy            | 77.4  | M2D2
Audio Classification | ESC-50        | Accuracy (5-fold)   | 98.5  | M2D2 AS+
Audio Classification | ESC-50        | Top-1 Accuracy      | 98.5  | M2D2 AS+
Audio Classification | AudioSet      | Test mAP            | 0.49  | M2D2
Instrument Recognition | NSynth      | Accuracy            | 80.6  | M2D-CLAP
Instrument Recognition | NSynth      | Accuracy            | 79.7  | M2D2 AS+
Instrument Recognition | NSynth      | Accuracy            | 78.7  | M2D AS
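The tagging metrics in the table (PR-AUC, ROC-AUC, and AudioSet's Test mAP) are conventionally macro-averaged over tags; mAP is the macro mean of per-class average precision. The sketch below shows how such scores are typically computed with scikit-learn; the toy labels and scores are invented for illustration and are not the paper's data.

```python
# Sketch of the macro-averaged tagging metrics reported in the table.
# The toy multi-label targets and scores below are illustrative only.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])            # tag targets, shape (clips, tags)
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])     # model scores per tag

# PR-AUC / mAP: average precision, macro-averaged over the tag dimension.
pr_auc = average_precision_score(y_true, y_score, average="macro")
# ROC-AUC, likewise macro-averaged over tags.
roc_auc = roc_auc_score(y_true, y_score, average="macro")
```

Note the scale convention: AudioSet mAP is usually quoted in [0, 1] (0.49 here), while the abstract quotes the same result as 49.0 on a percentage scale.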
