Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Date: 2025-02-17
Venue: ICASSP 2025
Tasks: Environmental Sound Classification, TAG, Representation Learning, Audio Classification, Self-Supervised Learning, Music Genre Classification, Self-Supervised Audio Classification, Audio Tagging, Music Auto-Tagging, Prediction, Classification, Music Tagging, Instrument Recognition


Abstract

Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that may be better suited to downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second is an unsupervised classification task, which uses the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K, and outperforms comparable supervised methods for musical auto-tagging on Magna-tag-a-tune.
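
The abstract describes two pretext tasks trained jointly. The following is a minimal NumPy sketch of how such a joint objective could be combined, not the authors' implementation. Everything specific in it is an assumption made for illustration: the mean-squared-error form of the prediction loss, the cross-entropy form of the classification loss, the temperature values, the weighting factor alpha, and all function names.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def prediction_loss(student_pred, teacher_latent):
    """Masked latent prediction term (assumed here to be a plain MSE).

    student_pred:   student's predictions for the masked positions
    teacher_latent: teacher's latents at the same positions (fixed targets)
    """
    return float(np.mean((student_pred - teacher_latent) ** 2))


def classification_loss(student_logits, teacher_logits,
                        t_student=0.1, t_teacher=0.05):
    """Unsupervised classification term (assumed cross-entropy).

    Matches the student's probability distribution to the teacher's.
    The temperatures are hypothetical values chosen for illustration.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-np.mean(np.sum(p_teacher * log_p_student, axis=-1)))


def joint_loss(student_pred, teacher_latent,
               student_logits, teacher_logits, alpha=0.5):
    """Weighted sum of the two pretext losses (alpha is hypothetical)."""
    l_pred = prediction_loss(student_pred, teacher_latent)
    l_cls = classification_loss(student_logits, teacher_logits)
    return alpha * l_pred + (1.0 - alpha) * l_cls


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 4 masked positions, 8-dimensional latents, 16 pseudo-classes
    student_pred = rng.normal(size=(4, 8))
    teacher_latent = rng.normal(size=(4, 8))
    student_logits = rng.normal(size=(4, 16))
    teacher_logits = rng.normal(size=(4, 16))
    print(joint_loss(student_pred, teacher_latent,
                     student_logits, teacher_logits))
```

In an actual training setup the teacher outputs would be treated as fixed targets and only the student would receive gradients; this sketch only evaluates the loss values on random arrays.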

Results

Task	Dataset	Metric	Value	Model
Music Auto-Tagging	MagnaTagATune	PR-AUC	41.1	MATPAC (SSL, linear eval)
Music Auto-Tagging	MagnaTagATune	ROC AUC	91.6	MATPAC (SSL, linear eval)
Audio Classification	ESC-50	Accuracy (5-fold)	93.5	MATPAC (SSL model, linear eval)
Audio Classification	ESC-50	Top-1 Accuracy	93.5	MATPAC (SSL model, linear eval)
Audio Classification	FSD50K	mAP	55.2	MATPAC (SSL Model)
Audio Classification	UrbanSound8K	Accuracy	89.4	MATPAC (SSL, linear eval)
Environmental Sound Classification	UrbanSound8K	Accuracy	89.4	MATPAC (SSL, linear eval)
Classification	ESC-50	Accuracy (5-fold)	93.5	MATPAC (SSL model, linear eval)
Classification	ESC-50	Top-1 Accuracy	93.5	MATPAC (SSL model, linear eval)
Classification	FSD50K	mAP	55.2	MATPAC (SSL Model)
Classification	UrbanSound8K	Accuracy	89.4	MATPAC (SSL, linear eval)
Instrument Recognition	OpenMIC-2018	mAP	0.854	MATPAC (SSL Model, linear eval)
Instrument Recognition	NSynth	Accuracy	74.6	MATPAC (SSL, linear eval)
