Xian Li, Xiaofei Li
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers that knowledge to a specific problem with a limited amount of labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL, and proposes a new transformer-based teacher-student SSL model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which substantially improves the modeling capability of pre-training. In addition, a new strategy for positive pair creation is designed to fully leverage the capability of the transformer. Extensive experiments have been conducted, and the proposed model achieves new state-of-the-art results on almost all of the downstream tasks.
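The teacher-student scheme described above can be illustrated with a minimal sketch: two random segments of the same clip form a positive pair, the student is trained to match the teacher's embedding, and the teacher is an exponential moving average (EMA) of the student. This is a generic illustration, not ATST's actual architecture; the linear `Encoder`, segment length, and momentum value are placeholders (ATST uses a transformer encoder and its own pair-creation strategy).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(wave, seg_len):
    # Sample a random segment of the clip; two such crops of the
    # same clip form a positive pair.
    start = rng.integers(0, len(wave) - seg_len + 1)
    return wave[start:start + seg_len]

class Encoder:
    # Stand-in linear "encoder"; ATST uses a transformer here.
    def __init__(self, dim_in, dim_out):
        self.W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
    def __call__(self, x):
        return x @ self.W

def ema_update(teacher, student, m=0.99):
    # Teacher weights track an exponential moving average of the
    # student's weights; the teacher receives no gradient.
    teacher.W = m * teacher.W + (1.0 - m) * student.W

wave = rng.standard_normal(16000)            # 1 s of fake audio at 16 kHz
seg, dim = 4000, 32
student, teacher = Encoder(seg, dim), Encoder(seg, dim)
teacher.W = student.W.copy()

v1, v2 = random_crop(wave, seg), random_crop(wave, seg)   # positive pair
z_s = student(v1) / np.linalg.norm(student(v1))
z_t = teacher(v2) / np.linalg.norm(teacher(v2))
loss = float(np.sum((z_s - z_t) ** 2))       # student matches teacher view
ema_update(teacher, student)
```

In the real model the loss is backpropagated only through the student branch, and the EMA update is applied after each optimizer step.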
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speaker Identification | VoxCeleb1 | Top-1 Accuracy (%) | 94.3 | ATST Base (ours) |
| Audio Classification | AudioSet (balanced) | mAP | 37.4 | ATST Base (ours) |
| Spoken Command Recognition | Speech Commands V2 | Accuracy (%) | 98 | ATST Base (ours) |