Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling

Published: 2025-02-09

Tasks: Automatic Speech Recognition (ASR) · Audio-Visual Speech Recognition · Visual Speech Recognition · Lipreading · Representation Learning · Knowledge Distillation

Paper · PDF · Code (official)

Abstract

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining; this loss is also applied during finetuning to further enhance performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method.
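The abstract describes distilling multi-layer hidden representations from SFM teachers into an audio-visual student via a multi-teacher ensemble and a representational knowledge distillation loss. The exact loss is not given here, so the sketch below is a hypothetical, minimal NumPy version assuming a common form for layer-wise representation distillation (L1 distance minus cosine similarity, averaged over teachers); the function name `rep_kd_loss` and the uniform teacher weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rep_kd_loss(student_feats, teacher_feats_list, weights=None):
    """Hypothetical multi-teacher representational KD loss (sketch).

    student_feats: (T, D) array of student hidden states for one layer,
        already projected into the teacher's feature dimension.
    teacher_feats_list: list of (T, D) arrays, one per SFM teacher
        (e.g. one from a self-supervised and one from a supervised SFM).
    weights: optional per-teacher ensemble weights (defaults to uniform).

    Returns a scalar: weighted average over teachers of
    (mean L1 distance - mean frame-wise cosine similarity).
    """
    if weights is None:
        weights = [1.0 / len(teacher_feats_list)] * len(teacher_feats_list)
    total = 0.0
    for w, tf in zip(weights, teacher_feats_list):
        # elementwise L1 term pulls student features toward the teacher's
        l1 = np.abs(student_feats - tf).mean()
        # frame-wise cosine similarity rewards directional agreement
        num = (student_feats * tf).sum(axis=-1)
        den = (np.linalg.norm(student_feats, axis=-1)
               * np.linalg.norm(tf, axis=-1) + 1e-8)
        cos = (num / den).mean()
        total += w * (l1 - cos)
    return total
```

When the student exactly matches a single teacher, the L1 term is 0 and cosine similarity is 1, so the loss reaches its minimum of -1 for that teacher; mismatched features raise it. In practice such a loss would be summed over several teacher layers, one reading of the "multi-layer hidden representations" in the abstract.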

Results

Task                               | Dataset  | Metric                | Value | Model
-----------------------------------|----------|-----------------------|-------|----------
Speech Recognition                 | LRS3-TED | WER                   | 1.4   | DistillAV
Audio-Visual Speech Recognition    | LRS3-TED | Word Error Rate (WER) | 1.3   | DistillAV
Lipreading                         | LRS3-TED | Word Error Rate (WER) | 26.2  | DistillAV
Natural Language Transduction      | LRS3-TED | Word Error Rate (WER) | 26.2  | DistillAV
Automatic Speech Recognition (ASR) | LRS3-TED | WER                   | 1.4   | DistillAV

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)