Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments

Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii

2022-07-15

Speech Recognition, Automatic Speech Recognition (ASR), Distant Speech Recognition, Speech Enhancement

Abstract

This paper describes noisy speech recognition for an augmented reality headset that supports verbal communication in real multiparty conversational environments. A major approach, actively studied in simulated environments, is to sequentially perform speech enhancement and automatic speech recognition (ASR) using deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work because of the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly confident estimated transcriptions. Comparative experiments using a state-of-the-art distant speech recognition system show that the proposed method significantly improves ASR performance.
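
As a rough illustration of the mask-based beamforming step mentioned in the abstract, the following NumPy sketch computes MVDR weights from a time-frequency speech mask. It is not the authors' implementation: the mask is taken as a given input (the paper estimates it with a direction-aware DNN), the reference-channel MVDR formulation and the diagonal loading are assumptions, and the names `spatial_covariance`, `mask_based_mvdr`, `ref_mic`, and `diag_load` are hypothetical.

```python
import numpy as np


def spatial_covariance(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix for each frequency bin.

    stft: complex array of shape (F, T, M) = (frequency, frame, microphone)
    mask: real array of shape (F, T) with values in [0, 1]
    """
    weighted = stft * mask[..., None]
    # cov[f, m, n] = sum_t mask[f, t] * x[f, t, m] * conj(x[f, t, n])
    cov = np.einsum("ftm,ftn->fmn", weighted, stft.conj())
    return cov / (mask.sum(axis=1)[:, None, None] + eps)


def mask_based_mvdr(stft, speech_mask, ref_mic=0, diag_load=1e-6):
    """Apply a reference-channel MVDR beamformer driven by a speech mask.

    Returns the enhanced single-channel STFT of shape (F, T).
    """
    n_mics = stft.shape[-1]
    phi_s = spatial_covariance(stft, speech_mask)        # speech statistics
    phi_n = spatial_covariance(stft, 1.0 - speech_mask)  # noise statistics

    # Assumed regularisation: diagonal loading relative to the noise power,
    # so that the linear solve below stays well conditioned.
    power = np.trace(phi_n, axis1=1, axis2=2).real[:, None, None] / n_mics
    phi_n = phi_n + diag_load * power * np.eye(n_mics)

    # w(f) = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
    # where u selects the reference microphone.
    numerator = np.linalg.solve(phi_n, phi_s)            # shape (F, M, M)
    trace = np.trace(numerator, axis1=1, axis2=2)[:, None]
    weights = numerator[:, :, ref_mic] / (trace + 1e-8)  # shape (F, M)

    # y(f, t) = w(f)^H x(f, t)
    return np.einsum("fm,ftm->ft", weights.conj(), stft)
```

In the setting described by the abstract, `speech_mask` would come from the adapted mask estimator, and the beamformed output would be passed to the ASR model.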

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	EasyCom	WER (%)	62.36	DAJA (MVDR,HMA,1000) (Overlapped Speech)
Speech Enhancement	EasyCom	SDR	-4.76	DAJA (MVDR,HMA,1000) (Overlapped Speech)
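
For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The scoring tool and text normalisation used for the reported value are not stated on this page, so the function name `word_error_rate` and the whitespace tokenisation are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate in percent.

    WER = 100 * (substitutions + deletions + insertions) / reference words,
    computed with a word-level Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.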

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Autoregressive Speech Enhancement via Acoustic Tokens (2025-07-17)
P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
Robust One-step Speech Enhancement via Consistency Distillation (2025-07-08)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)