Is Attention always needed? A Case Study on Language Identification from Speech

Atanu Mandal, Santanu Pal, Indranil Dutta, Mahidas Bhattacharya, Sudip Kumar Naskar

2021-10-05

Speech Recognition, Automatic Speech Recognition, Language Identification, Automatic Speech Recognition (ASR), speech-recognition, Spoken language identification, General Classification

Abstract

Language Identification (LID) is a crucial preliminary step in Automatic Speech Recognition (ASR): it identifies the spoken language from audio samples. Contemporary systems that process speech in multiple languages require users to explicitly designate one or more languages before use. LID therefore matters most in multilingual settings, where an ASR system that cannot determine the spoken language fails to recognize the speech. This study introduces a convolutional recurrent neural network (CRNN) based LID model that operates on Mel-frequency Cepstral Coefficient (MFCC) features of audio samples. We also replicate two state-of-the-art approaches, a Convolutional Neural Network (CNN) and an attention-based Convolutional Recurrent Neural Network (CRNN with attention), and compare them with our CRNN-based approach. We evaluated the models on thirteen Indian languages, and our model achieved over 98% classification accuracy. The LID model reaches accuracies ranging from 97% to 100% for languages that are linguistically similar. The proposed model extends readily to additional languages and is robust to noise, achieving 91.2% accuracy in a noisy setting on a European Language (EU) dataset.
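The abstract reports both an overall classification accuracy and a per-language range. As a minimal sketch of how those two figures relate, the snippet below computes them from paired label lists. The function name `per_language_accuracy` and the toy labels (`bn`, `hi`, `ta`) are placeholders invented for illustration; they are not the authors' evaluation code or data.

```python
from collections import defaultdict


def per_language_accuracy(true_labels, predicted_labels):
    """Return (overall_accuracy, {language: accuracy}) for paired label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    correct = defaultdict(int)  # correct predictions per true language
    total = defaultdict(int)    # number of samples per true language
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    overall = sum(correct.values()) / len(true_labels)
    by_language = {lang: correct[lang] / total[lang] for lang in total}
    return overall, by_language


# Toy labels, invented for illustration only (not data from the paper).
truth = ["bn", "bn", "hi", "hi", "ta", "ta"]
guess = ["bn", "hi", "hi", "hi", "ta", "ta"]
overall, by_language = per_language_accuracy(truth, guess)
```

With these toy labels, one of the two `bn` samples is misclassified, so the per-language accuracy for `bn` is lower than the overall figure. That gap is what a per-language range such as "97% to 100%" summarizes.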

Results

Tasks (identical results are listed under each): Dialogue, Spoken Language Understanding, Dialogue Understanding

Dataset	Metric	Value	Model
YouTube News dataset (No Noise)	Accuracy	0.967	CRNN
YouTube News dataset (No Noise)	Accuracy	0.966	CRNN Attention
YouTube News dataset (No Noise)	Accuracy	0.948	CNN
IndicTTS	Classification Accuracy	0.987	CRNN
IndicTTS	Classification Accuracy	0.987	CRNN Attention
IndicTTS	Classification Accuracy	0.983	CNN
YouTube News dataset (White Noise)	Accuracy	0.912	CRNN
YouTube News dataset (White Noise)	Accuracy	0.888	CRNN Attention
YouTube News dataset (White Noise)	Accuracy	0.871	CNN
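To make the noise-robustness comparison concrete, the following sketch computes how many percentage points each model loses between the No Noise and White Noise rows of the YouTube News dataset in the table above. The accuracies are copied from the table; the helper name `noise_drop` is an assumption made for this illustration.

```python
# Accuracies copied from the Results table (YouTube News dataset).
clean = {"CRNN": 0.967, "CRNN Attention": 0.966, "CNN": 0.948}
noisy = {"CRNN": 0.912, "CRNN Attention": 0.888, "CNN": 0.871}


def noise_drop(clean_acc, noisy_acc):
    """Accuracy lost under noise, in percentage points, rounded to one decimal."""
    return {
        model: round((clean_acc[model] - noisy_acc[model]) * 100, 1)
        for model in clean_acc
    }


drops = noise_drop(clean, noisy)
# The model with the smallest drop is the most robust to the added noise.
most_robust = min(drops, key=drops.get)

for model, drop in drops.items():
    print(f"{model}: -{drop} points")
```

On these figures the plain CRNN loses 5.5 points, while CRNN with attention and the CNN lose 7.8 and 7.7 points respectively, so the model without attention degrades least under white noise.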
