Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Language Identification Using Deep Convolutional Recurrent Neural Networks

Christian Bartz, Tom Herold, Haojin Yang, Christoph Meinel

2017-08-16
Speech Recognition, Automatic Speech Recognition, Language Identification, Automatic Speech Recognition (ASR), speech-recognition, Spoken language identification, General Classification

Abstract

Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. We use a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets. In extensive experiments we show that our model is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages, while maintaining its classification accuracy. We release our code and a large-scale training set for LID systems to the community.
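
The abstract describes a model that operates on spectrogram images of audio snippets. As a rough illustration of that first step only, the sketch below turns a waveform into a grayscale spectrogram using the Python standard library. It is not the authors' preprocessing: the frame length, hop size, Hann window, log scaling, and the function names `spectrogram` and `to_grayscale` are all assumptions made for this example, and a synthetic tone stands in for real speech.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one row of DFT magnitudes (bins 0..frame_len//2) per frame.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT, O(frame_len^2) per frame; acceptable for a sketch.
        mags = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def to_grayscale(frames):
    """Scale log-magnitudes to 0..255 so the result can be treated as an image."""
    logs = [[math.log1p(m) for m in row] for row in frames]
    peak = max(max(row) for row in logs) or 1.0
    return [[int(round(255 * v / peak)) for v in row] for row in logs]


# A synthetic 1 kHz tone sampled at 8 kHz stands in for a speech snippet.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
image = to_grayscale(spectrogram(tone))
```

Each row of `image` is one time frame and each column one frequency bin, so the array can be handed to an image classifier like any other single-channel picture. In practice an FFT routine from a numerical library would replace the naive DFT loop.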

Results

The same results are listed under three task labels: Dialogue, Spoken Language Understanding, and Dialogue Understanding.

Dataset                                    Model              Accuracy  F1 Score
YouTube News dataset (Background Music)    Inception-v3 CRNN  0.89      0.89
YouTube News dataset (Background Music)    CRNN               0.7       0.7
YouTube News dataset (No Noise)            Inception-v3 CRNN  0.96      0.96
YouTube News dataset (No Noise)            CRNN               0.91      0.91
YouTube News dataset (Crackling Noise)     Inception-v3 CRNN  0.93      0.93
YouTube News dataset (Crackling Noise)     CRNN               0.82      0.83
YouTube News dataset (White Noise)         Inception-v3 CRNN  0.91      0.91
YouTube News dataset (White Noise)         CRNN               0.63      0.63
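
For readers unfamiliar with the two metrics reported above, the following sketch shows how accuracy and an F1 score can be computed from predicted and true language labels. The page does not say which F1 averaging scheme was used, so macro averaging is an assumption here, and the language codes and predictions are made-up illustration data, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six audio snippets.
y_true = ["en", "en", "de", "de", "fr", "fr"]
y_pred = ["en", "de", "de", "de", "fr", "en"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

With these toy labels, four of six predictions are correct, and the per-class F1 scores are 0.8 for `de`, 0.5 for `en`, and 2/3 for `fr`. Accuracy and F1 can therefore differ slightly even on the same predictions, as in the CRNN row for Crackling Noise.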

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)