Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Spoken Language Identification using ConvNets

Sarthak, Shikhar Shukla, Govind Mittal

2019-10-09 · Keyword Spotting · Language Identification · Spoken Language Identification

Abstract

Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. The problem can be approached implicitly, where only the speech for a language is available, or explicitly, where the speech is available with its corresponding transcript. This paper focuses on the implicit approach due to the absence of transcribed data. It benchmarks existing models and proposes a new attention-based model for language identification that uses log-Mel spectrogram images as input. We also present the effectiveness of raw waveforms as features to neural network models for LI tasks. For training and evaluation, we used the VoxForge dataset and classified six languages (English, French, German, Spanish, Russian and Italian) with an accuracy of 95.4% and four languages (English, French, German, Spanish) with an accuracy of 96.3%. This approach can be scaled further to incorporate more languages.
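The abstract names log-Mel spectrogram images as the model input but does not give the extraction settings. The following is a minimal NumPy sketch of how such a feature can be computed, not the authors' pipeline. The function names and every parameter value (16 kHz sample rate, 512-point FFT, hop of 160 samples, 40 mel bands) are assumptions chosen for illustration, and the input is assumed to hold at least `n_fft` samples.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centres are evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Hypothetical defaults; the source does not state the paper's settings.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    # Convert to decibels; the small constant avoids log(0).
    return 10.0 * np.log10(mel + 1e-10).T                 # (n_mels, n_frames)

# Usage: one second of a synthetic 440 Hz tone standing in for speech.
t = np.arange(16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2.0 * np.pi * 440.0 * t))
```

The resulting 2D array (mel bands by time frames) can be treated as a single-channel image and fed to a 2D ConvNet, whereas a 1D ConvNet would consume the raw waveform directly.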

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Dialogue | VoxForge European | Accuracy (%) | 96.3 | 2D ConvNet (MixUp=YES) |
| Dialogue | VoxForge European | Accuracy (%) | 96 | 2D ConvNet (MixUp=NO) |
| Dialogue | VoxForge European | Accuracy (%) | 94.7 | 2D ConvNet with Attention and GRU (MixUp=NO) |
| Dialogue | VoxForge European | Accuracy (%) | 94.4 | 1D ConvNet (MixUp=NO) |
| Dialogue | VoxForge European | Accuracy (%) | 93.7 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Dialogue | VoxForge Commonwealth | Accuracy (%) | 95.4 | 2D ConvNet (MixUp=YES) |
| Dialogue | VoxForge Commonwealth | Accuracy (%) | 95 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Dialogue | VoxForge Commonwealth | Accuracy (%) | 94.3 | 2D ConvNet (MixUp=NO) |
| Dialogue | VoxForge Commonwealth | Accuracy (%) | 93.7 | 1D ConvNet (MixUp=NO) |
| Keyword Spotting | VoxForge | Accuracy (%) | 93.7 | 1D-ConvNet |
| Keyword Spotting | VoxForge | Accuracy (%) | 95.4 | 2D-ConvNet |
| Spoken Language Understanding | VoxForge European | Accuracy (%) | 96.3 | 2D ConvNet (MixUp=YES) |
| Spoken Language Understanding | VoxForge European | Accuracy (%) | 96 | 2D ConvNet (MixUp=NO) |
| Spoken Language Understanding | VoxForge European | Accuracy (%) | 94.7 | 2D ConvNet with Attention and GRU (MixUp=NO) |
| Spoken Language Understanding | VoxForge European | Accuracy (%) | 94.4 | 1D ConvNet (MixUp=NO) |
| Spoken Language Understanding | VoxForge European | Accuracy (%) | 93.7 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Spoken Language Understanding | VoxForge Commonwealth | Accuracy (%) | 95.4 | 2D ConvNet (MixUp=YES) |
| Spoken Language Understanding | VoxForge Commonwealth | Accuracy (%) | 95 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Spoken Language Understanding | VoxForge Commonwealth | Accuracy (%) | 94.3 | 2D ConvNet (MixUp=NO) |
| Spoken Language Understanding | VoxForge Commonwealth | Accuracy (%) | 93.7 | 1D ConvNet (MixUp=NO) |
| Dialogue Understanding | VoxForge European | Accuracy (%) | 96.3 | 2D ConvNet (MixUp=YES) |
| Dialogue Understanding | VoxForge European | Accuracy (%) | 96 | 2D ConvNet (MixUp=NO) |
| Dialogue Understanding | VoxForge European | Accuracy (%) | 94.7 | 2D ConvNet with Attention and GRU (MixUp=NO) |
| Dialogue Understanding | VoxForge European | Accuracy (%) | 94.4 | 1D ConvNet (MixUp=NO) |
| Dialogue Understanding | VoxForge European | Accuracy (%) | 93.7 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Dialogue Understanding | VoxForge Commonwealth | Accuracy (%) | 95.4 | 2D ConvNet (MixUp=YES) |
| Dialogue Understanding | VoxForge Commonwealth | Accuracy (%) | 95 | 2D ConvNet with Attention and GRU (MixUp=YES) |
| Dialogue Understanding | VoxForge Commonwealth | Accuracy (%) | 94.3 | 2D ConvNet (MixUp=NO) |
| Dialogue Understanding | VoxForge Commonwealth | Accuracy (%) | 93.7 | 1D ConvNet (MixUp=NO) |
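The results distinguish models trained with and without MixUp. For readers unfamiliar with the technique, here is a minimal NumPy sketch of standard mixup augmentation, which blends pairs of training examples and their one-hot labels with a weight drawn from a Beta distribution. The function name `mixup_batch`, the value `alpha=0.2`, and the batch shapes are assumptions for illustration; the source does not state how the paper configured MixUp.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    # alpha is a hypothetical choice; it controls how strongly pairs are blended.
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing weight in [0, 1]
    perm = rng.permutation(len(x))         # random partner for each example
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

# Usage: a toy batch of 8 spectrogram-shaped inputs over 4 language classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40, 97))
y = np.eye(4)[rng.integers(0, 4, size=8)]
x_mixed, y_mixed, lam = mixup_batch(x, y, rng=rng)
```

Because the labels are mixed with the same weight as the inputs, each mixed label row still sums to 1 and can be used directly with a cross-entropy loss.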
