TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Learning Efficient Representations for Keyword Spotting wi...

Learning Efficient Representations for Keyword Spotting with Triplet Loss

Roman Vygon, Nikolay Mikhaylovskiy

2021-01-12SPECOM 2021Speech RecognitionKeyword SpottingRepresentation Learningspeech-recognitionClassification
PaperPDFCode(official)

Abstract

In the past few years, triplet loss-based metric embeddings have become a de-facto standard for several important computer vision problems, most no-tably, person reidentification. On the other hand, in the area of speech recognition the metric embeddings generated by the triplet loss are rarely used even for classification problems. We fill this gap showing that a combination of two representation learning techniques: a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss significantly (by 26% to 38%) improves the classification accuracy for convolutional networks on a LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic similarity based triplet mining approach. We also improve the current best published SOTA for Google Speech Commands dataset V1 10+2 -class classification by about 34%, achieving 98.55% accuracy, V2 10+2-class classification by about 20%, achieving 98.37% accuracy, and V2 35-class classification by over 50%, achieving 97.0% accuracy.

Results

TaskDatasetMetricValueModel
Keyword SpottingGoogle Speech CommandsGoogle Speech Commands V1 1298.56TripletLoss-res15
Keyword SpottingGoogle Speech CommandsGoogle Speech Commands V2 1298.37TripletLoss-res15
Keyword SpottingGoogle Speech CommandsGoogle Speech Commands V2 3597TripletLoss-res15

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech2025-07-17Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17Boosting Team Modeling through Tempo-Relational Representation Learning2025-07-17Adversarial attacks to image classification systems using evolutionary algorithms2025-07-17Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?2025-07-16