A reproduction of Apple's bi-directional LSTM models for language identification in short strings
Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent
Abstract
Language identification is the task of identifying the language a document is written in. For applications like automatic spell-checker selection, language identification must work on very short strings, such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes stem from confusion between related languages.
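Apple's blog post describes the architecture only at a high level: a character-level bi-directional LSTM that classifies a short string into one of a set of languages. The sketch below is a minimal illustrative PyTorch implementation of that kind of model; all hyperparameters, the vocabulary scheme, and the class names are assumptions for illustration, not Apple's actual values.

```python
import torch
import torch.nn as nn


class BiLSTMLanguageID(nn.Module):
    """Character-level bi-LSTM language identifier (illustrative sketch,
    not Apple's actual architecture or hyperparameters)."""

    def __init__(self, n_chars=256, n_languages=100, embed_dim=150, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_languages)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character codes
        x = self.embed(char_ids)
        _, (h_n, _) = self.lstm(x)
        # Concatenate the final forward and backward hidden states,
        # giving both reading directions a say in the prediction.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(h)  # (batch, n_languages) unnormalized logits


model = BiLSTMLanguageID()
# Encode a short string as byte-range character ids (hypothetical scheme).
ids = torch.tensor([[min(ord(c), 255) for c in "hello"]])
logits = model(ids)
predicted_language = logits.argmax(dim=-1)
```

Operating on characters rather than words is what makes this kind of model usable on text-message-length inputs, where word-level features are too sparse.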
Results
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Identification | Universal Dependencies | Accuracy | 86.93 | Apple bi-LSTM |
| Language Identification | OpenSubtitles | Accuracy | 91.37 | Apple bi-LSTM |