A reproduction of Apple's bi-directional LSTM models for language identification in short strings
Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent
Abstract
Language identification is the task of identifying the language a document is written in. For applications like automatic spell-checker selection, language identification must work on very short strings, such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes stem from confusion between related languages.
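Apple's blog post describes the architecture only at a high level: a character-level bi-directional LSTM that classifies a short string into one of a set of languages. The sketch below is a minimal illustrative PyTorch implementation of that kind of model; all hyperparameters, the vocabulary scheme, and the class names are assumptions for illustration, not Apple's actual values.

```python
import torch
import torch.nn as nn


class BiLSTMLanguageID(nn.Module):
    """Character-level bi-LSTM language identifier (illustrative sketch,
    not Apple's actual architecture or hyperparameters)."""

    def __init__(self, n_chars=256, n_languages=100, embed_dim=150, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_languages)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character codes
        x = self.embed(char_ids)
        _, (h_n, _) = self.lstm(x)
        # Concatenate the final forward and backward hidden states,
        # giving both reading directions a say in the prediction.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(h)  # (batch, n_languages) unnormalized logits


model = BiLSTMLanguageID()
# Encode a short string as byte-range character ids (hypothetical scheme).
ids = torch.tensor([[min(ord(c), 255) for c in "hello"]])
logits = model(ids)
predicted_language = logits.argmax(dim=-1)
```

Operating on characters rather than words is what makes this kind of model usable on text-message-length inputs, where word-level features are too sparse.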
Results
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Identification | Universal Dependencies | Accuracy | 86.93 | Apple bi-LSTM |
| Language Identification | OpenSubtitles | Accuracy | 91.37 | Apple bi-LSTM |