Discriminating Between Similar Nordic Languages

René Haas, Leon Derczynski

2020-12-11 · EACL (VarDial) 2021 · Language Identification · BIG-bench Machine Learning

Abstract

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach to automatic language identification for the Nordic languages, which existing state-of-the-art tools often miscategorise. Concretely, we focus on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.

Results

Task	Dataset	Metric	Value	Model
Language Identification	Nordic Language Identification	Accuracy	0.9711	FastText
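
To make the task concrete, here is a minimal sketch of a language identifier built on character n-grams. It is not the FastText model reported in the table and not the authors' pipeline: it is a standard-library Naive Bayes classifier with add-one smoothing, a common baseline for this kind of task. The training sentences, the label codes (`da`, `sv`, `is`) and the names `char_ngrams` and `NgramNaiveBayes` are all illustrative assumptions, and only three of the six languages are covered to keep the toy data small.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams (lengths n_min..n_max) of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing and a uniform prior."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-gram count
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        for text, label in samples:
            for gram in char_ngrams(text):
                self.counts[label][gram] += 1
                self.totals[label] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            # Sum of smoothed log-probabilities of each n-gram under this label.
            return sum(
                math.log((self.counts[label][gram] + 1) / (self.totals[label] + vocab_size))
                for gram in char_ngrams(text)
            )

        return max(self.counts, key=log_likelihood)


# Toy training data (hypothetical, far too small for real use).
TRAIN = [
    ("jeg hedder Anna og jeg kommer fra Danmark", "da"),
    ("hvad hedder du", "da"),
    ("jag heter Anna och jag kommer från Sverige", "sv"),
    ("vad heter du", "sv"),
    ("ég heiti Anna og ég er frá Íslandi", "is"),
    ("hvað heitir þú", "is"),
]

model = NgramNaiveBayes().fit(TRAIN)

if __name__ == "__main__":
    for sentence in ("jag kommer från Sverige", "jeg kommer fra Danmark", "ég er frá Íslandi"):
        print(sentence, "->", model.predict(sentence))
```

Character n-grams are a natural choice here because closely related languages share much vocabulary but differ in spelling conventions and characteristic letters, such as Swedish "och" versus Danish "og", or Icelandic "þ" and "ð". A real system would need far more data per language and an evaluation on held-out text, as the accuracy figure in the table reflects.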
