
From N-grams to Pre-trained Multilingual Models For Language Identification

Thapelo Sindane, Vukosi Marivate

2024-10-11 · Language Identification · XLM-R
Paper · PDF · Code (official)

Abstract

In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that careful selection of data size remains crucial for building frequency distributions that model each target language effectively, thereby improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual models (PLMs) -- mBERT, RemBERT, and XLM-R -- and Afri-centric multilingual models -- AfriBERTa, Afro-XLM-R, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools -- Compact Language Detector v3 (CLD v3), AfroLID, GlotLID, and OpenLID -- to highlight the importance of focused LID. We show that Serengeti is the superior model on average across all approaches, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid), trained on the NCHLT and Vuk'uzenzele corpora, which performs on par with our best-performing Afri-centric models.
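The paper's N-gram baseline builds per-language frequency distributions and ranks languages against them. As a rough illustration only (the authors' exact n-gram orders, profile sizes, and training corpora are not given here), a minimal sketch of rank-order character n-gram LID in the Cavnar-Trenkle style:

```python
# Minimal character n-gram LID sketch (Cavnar-Trenkle rank ordering).
# Hypothetical parameters; not the authors' configuration.
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield all character n-grams of text for n in [n_min, n_max]."""
    text = f" {text.strip().lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_profile(corpus, top_k=300):
    """Map the top_k most frequent n-grams of a corpus to their ranks."""
    counts = Counter(char_ngrams(corpus))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(lang_profile, doc_profile):
    """Sum of rank differences; n-grams unseen in the language profile get a maximum penalty."""
    penalty = len(lang_profile)
    return sum(abs(lang_profile.get(gram, penalty) - rank)
               for gram, rank in doc_profile.items())

def identify(text, profiles):
    """Return the language whose profile best matches the text."""
    doc_profile = build_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc_profile))

# Toy usage with two tiny "training corpora":
profiles = {
    "zul": build_profile("sawubona unjani ngiyaphila kakhulu"),
    "eng": build_profile("hello how are you I am doing well"),
}
print(identify("unjani wena", profiles))  # expected: 'zul'
```

The data-size point in the abstract maps directly onto this sketch: the quality of each `build_profile` output, and hence the language ranking, depends on how much (and how representative) text goes into it.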
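For the pre-trained side, a comparable minimal sketch, assuming the Hugging Face transformers and datasets libraries, of fine-tuning XLM-R (one of the PLMs compared in the paper) for LID framed as text classification; the label set, training examples, and hyperparameters below are placeholders, not the authors' setup:

```python
# Hedged sketch: fine-tuning XLM-R for 11-way LID as sequence classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# ISO 639-3 codes for the 11 official South African languages.
LANGS = ["afr", "eng", "nbl", "nso", "sot", "ssw",
         "tsn", "tso", "ven", "xho", "zul"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS))

# Tiny in-memory stand-in; in practice this would be a labelled LID corpus.
train = Dataset.from_dict({
    "text": ["sawubona unjani", "hello how are you"],
    "label": [LANGS.index("zul"), LANGS.index("eng")],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lid_xlmr", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
)
trainer.train()
```

Swapping the base checkpoint for an Afri-centric one is the kind of substitution the paper's comparison is about, though the exact checkpoints and training regimes it used are not stated on this page.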

Related Papers

mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks (2025-06-10)
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world? (2025-06-10)
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks (2025-06-07)
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge (2025-06-02)
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC (2025-05-30)
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training (2025-05-23)
Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology (2025-05-20)
Token Masking Improves Transformer-Based Text Classification (2025-05-16)