
From N-grams to Pre-trained Multilingual Models For Language Identification

Thapelo Sindane, Vukosi Marivate

2024-10-11 · Language Identification · XLM-R
Paper · PDF · Code (official)

Abstract

In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that careful selection of data size remains crucial for building frequency distributions that model each target language effectively, thereby improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual models (PLMs) -- mBERT, RemBERT, and XLM-R -- and Afri-centric multilingual models -- AfriBERTa, Afro-XLM-R, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools -- Compact Language Detector v3 (CLD v3), AfroLID, GlotLID, and OpenLID -- to highlight the importance of focused LID. We show that Serengeti is the superior model on average across all approaches, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid), trained on the NCHLT and Vuk'uzenzele corpora, which performs on par with our best-performing Afri-centric models.
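The paper's N-gram baseline builds per-language frequency distributions and ranks languages against them. As a rough illustration only (the authors' exact n-gram orders, profile sizes, and training corpora are not given here), a minimal sketch of rank-order character n-gram LID in the Cavnar-Trenkle style:

```python
# Minimal character n-gram LID sketch (Cavnar-Trenkle rank ordering).
# Hypothetical parameters; not the authors' configuration.
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield all character n-grams of text for n in [n_min, n_max]."""
    text = f" {text.strip().lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_profile(corpus, top_k=300):
    """Map the top_k most frequent n-grams of a corpus to their ranks."""
    counts = Counter(char_ngrams(corpus))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(lang_profile, doc_profile):
    """Sum of rank differences; n-grams unseen in the language profile get a maximum penalty."""
    penalty = len(lang_profile)
    return sum(abs(lang_profile.get(gram, penalty) - rank)
               for gram, rank in doc_profile.items())

def identify(text, profiles):
    """Return the language whose profile best matches the text."""
    doc_profile = build_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc_profile))

# Toy usage with two tiny "training corpora":
profiles = {
    "zul": build_profile("sawubona unjani ngiyaphila kakhulu"),
    "eng": build_profile("hello how are you I am doing well"),
}
print(identify("unjani wena", profiles))  # expected: 'zul'
```

The data-size point in the abstract maps directly onto this sketch: the quality of each `build_profile` output, and hence the language ranking, depends on how much (and how representative) text goes into it.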
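For the pre-trained side, a comparable minimal sketch, assuming the Hugging Face transformers and datasets libraries, of fine-tuning XLM-R (one of the PLMs compared in the paper) for LID framed as text classification; the label set, training examples, and hyperparameters below are placeholders, not the authors' setup:

```python
# Hedged sketch: fine-tuning XLM-R for 11-way LID as sequence classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# ISO 639-3 codes for the 11 official South African languages.
LANGS = ["afr", "eng", "nbl", "nso", "sot", "ssw",
         "tsn", "tso", "ven", "xho", "zul"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS))

# Tiny in-memory stand-in; in practice this would be a labelled LID corpus.
train = Dataset.from_dict({
    "text": ["sawubona unjani", "hello how are you"],
    "label": [LANGS.index("zul"), LANGS.index("eng")],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lid_xlmr", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
)
trainer.train()
```

Swapping the base checkpoint for an Afri-centric one is the kind of substitution the paper's comparison is about, though the exact checkpoints and training regimes it used are not stated on this page.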

Related Papers

mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks (2025-06-10)
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world? (2025-06-10)
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks (2025-06-07)
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge (2025-06-02)
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC (2025-05-30)
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training (2025-05-23)
Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology (2025-05-20)
Token Masking Improves Transformer-Based Text Classification (2025-05-16)