Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Does Transliteration Help Multilingual Language Modeling?

Ibraheem Muhammad Moosa, Mahmud Elahi Akhter, Ashfia Binte Habib

2022-01-29 | Sentiment Analysis, News Classification, Transliteration, Named Entity Recognition (NER), Language Modelling, Multiple Choice Question Answering (MCQA)

Abstract

Script diversity presents a challenge to Multilingual Language Models (MLLMs) by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts to a common script may improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this context. We specifically focus on the Indic languages, which have the highest script diversity in the world, and we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment on parallel sentences from the FLORES-101 dataset. We find that for parallel sentences across different languages, the transliteration-based model learns sentence representations that are more similar.
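The abstract names two tools: the Mann-Whitney U test and centered kernel alignment (CKA). The sketch below shows one way each could be computed in plain Python. It is illustrative only and is not the authors' code. The function names are hypothetical, the p-value uses a normal approximation without tie or continuity correction (an exact test is preferable for very small samples), and the CKA variant shown is the linear one, which is an assumption since the abstract does not say which kernel was used.

```python
import math


def mann_whitney_u(a, b):
    """Return (U, p) for samples a and b.

    U counts pairs (x in a, y in b) with x > y, with ties counted as 0.5.
    p is a two-sided p-value from the normal approximation
    (no tie correction, no continuity correction).
    """
    n, m = len(a), len(b)
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


def _center_columns(rows):
    """Subtract each feature's mean so every column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - mu for v, mu in zip(row, means)] for row in rows]


def _gram(rows):
    """Gram matrix of pairwise dot products between examples."""
    return [[sum(p * q for p, q in zip(r1, r2)) for r2 in rows] for r1 in rows]


def _frobenius_inner(k, l):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(x * y for rk, rl in zip(k, l) for x, y in zip(rk, rl))


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples.

    x is an n-by-p list of rows and y is an n-by-q list of rows;
    row i of x and row i of y must describe the same example.
    """
    k = _gram(_center_columns(x))
    l = _gram(_center_columns(y))
    denom = math.sqrt(_frobenius_inner(k, k) * _frobenius_inner(l, l))
    if denom == 0.0:
        raise ValueError("a representation is constant across examples")
    return _frobenius_inner(k, l) / denom
```

In a setting like the one the abstract describes, `a` and `b` would hold scores from repeated runs of two models on one task, and `x` and `y` would hold sentence embeddings of the same parallel sentences in two languages. Linear CKA is 1 for representations that differ only by an orthogonal transform or isotropic scaling, and close to 0 for unrelated ones.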

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | IndicGLUE WSTP Pa | Accuracy | 77.55 | xlmindic-base-uniscript |
| Question Answering | IndicGLUE WSTP Pa | Accuracy | 74.33 | xlmindic-base-multiscript |
| Sentiment Analysis | IITP Movie Reviews Sentiment | Accuracy | 66.34 | xlmindic-base-uniscript |
| Sentiment Analysis | IITP Movie Reviews Sentiment | Accuracy | 65.91 | xlmindic-base-multiscript |
| Sentiment Analysis | IITP Product Reviews Sentiment | Accuracy | 77.18 | xlmindic-base-uniscript |
| Sentiment Analysis | IITP Product Reviews Sentiment | Accuracy | 76.33 | xlmindic-base-multiscript |
| Cross-Lingual | BBC Hindi News Article Classification | Accuracy | 79.14 | xlmindic-base-uniscript |
| Cross-Lingual | BBC Hindi News Article Classification | Accuracy | 77.28 | xlmindic-base-multiscript |
| Cross-Lingual | Soham News Article Classification | Accuracy | 93.89 | xlmindic-base-uniscript |
| Cross-Lingual | Soham News Article Classification | Accuracy | 93.22 | xlmindic-base-multiscript |
| Cross-Lingual Document Classification | BBC Hindi News Article Classification | Accuracy | 79.14 | xlmindic-base-uniscript |
| Cross-Lingual Document Classification | BBC Hindi News Article Classification | Accuracy | 77.28 | xlmindic-base-multiscript |
| Cross-Lingual Document Classification | Soham News Article Classification | Accuracy | 93.89 | xlmindic-base-uniscript |
| Cross-Lingual Document Classification | Soham News Article Classification | Accuracy | 93.22 | xlmindic-base-multiscript |
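
To make the listed comparison easier to read, the short sketch below computes the per-dataset accuracy gap between xlmindic-base-uniscript and xlmindic-base-multiscript, using only the values shown in the results above. The dictionary layout and variable names are illustrative assumptions. Each dataset appears once here, although the listing repeats the two news datasets under two task labels. A gap between single reported values does not by itself establish statistical significance.

```python
# Accuracy values copied from the results listing:
# dataset -> (uniscript accuracy, multiscript accuracy)
results = {
    "IndicGLUE WSTP Pa": (77.55, 74.33),
    "IITP Movie Reviews Sentiment": (66.34, 65.91),
    "IITP Product Reviews Sentiment": (77.18, 76.33),
    "BBC Hindi News Article Classification": (79.14, 77.28),
    "Soham News Article Classification": (93.89, 93.22),
}

# Gap in accuracy points (uniscript minus multiscript), rounded to 2 places.
deltas = {name: round(uni - multi, 2) for name, (uni, multi) in results.items()}

# Print datasets from the largest gap to the smallest.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.2f}")

mean_delta = sum(deltas.values()) / len(deltas)
print(f"mean gap: {mean_delta:+.2f}")
```

On these five datasets the uniscript model is ahead in every row, with the largest gap on IndicGLUE WSTP Pa.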
