Does Transliteration Help Multilingual Language Modeling?

Ibraheem Muhammad Moosa, Mahmud Elahi Akhter, Ashfia Binte Habib

2022-01-29

Sentiment Analysis, News Classification, Transliteration, Named Entity Recognition (NER), Language Modelling, Multiple Choice Question Answering (MCQA)


Abstract

Script diversity presents a challenge to Multilingual Language Models (MLLM) by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts to a common script may improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this context. We specifically focus on the Indic languages, which have the highest script diversity in the world, and we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant or not. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment on parallel sentences from the FLORES-101 dataset. We find that for parallel sentences across different languages, the transliteration-based model learns sentence representations that are more similar.
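The representation-similarity measurement mentioned in the abstract can be illustrated with a minimal sketch of linear centered kernel alignment (CKA). The abstract does not say which CKA variant, layer, or pooling the authors use, so the linear variant here is an assumption, and the random arrays are placeholders standing in for sentence embeddings of parallel sentences. The function name `linear_cka` is hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    Each row is one sentence; row i of x and row i of y are assumed to be
    representations of the same (parallel) sentence.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Center every feature (column) so the implied Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Placeholder data (NOT from the paper): 100 "sentences", 16-dim embeddings.
rng = np.random.default_rng(0)
reps_a = rng.normal(size=(100, 16))
# An orthogonal rotation of reps_a: linear CKA should treat it as identical.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
reps_b = reps_a @ rotation
# Independent random embeddings: similarity should be well below 1.
reps_c = rng.normal(size=(100, 16))

print("rotated copy:", linear_cka(reps_a, reps_b))
print("independent: ", linear_cka(reps_a, reps_c))
```

In a setting like the one the abstract describes, `reps_a` and `reps_b` would instead hold a model's sentence representations for the same sentences in two different languages, and a higher score would indicate more similar cross-lingual representations.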

Results

Task	Dataset	Metric	Value	Model
Question Answering	IndicGLUE WSTP Pa	Accuracy	77.55	xlmindic-base-uniscript
Question Answering	IndicGLUE WSTP Pa	Accuracy	74.33	xlmindic-base-multiscript
Sentiment Analysis	IITP Movie Reviews Sentiment	Accuracy	66.34	xlmindic-base-uniscript
Sentiment Analysis	IITP Movie Reviews Sentiment	Accuracy	65.91	xlmindic-base-multiscript
Sentiment Analysis	IITP Product Reviews Sentiment	Accuracy	77.18	xlmindic-base-uniscript
Sentiment Analysis	IITP Product Reviews Sentiment	Accuracy	76.33	xlmindic-base-multiscript
Cross-Lingual Document Classification	BBC Hindi News Article Classification	Accuracy	79.14	xlmindic-base-uniscript
Cross-Lingual Document Classification	BBC Hindi News Article Classification	Accuracy	77.28	xlmindic-base-multiscript
Cross-Lingual Document Classification	Soham News Article Classification	Accuracy	93.89	xlmindic-base-uniscript
Cross-Lingual Document Classification	Soham News Article Classification	Accuracy	93.22	xlmindic-base-multiscript
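To make the comparison in the table easier to read, the short sketch below computes the accuracy difference between the uniscript and multiscript models for each dataset. The accuracy values are copied from the table; the variable names are illustrative. These are single reported numbers, so the differences alone say nothing about statistical significance, which the abstract addresses with the Mann-Whitney U test.

```python
# Accuracy values copied from the results table: (uniscript, multiscript).
results = {
    "IndicGLUE WSTP Pa": (77.55, 74.33),
    "IITP Movie Reviews Sentiment": (66.34, 65.91),
    "IITP Product Reviews Sentiment": (77.18, 76.33),
    "BBC Hindi News Article Classification": (79.14, 77.28),
    "Soham News Article Classification": (93.89, 93.22),
}

# Difference in accuracy points: positive means the uniscript model scores higher.
deltas = {name: uni - multi for name, (uni, multi) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")

mean_delta = sum(deltas.values()) / len(deltas)
print(f"Mean difference: {mean_delta:+.2f}")
```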
