Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization
Bashar Al-Rfooh, Gheith Abandah, Rami Al-Rfou
Abstract
Most previous work on learning diacritization of the Arabic language relied on training models from scratch. In this paper, we investigate how to leverage pre-trained language models to learn diacritization. We finetune token-free pre-trained multilingual models (ByT5) to predict and insert missing diacritics in Arabic text, a complex task that requires understanding the sentence semantics and the morphological structure of the tokens. We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering, reducing the WER by 40%. We release our finetuned models for the benefit of researchers in the community.
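To make the setup concrete, the sketch below shows how a byte-level ByT5 model can be finetuned for diacritic restoration as a sequence-to-sequence task using Hugging Face Transformers. This is a minimal illustration under our own assumptions, not the authors' released code: the checkpoint name `google/byt5-small` is a public ByT5 model, and the Arabic training pair is a placeholder example rather than data from the paper.

```python
# Minimal sketch (not the paper's released code) of finetuning ByT5 for
# Arabic diacritization. The task is framed as seq2seq text restoration:
# input = undiacritized text, target = the same text with diacritics.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# Placeholder training pair ("The boy went to school"), for illustration only.
source = "ذهب الولد الى المدرسة"              # undiacritized input
target = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"   # diacritized reference

# ByT5 is token-free: the tokenizer maps raw UTF-8 bytes straight to ids,
# so no Arabic-specific vocabulary or feature engineering is required.
inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One training step; a real run would loop over a diacritized corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()

# Inference: generate the diacritized byte sequence for new text.
model.eval()
out = model.generate(inputs.input_ids, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because the model operates directly on bytes, each Arabic character and each combining diacritic mark is handled at the UTF-8 level, which is what lets the pre-trained multilingual checkpoint transfer to this task without any task-specific tokenization.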