CATT: Character-based Arabic Tashkeel Transformer

Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam

2024-07-03 | Machine Translation, Text to Speech, Arabic Text Diacritization

Abstract

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art results in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community (https://github.com/abjadai/catt).
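The abstract reports results as Diacritic Error Rate (DER), and the results on this page also list Word Error Rate (WER). The sketch below shows one simplified way such metrics can be computed. It is not the authors' evaluation script. It assumes that the reference and the prediction share the same undiacritized text, that every non-whitespace character is scored, and that diacritics are the Unicode marks U+064B to U+0652. The function names `split_diacritics`, `der`, and `wer` are hypothetical.

```python
# Simplified DER/WER sketch (assumption: not the paper's official scoring).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # fathatan .. sukun


def split_diacritics(text):
    """Return a list of (base_char, marks) pairs, with marks in sorted order."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, "".join(sorted(marks + ch)))
        else:
            pairs.append((ch, ""))
    return pairs


def der(reference, prediction):
    """Percentage of scored characters whose diacritics differ."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction differ in base text")
    scored = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if not b.isspace()]
    if not scored:
        return 0.0
    errors = sum(r != h for r, h in scored)
    return 100.0 * errors / len(scored)


def wer(reference, prediction):
    """Percentage of words containing at least one diacritic error."""
    ref_words, hyp_words = reference.split(), prediction.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and prediction differ in word count")
    if not ref_words:
        return 0.0
    wrong = sum(der(r, h) > 0 for r, h in zip(ref_words, hyp_words))
    return 100.0 * wrong / len(ref_words)


# Three-letter word with a fatha on every letter vs. a kasra on the second.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
prediction = "\u0643\u064E\u062A\u0650\u0628\u064E"
print(der(reference, prediction), wer(reference, prediction))
```

Published DER variants differ in details such as whether case endings or undiacritized characters are counted, so numbers from this sketch are not directly comparable to the reported ones.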

Results

Task: Arabic Text Diacritization. Dataset: CATT.

| Model | DER (%) | WER (%) |
|---|---|---|
| CATT ED | 8.624 | 34.191 |
| CATT EO | 8.762 | 35.597 |
| GPT-4 | 9.515 | 38.311 |
| CBHG | 10.808 | 42.68 |
| Command R+ | 13.169 | 48.518 |
| Shakkala | 13.494 | 50.387 |
| Sakhr | 13.841 | 56.661 |
| Alkhalil | 14.232 | 53.413 |
| Multilevel Diacritizer | 16.482 | 60.844 |
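
The "relative DER" quoted in the abstract can be read as a relative error reduction against a baseline. As a small worked example, the snippet below applies that formula to the CATT ED and GPT-4 DER values listed above. The helper name `relative_reduction` is hypothetical, and the formula is the usual relative-reduction definition, which is an assumption about how the authors computed their figure.

```python
def relative_reduction(baseline, ours):
    """Relative error reduction of `ours` over `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


# DER (%) values taken from the results listed on this page.
catt_ed_der = 8.624
gpt4_der = 9.515

print(round(relative_reduction(gpt4_der, catt_ed_der), 2))  # 9.36
```

The result agrees with the 9.36% relative DER over GPT-4-turbo stated in the abstract.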
