
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Diacritics Restoration using BERT with Analysis on Czech language

Jakub Náplava, Milan Straka, Jana Straková

2021-05-24

Tasks: Turkish Text Diacritization, Hungarian Text Diacritization, Czech Text Diacritization, Slovak Text Diacritization, Irish Text Diacritization, Polish Text Diacritization, Vietnamese Text Diacritization, Croatian Text Diacritization, French Text Diacritization, Latvian Text Diacritization, Spanish Text Diacritization, Romanian Text Diacritization

Abstract

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%), or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
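To make the task concrete: a diacritics restoration system receives text with the diacritical marks removed and must recover the original. The following is a minimal sketch of how such undiacritized input could be produced, assuming Unicode canonical decomposition is an acceptable way to strip marks. The function name `strip_diacritics` is hypothetical, and this is not necessarily the preprocessing used by the authors. Letters that have no canonical decomposition (for example Polish "ł" or Croatian "đ") pass through unchanged and would need an explicit mapping.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Illustrative only: characters without a canonical decomposition,
    such as "ł" or "đ", are left as they are.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Příliš žluťoučký kůň"))  # Prilis zlutoucky kun
```

A restoration model is then trained to map the stripped string back to the original one.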

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Czech Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.22 | BERT |
| Vietnamese Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 98.53 | BERT |
| Romanian Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 98.64 | BERT |
| Slovak Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.32 | BERT |
| Latvian Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 98.63 | BERT |
| Polish Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.66 | BERT |
| Irish Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 98.88 | BERT |
| Hungarian Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.41 | BERT |
| French Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.71 | BERT |
| Turkish Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 98.95 | BERT |
| Spanish Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.62 | BERT |
| Croatian Text Diacritization | Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | Alpha-Word accuracy | 99.73 | BERT |
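The metric reported in these results is alpha-word accuracy. As a rough illustration, the sketch below assumes it is the percentage of correctly restored words among those containing at least one alphabetic character, which is how the metric is commonly described. The whitespace tokenization and the function name `alpha_word_accuracy` are assumptions made for this example, not the official evaluation script.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Percentage of exactly matching words, counting only gold words
    that contain at least one alphabetic character.

    Assumes both strings split into the same number of whitespace tokens.
    """
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted must have the same number of words")

    # Keep only positions whose gold word has an alphabetic character,
    # so punctuation and pure numbers do not inflate the score.
    pairs = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not pairs:
        raise ValueError("no alphabetic words to evaluate")

    correct = sum(g == p for g, p in pairs)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    gold = "Příliš žluťoučký kůň , 2021"
    pred = "Příliš zlutoucky kůň , 2021"
    # Three alphabetic words, two restored correctly.
    print(round(alpha_word_accuracy(gold, pred), 2))  # 66.67
```

Under this definition the comma and the year are ignored, so only the three alphabetic words count toward the score.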

Related Papers

Diacritics Restoration Using Neural Networks (2018-05-01)