WikiNews Dataset

WikiNews Arabic Diacritization Benchmark Dataset

Texts | Introduced 2016-10-29

The WikiNews Arabic Diacritization dataset is a test set of 70 WikiNews articles, most of them from 2013 and 2014, covering seven themes: politics, economics, health, science and technology, sports, arts, and culture. The articles are evenly distributed across the themes (10 per theme). Together they contain 18,300 words in around 400 sentences, where each line is counted as a sentence.
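Because each line counts as one sentence, the sentence and word totals can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, words are assumed to be whitespace-separated, and the diacritic range (U+064B to U+0652) covers the common Arabic marks but may not match the exact set used by the dataset's authors. It also shows how undiacritized input for a diacritization system could be derived from the gold text.

```python
import re

# Common Arabic diacritics: tanween, short vowels, shadda, sukun (U+064B..U+0652).
# Assumption: this range is used for illustration; the dataset may define its own set.
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, leaving the bare letters."""
    return DIACRITICS.sub("", text)


def corpus_stats(lines):
    """Return (sentence_count, word_count), treating each non-empty line as a sentence."""
    sentences = [ln.strip() for ln in lines if ln.strip()]
    n_words = sum(len(s.split()) for s in sentences)  # whitespace tokenization (assumed)
    return len(sentences), n_words
```

To apply this to the dataset, pass the lines of the test file (for example, from `open(path, encoding="utf-8")`, where `path` is wherever the file is stored locally) to `corpus_stats`. The word count depends on tokenization, so it may differ slightly from the reported 18,300.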