
WMT 2021 Ge'ez-Amharic is a parallel Ge'ez-Amharic dataset prepared for the neural machine translation (NMT) tasks of the 6th Workshop on NLP at Debre Berhan University, Ethiopia. The corpus was collected from:

The Ethiopian Orthodox Church old Bible (from ethiopianorthodox.org), the Anaphora, praise of St. Virgin Mary, praise of Lord Jesus, and other Church books.
Ge'ez teaching books.
Websites and other internet sources, such as www.geez.org and www.debelo.org.

The dataset has about 15,454 parallel Ge'ez and Amharic sentences for training, 1,001 parallel sentences for testing, and 1,001 parallel sentences for validation.
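
As a rough illustration of how the splits relate, the sketch below computes the total corpus size and each split's share from the counts stated above. The variable names are hypothetical, and the training count is treated as exact even though the description gives it as approximate.

```python
# Split sizes as reported in the dataset description
# (the training count is approximate in the source).
split_sizes = {"train": 15454, "test": 1001, "validation": 1001}

# Total number of parallel sentences across all splits.
total = sum(split_sizes.values())

# Fraction of the corpus that falls into each split.
split_fractions = {name: size / total for name, size in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_sizes[name]} sentences ({fraction:.1%})")
```

Under these counts the corpus totals 17,456 parallel sentences, with roughly 88.5% used for training and about 5.7% each for testing and validation.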