WikiMatrix

Texts · CC BY-SA 4.0

WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia, considering all possible language pairs. The mined data consists of:

  • 85 different languages, 1620 language pairs
  • 135M parallel sentences, of which 34M are aligned with English
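
To put the pair count in context, the short sketch below works out how many pairs 85 languages could form at most and what share the 1620 mined pairs represent. It assumes that pairs are unordered (en-fr and fr-en count once) and that English is one of the 85 languages; the variable names are illustrative only and do not come from the dataset.

```python
from math import comb

NUM_LANGUAGES = 85       # from the dataset description
NUM_MINED_PAIRS = 1620   # from the dataset description

# Upper bound on unordered language pairs (assumption: en-fr == fr-en).
max_pairs = comb(NUM_LANGUAGES, 2)

# At most one pair per non-English language can involve English
# (assumption: English is one of the 85 languages).
max_english_pairs = NUM_LANGUAGES - 1

# So at least this many mined pairs do not involve English.
min_non_english_pairs = NUM_MINED_PAIRS - max_english_pairs

# Share of the theoretically possible pairs that were actually mined.
coverage = NUM_MINED_PAIRS / max_pairs

print(f"possible pairs:          {max_pairs}")
print(f"mined pairs:             {NUM_MINED_PAIRS} ({coverage:.1%} of possible)")
print(f"pairs without English:   at least {min_non_english_pairs}")
```

Under these assumptions, the mined pairs cover a bit under half of the possible combinations, and the large majority of them do not involve English.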

Source: WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia