Nordic Language Identification

Texts | Introduced 2020-12-11

Automatic language identification is a challenging problem, and discriminating between closely related languages is especially difficult. This paper presents a machine-learning approach to automatic language identification for the Nordic languages, which existing state-of-the-art tools often miscategorize. Concretely, it focuses on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese, and Icelandic. This dataset provides the data for that task. Two variants are provided, 10K and 50K, containing 10,000 and 50,000 examples per language, respectively.

The dataset covers six closely related Nordic languages, listed here with their two-letter language codes:

  1. Danish, da
  2. Faroese, fo
  3. Icelandic, is
  4. Norwegian Bokmål, nb
  5. Norwegian Nynorsk, nn
  6. Swedish, sv
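
As a minimal sketch of how the language codes and variant sizes above might be represented in code, the following Python snippet builds a code-to-name mapping and computes the total number of examples in each variant. The names `LANGUAGES`, `EXAMPLES_PER_LANGUAGE`, `LABEL_TO_ID`, and `total_examples` are hypothetical, and the integer label ids (alphabetical by code) are an assumption for illustration, not something the dataset itself is known to define.

```python
# Language codes and names as listed in the description above.
LANGUAGES = {
    "da": "Danish",
    "fo": "Faroese",
    "is": "Icelandic",
    "nb": "Norwegian Bokmål",
    "nn": "Norwegian Nynorsk",
    "sv": "Swedish",
}

# Examples per language in each variant, taken from the description.
EXAMPLES_PER_LANGUAGE = {"10K": 10_000, "50K": 50_000}

# Hypothetical integer label ids, assigned alphabetically by code.
# The actual dataset may use a different ordering.
LABEL_TO_ID = {code: i for i, code in enumerate(sorted(LANGUAGES))}


def total_examples(variant: str) -> int:
    """Return the total example count for a variant across all six languages."""
    return EXAMPLES_PER_LANGUAGE[variant] * len(LANGUAGES)


if __name__ == "__main__":
    for variant in EXAMPLES_PER_LANGUAGE:
        print(variant, total_examples(variant))
```

Under these assumptions, the 10K variant totals 60,000 examples and the 50K variant totals 300,000, since every language contributes the same number of examples.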