IndicCorp

Texts · Introduced 2020-11-08

IndicCorp is a large collection of monolingual corpora with around 9 billion tokens covering 12 major Indian languages. It was built by discovering and scraping thousands of web sources, primarily news, magazines and books, over several months.

Languages covered: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu

Corpus Format: The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
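Because the format is one sentence per line, a downloaded file can be streamed without loading it into memory. The following is a minimal sketch, not an official loader: the function names are hypothetical, UTF-8 encoding is assumed, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the published token counts.

```python
from typing import Iterator, Tuple


def iter_sentences(path: str) -> Iterator[str]:
    """Yield each non-empty line as one sentence, reading the file lazily."""
    # Assumption: the corpus file is UTF-8 encoded plain text.
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (number of sentences, number of whitespace-separated tokens)."""
    n_sentences = 0
    n_tokens = 0
    for sentence in iter_sentences(path):
        n_sentences += 1
        # Whitespace splitting is an approximation; the corpus is untokenized.
        n_tokens += len(sentence.split())
    return n_sentences, n_tokens
```

For example, `corpus_stats("hi.txt")` would return the sentence and approximate token counts for a locally saved file, where `hi.txt` is a placeholder name rather than the actual download filename.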

Downloads

| Language | # News Articles* | Sentences | Tokens | Link |
| -------- | ----------------- | ------------- | ------------- | -------- |
| as | 0.60M | 1.39M | 32.6M | link |
| bn | 3.83M | 39.9M | 836M | link |
| en | 3.49M | 54.3M | 1.22B | link |
| gu | 2.63M | 41.1M | 719M | link |
| hi | 4.95M | 63.1M | 1.86B | link |
| kn | 3.76M | 53.3M | 713M | link |
| ml | 4.75M | 50.2M | 721M | link |
| mr | 2.31M | 34.0M | 551M | link |
| or | 0.69M | 6.94M | 107M | link |
| pa | 2.64M | 29.2M | 773M | link |
| ta | 4.41M | 31.5M | 582M | link |
| te | 3.98M | 47.9M | 674M | link |

* Excluding articles obtained from the OSCAR corpus
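
As a quick consistency check, the per-language token counts can be summed and compared with the stated total of around 9 billion. The sketch below copies the figures from the Downloads table (in millions); since those figures are rounded, the result is approximate. The variable names are illustrative only.

```python
# Token counts in millions, copied from the Downloads table (rounded values).
TOKENS_M = {
    "as": 32.6, "bn": 836, "en": 1220, "gu": 719,
    "hi": 1860, "kn": 713, "ml": 721, "mr": 551,
    "or": 107, "pa": 773, "ta": 582, "te": 674,
}

total_m = sum(TOKENS_M.values())
total_b = total_m / 1000  # convert millions to billions

print(f"{len(TOKENS_M)} languages, about {total_b:.2f}B tokens")
```

The sum comes to roughly 8.79 billion tokens across the 12 languages, in line with the "around 9 billion" figure.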