SMC Text Corpus

Texts | Creative Commons Attribution-ShareAlike | Introduced 2019-03-02

Contents (as of March 4, 2019)

The text corpus contains running text from various freely licensed sources.

  • The whole content of Malayalam Wikipedia extracted on January 1, 2019
  • News and articles from various sources; each source is named in the respective file
  • 251 MB
  • 8,60,159 lines
  • 98,15,533 words
  • 10,11,11,885 characters
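
Counts like the ones listed above can be recomputed for any plain-text file with a few lines of Python. The following is only a minimal sketch, not the method used to produce the published figures: it assumes words are separated by whitespace and counts characters as Unicode code points (including newlines). The names `corpus_stats` and `CorpusStats` and the file name in the usage comment are hypothetical.

```python
from typing import Iterable, NamedTuple


class CorpusStats(NamedTuple):
    lines: int
    words: int
    characters: int


def corpus_stats(lines: Iterable[str]) -> CorpusStats:
    """Count lines, whitespace-separated words, and Unicode code points."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        # Assumption: a "word" is any run of non-whitespace characters.
        n_words += len(line.split())
        # len() counts code points, including a trailing newline if present.
        n_chars += len(line)
    return CorpusStats(n_lines, n_words, n_chars)


# Usage with a hypothetical file name:
# with open("corpus.txt", encoding="utf-8") as f:
#     print(corpus_stats(f))
```

Because Malayalam letters are often written as several code points (base consonant, virama, vowel sign), a code-point count will generally be higher than the number of visible glyphs, and the result may differ from the figures above depending on how they were originally counted.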

The word corpus contains