SMC Text Corpus
Texts | Creative Commons Attribution-ShareAlike | Introduced 2019-03-02
Contents (as of March 4, 2019)
The text corpus contains running text from various freely licensed sources.
- The whole content of Malayalam Wikipedia extracted on January 1, 2019
- News and articles from various sources, with the source mentioned in the respective files:
- 251 MB
- 8,60,159 lines
- 98,15,533 words
- 10,11,11,885 characters
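The figures above use Indian digit grouping (the last three digits, then groups of two), so 8,60,159 is 860159. The source does not state how the lines, words, and characters were counted, so the following is only a minimal sketch of one plausible way to produce such figures. It assumes that words are whitespace-separated tokens and that characters are Unicode code points excluding line breaks; `corpus_stats` and `indian_grouping` are hypothetical helper names, not part of the corpus tooling.

```python
def corpus_stats(texts):
    """Count lines, whitespace-separated words, and characters.

    Assumption: characters are Unicode code points, not bytes,
    and line-break characters are not counted.
    """
    lines = words = chars = 0
    for text in texts:
        for line in text.splitlines():
            lines += 1
            words += len(line.split())
            chars += len(line)
    return {"lines": lines, "words": words, "characters": chars}


def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    # Peel off two digits at a time from the right of the remaining head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    stats = corpus_stats(["first line\nsecond line", "third"])
    for name, value in stats.items():
        print(name, indian_grouping(value))
```

Because Malayalam code points take three bytes each in UTF-8, a file's size in bytes will be noticeably larger than its character count under this definition.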
The word corpus contains
- Classified lexicon prepared for the Malayalam Morphology Analyser project
- Unique words extracted from Malayalam Wikipedia, Wiktionary, etc.
- 14,27,392 words
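How the unique words were extracted is not described here. As a rough illustration only, the sketch below collects distinct Malayalam tokens from running text. It assumes a word is a maximal run of code points from the Malayalam Unicode block (U+0D00 to U+0D7F) plus the zero-width joiner and non-joiner; `unique_words` is a hypothetical helper name. An explicit range is used because Python's `\w` does not match Malayalam vowel signs, which would split words in the middle.

```python
import re

# Assumption: a "word" is a run of Malayalam-block code points,
# optionally containing ZWNJ (U+200C) or ZWJ (U+200D).
MALAYALAM_WORD = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def unique_words(lines):
    """Return the sorted list of distinct Malayalam tokens in the lines."""
    seen = set()
    for line in lines:
        seen.update(MALAYALAM_WORD.findall(line))
    return sorted(seen)


if __name__ == "__main__":
    sample = ["മലയാളം wiki മലയാളം", "കേരളം, മലയാളം."]
    for word in unique_words(sample):
        print(word)
```

Latin-script tokens and punctuation are dropped by this pattern, which may or may not match what the actual word list contains.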