Sangraha

CC BY 4.0 | Introduced 2024-03-11

Sangraha is the largest high-quality, cleaned pretraining dataset for Indic languages, containing a total of 251B tokens across 22 languages. The data is extracted from curated sources, existing multilingual corpora, and large-scale translations.