CulturaX

Introduced 2023-09-17

We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for large language model (LLM) development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. We employ MinHash at document level to achieve fuzzy deduplication for the datasets in different languages. Our data cleaning framework includes diverse criteria and threshold selections, guided by extensive data samples, ensuring comprehensive noise filtering in various aspects. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs.

Our dataset combines the most recent iteration of mC4 (version 3.1.0) [1] with all accessible OSCAR corpora up to the present year, including 20.19, 21.09, 22.01, and 23.01 [2]. After deep cleaning and deduplication, CulturaX involves 16TB data in the parquet format (expanding to 27TB when unpacked). More than a half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios.

To obtain perplexity scores for data cleaning, we train a SentencePiece tokenizer and 5-gram Kneser-Ney language models as provided in the KenLM library [3] using the 20230501 dumps of Wikipedia. Our KenLM models are also released in HuggingFace: https://huggingface.co/uonlp/kenlm.

Details for the dataset can be found in our technical paper: https://arxiv.org/abs/2309.09400 and https://huggingface.co/datasets/uonlp/CulturaX