CCI 3.0-HQ
Texts
Introduced 2024-10-24
To address the scarcity of high-quality safety datasets in Chinese, we open-sourced the CCI (Chinese Corpora Internet) dataset on November 29, 2023. Building on this foundation, we continued to expand the data sources, adopted stricter data cleaning methods, and completed the construction of the CCI 3.0 dataset. This dataset is composed of high-quality, reliable Internet data from trusted sources. After further, stricter filtering, the released CCI 3.0-HQ corpus is about 500 GB in size.