FuLG

TextsODC-BYIntroduced 2024-07-18

FuLG is a comprehensive Romanian language corpus comprising 150 billion tokens, carefully extracted from Common Crawl. This extensive dataset is the result of rigorous filtering and deduplication processes applied to 95 Common Crawl snapshots. The compressed dataset has 289 GB.