OnlySports Dataset
Texts
Introduced 2024-08-30
The OnlySports Dataset is a collection of sports-related text comprising approximately 600 billion tokens. The corpus was curated from the FineWeb dataset, a cleaned subset of CommonCrawl spanning 2013 to the present. Dataset creation involved a two-step process:
- URL filtering using sports-related keywords to identify potentially relevant content.
- Application of a custom-built sports text classifier to accurately extract sports-specific documents.
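
The two steps above can be pictured as a simple filter chain. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the keyword list, the `url_matches` and `filter_sports` helpers, the `threshold` value, and especially `toy_classifier` (a word-counting stand-in for the custom-built sports text classifier) are all hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator
from urllib.parse import urlparse

# Hypothetical keyword list; the actual keywords are not given in the description.
SPORTS_KEYWORDS = ("football", "basketball", "nba", "tennis", "soccer", "sport")


def url_matches(url: str, keywords: Iterable[str] = SPORTS_KEYWORDS) -> bool:
    """Step 1: keep a document if its host or path contains a sports keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in keywords)


def toy_classifier(text: str) -> float:
    """Placeholder for step 2: fraction of words that are common sports terms.

    This is NOT the dataset's classifier, only a stand-in so the sketch runs.
    """
    terms = {"match", "score", "team", "player", "league", "coach"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.strip(".,!?") in terms)
    return hits / len(words)


def filter_sports(
    docs: Iterable[Dict[str, str]],
    classify: Callable[[str], float] = toy_classifier,
    threshold: float = 0.1,  # assumed cutoff, for illustration only
) -> Iterator[Dict[str, str]]:
    """Apply the cheap URL filter first, then the classifier on the survivors."""
    for doc in docs:
        if not url_matches(doc["url"]):
            continue
        if classify(doc["text"]) >= threshold:
            yield doc


docs = [
    {"url": "https://example.com/nba/finals-recap",
     "text": "The team won the match after the coach changed the lineup."},
    {"url": "https://example.com/cooking/pasta",
     "text": "The team score match player"},
    {"url": "https://example.com/football/shop",
     "text": "Buy cheap shoes online with free shipping today."},
]
kept = list(filter_sports(docs))
```

Running the URL filter first means the more expensive classifier only sees documents that already look relevant. In this toy input, the second document is dropped by the URL step despite its sports-like text, and the third passes the URL step but is rejected by the classifier stand-in.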
At 1.2 TB, the OnlySports Dataset is the largest known sports-domain dataset to date. It covers a wide range of content, including news articles, blogs, match reports, interviews, and tutorials, providing a resource for training domain-specific language models in the sports field.