The OnlySports Dataset is a collection of sports-related text data comprising approximately 600 billion tokens. The corpus was curated from the FineWeb dataset, a cleaned subset of CommonCrawl spanning 2013 to the present. The dataset was created in a two-step process:

1. URL filtering using sports-related keywords to identify potentially relevant content.
2. Application of a custom-built sports text classifier to accurately extract sports-specific documents.
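
The two steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the keyword list, the `toy_classifier` scoring function, and the threshold are all hypothetical stand-ins, and the actual dataset uses a custom-built classifier whose details are not given here. The sketch assumes each document is a dict with `url` and `text` fields.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the real list used for OnlySports is not given here.
SPORTS_KEYWORDS = ("football", "basketball", "tennis", "nba", "nfl", "soccer")


def url_matches(url, keywords=SPORTS_KEYWORDS):
    """Step 1: keep a URL if its host or path contains any sports keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in keywords)


def toy_classifier(text):
    """Stand-in for the sports text classifier (assumption, not the real model).

    Returns the fraction of words that are in a small set of sports terms.
    """
    sports_terms = {"match", "score", "team", "season", "coach", "league"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in sports_terms) / len(words)


def filter_documents(docs, classifier=toy_classifier, threshold=0.1):
    """Apply the URL filter first, then the classifier, and return the survivors."""
    kept = []
    for doc in docs:
        if not url_matches(doc["url"]):
            continue  # rejected by step 1
        if classifier(doc["text"]) >= threshold:
            kept.append(doc)  # accepted by step 2
    return kept


docs = [
    {"url": "https://example.com/nba/recap",
     "text": "The team won the match with a late score."},
    {"url": "https://example.com/nba/shop",
     "text": "Buy discounted shoes and jackets online today."},
    {"url": "https://example.com/cooking/pasta",
     "text": "The team at our kitchen shares a season recipe."},
]
kept = filter_documents(docs)
```

Running the cheap URL filter first means the more expensive classifier only sees candidate pages. The second example document shows why the second step is still needed: its URL looks sports-related, but its text does not.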

At 1.2 TB in size, the OnlySports Dataset is the largest known sports-domain dataset to date. It covers a wide range of content, including news articles, blogs, match reports, interviews, and tutorials, making it a rich resource for training domain-specific language models in the sports field.