One Billion Word Benchmark

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

Texts · Apache 2.0 · Introduced 2013-12-11

A text corpus with almost one billion words of training data for benchmarking statistical language models. The scale of approximately one billion words aims to balance the relevance of the benchmark in a world of abundant data against the ease with which researchers can evaluate their modeling approaches. Monolingual English data was obtained from the WMT11 website and prepared using a variety of best practices for machine learning dataset preparation.