
We have prepared a dataset, ParagraphOrdering, consisting of around 300,000 paragraph pairs collected from Project Gutenberg. We wrote an API that gathers and pre-processes the data into the format required by the task. Each example contains two paragraphs and a label indicating whether the second paragraph actually follows the first (true order, label 1) or the order has been reversed.
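The example format described above can be illustrated with a minimal sketch. This is not the authors' API: the function name `make_pairs`, the use of adjacent paragraphs, the 50% reversal rate, and the label value 0 for reversed pairs are all assumptions made for illustration.

```python
import random


def make_pairs(paragraphs, seed=0):
    """Build (first, second, label) examples from an ordered list of paragraphs.

    label 1: `second` follows `first` in the source text (true order).
    label 0: the pair was swapped (the value 0 is an assumption; the
             description only states that true order is labelled 1).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    examples = []
    # Pair each paragraph with the one that immediately follows it.
    for first, second in zip(paragraphs, paragraphs[1:]):
        if rng.random() < 0.5:  # assumed: keep about half in true order
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


if __name__ == "__main__":
    book = ["Paragraph one.", "Paragraph two.", "Paragraph three."]
    for first, second, label in make_pairs(book):
        print(label, "|", first, "|", second)
```

Swapping a true pair to create the negative example keeps both classes drawn from the same text, so a model cannot separate them by topic alone and has to rely on ordering cues.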

Data Statistics:

Train samples: 294,265
Test samples: 32,697
Unique paragraphs: 239,803
Average number of tokens: 160.39
Average number of sentences: 9.31