Machine Paraphrase Corpus (MPC)

Texts | Attribution 4.0 International | Introduced 2021-03-22

This dataset is used to train and evaluate models for the detection of machine-paraphrased text.

The training set consists of 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).
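As a quick sanity check on the class balance, the paragraph counts stated above can be turned into proportions. This is a minimal sketch that uses only those two numbers; the constant and function names are illustrative and not part of the dataset or any official loader.

```python
# Paragraph counts as stated in the training-set description above.
TRAIN_ORIGINAL = 98_282
TRAIN_PARAPHRASED = 102_485


def class_shares(original: int, paraphrased: int) -> dict:
    """Return the total paragraph count and the share of each class."""
    total = original + paraphrased
    return {
        "total": total,
        "original": original / total,
        "paraphrased": paraphrased / total,
    }


shares = class_shares(TRAIN_ORIGINAL, TRAIN_PARAPHRASED)
print(f"total paragraphs:  {shares['total']:,}")
print(f"original share:    {shares['original']:.3f}")
print(f"paraphrased share: {shares['paraphrased']:.3f}")
```

The two classes are close to balanced, so a classifier that always predicts "paraphrased" would reach only about 51% accuracy on the training paragraphs.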

The test set is divided into three subsets: one created from preprints of research papers on arXiv, one from graduation theses, and one from Wikipedia articles. Additionally, different machine-paraphrasing methods were used.

Test sets:

SpinBot:
    arXiv      - Original - 20,966;  Spun - 20,867
    Theses     - Original -  5,226;  Spun -  3,463
    Wikipedia  - Original - 39,241;  Spun - 40,729

SpinnerChief-4W:
    arXiv      - Original - 20,966;  Spun - 21,671
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,618

SpinnerChief-2W:
    arXiv      - Original - 20,966;  Spun - 21,719
    Theses     - Original -  2,379;  Spun -  2,941
    Wikipedia  - Original - 39,241;  Spun - 39,697
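
The per-subset sizes and class balance can be derived from the counts listed above. The following is a rough sketch with the numbers copied by hand from that listing; the dictionary layout and the names `TEST_COUNTS` and `summarize` are assumptions made for illustration, not part of the dataset's distribution format.

```python
# (original, spun) paragraph counts copied from the test-set listing above.
TEST_COUNTS = {
    "SpinBot": {
        "arXiv": (20_966, 20_867),
        "Theses": (5_226, 3_463),
        "Wikipedia": (39_241, 40_729),
    },
    "SpinnerChief-4W": {
        "arXiv": (20_966, 21_671),
        "Theses": (2_379, 2_941),
        "Wikipedia": (39_241, 39_618),
    },
    "SpinnerChief-2W": {
        "arXiv": (20_966, 21_719),
        "Theses": (2_379, 2_941),
        "Wikipedia": (39_241, 39_697),
    },
}


def summarize(counts):
    """Return (tool, source, total paragraphs, spun share) for each subset."""
    rows = []
    for tool, sources in counts.items():
        for source, (original, spun) in sources.items():
            total = original + spun
            rows.append((tool, source, total, spun / total))
    return rows


rows = summarize(TEST_COUNTS)
for tool, source, total, spun_share in rows:
    print(f"{tool:<16}{source:<10}{total:>7,}  spun share {spun_share:.3f}")
```

Most subsets are close to evenly split between original and spun paragraphs. The SpinBot theses subset is the least balanced, with about 40% spun paragraphs, which is worth keeping in mind when comparing accuracy across subsets.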