PPC

Polish Paraphrase Corpus

Introduced 2022-07-26

The Polish Paraphrase Corpus (PPC) is a dataset of 7,000 manually labeled sentence pairs in Polish. It was created to test how machine learning models perform on a challenging paraphrase identification task in which most records contain semantically overlapping parts. The dataset is divided into training, validation, and test splits, and each record is assigned to one of three categories: exact paraphrases, close paraphrases, or non-paraphrases. The corpus was built by automatically generating candidate pairs and then labeling them manually. The sentence pairs were drawn from several data sources, including Tatoeba, Polish news articles, Wikipedia, and the Polish version of the SICK dataset.
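The record structure described above (a sentence pair, one of three labels, and a split) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Label`, `PairRecord`, `label_counts`, the field names, and the split strings are hypothetical and do not reflect the corpus's actual file format or schema, and the sample sentences are invented rather than taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The three categories named in the description (identifiers are hypothetical)."""
    EXACT = "exact paraphrase"
    CLOSE = "close paraphrase"
    NON = "non-paraphrase"


@dataclass(frozen=True)
class PairRecord:
    """One labeled sentence pair; field names are assumptions, not the real schema."""
    sentence_a: str
    sentence_b: str
    label: Label
    split: str  # assumed values: "train", "validation", "test"


def label_counts(records, split):
    """Count how many records of each label fall in the given split."""
    return Counter(r.label for r in records if r.split == split)


# Invented toy records for illustration only; not drawn from the corpus.
records = [
    PairRecord("Kot śpi na kanapie.", "Kot śpi na sofie.", Label.EXACT, "train"),
    PairRecord("Kot śpi na kanapie.", "Kot leży na kanapie.", Label.CLOSE, "train"),
    PairRecord("Kot śpi na kanapie.", "Pies biegnie po parku.", Label.NON, "test"),
]

for split in ("train", "validation", "test"):
    print(split, dict(label_counts(records, split)))
```

Checking the label distribution per split in this way is a common first step before training a three-class classifier, since a skewed distribution affects which evaluation metrics are informative.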