BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately, under the guidance of their healthcare professional. Each document contains six sections: 1) what the product is and what it is used for, 2) what you need to know before you take the product, 3) product usage instructions, 4) possible side effects, 5) product storage conditions, and 6) other information.
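As a rough illustration, each leaflet can be viewed as a document with six named sections. The sketch below is a minimal, hypothetical record layout; the key names are assumptions for illustration, not BioLeaflets' published schema.

```python
# Hypothetical shape of a single BioLeaflets record; the keys below are
# illustrative assumptions, not the dataset's actual field names.
leaflet = {
    "medicine_name": "ExampleProduct",
    "sections": {
        "what_it_is_and_what_it_is_used_for": "...",
        "what_you_need_to_know_before_you_take_it": "...",
        "usage_instructions": "...",
        "possible_side_effects": "...",
        "storage_conditions": "...",
        "other_information": "...",
    },
}

# In a Data2Text setup, structured facts about the medicine would serve as the
# input and a section's text as the generation target.
```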
PMPC (Persona Match on Persona-Chat) is a dataset for Speaker Persona Detection (SPD), which aims to detect speaker personas from plain conversational text.
TREx-2p is a dataset for probing whether a pretrained LM possesses “indirect” 2-hop knowledge. It is a 2-hop variant of the T-REx dataset, built by manually examining the 2-hop links in the knowledge graph of TREx-1p and selecting eight 2-hop relation types that make sense to humans.
ComSum is a dataset of 7 million commit messages for text summarization. When documenting commits (software code changes), developers post both a message and its summary. These messages are gathered and filtered to curate a dataset for summarizing developers' work.
A medical Wiki parallel corpus for medical text simplification.
E-Manual Corpus is a corpus of 307,957 e-manuals, used for pre-training models for question answering on e-manuals.
BLANCA (Benchmarks for LANguage models on Coding Artifacts) is a collection of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post, finding related forum posts, or predicting classes related in a hierarchy from class documentation.
The ELITR ECA corpus is a multilingual corpus derived from publications of the European Court of Auditors. We use automatic translation together with Bleualign to identify parallel sentence pairs in all 506 translation directions. The result is a corpus comprising 264k document pairs and 41.9M sentence pairs.
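As a quick sanity check on the number of directions, 506 corresponds to every ordered pair of distinct languages among 23 languages (the language count of 23 is an assumption here, but it is consistent with 23 × 22 = 506):

```python
# Number of ordered translation directions for n languages: n * (n - 1).
n_languages = 23          # assumed language count; 23 * 22 = 506
directions = n_languages * (n_languages - 1)
print(directions)         # 506
```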
The dataset consists of 53,189 wikiHow articles across various categories of everyday tasks, 155,265 methods, and 772,294 steps with corresponding images.
TVRecap is a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVRecap contains stories that are authored by professional screenwriters and that feature complex interactions among multiple characters. Generating stories in TVRecap requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, by swapping the input and output, TVRecap can serve as a challenging testbed for abstractive summarization.
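As a minimal sketch of the input/output swap mentioned above, a TVRecap-style example could be repurposed for summarization as follows; the field names are hypothetical, not the dataset's actual keys.

```python
# Hypothetical example record; keys are assumptions for illustration only.
example = {
    "summary": "A brief episode summary.",
    "character_docs": ["Document describing character A.", "Document describing character B."],
    "recap": "The full, detailed episode recap authored by screenwriters.",
}

def to_story_generation(ex):
    # Story generation: brief summary + character documents -> detailed recap.
    return {"source": (ex["summary"], ex["character_docs"]), "target": ex["recap"]}

def to_summarization(ex):
    # Abstractive summarization: detailed recap -> brief summary (input and output swapped).
    return {"source": ex["recap"], "target": ex["summary"]}
```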
CI-ToD is a dataset for Consistency Identification in Task-oriented Dialogue systems.
FewGLUE_64_labeled is a new version of the FewGLUE dataset. It contains a 64-sample training set, a development set (the original SuperGLUE development set), a test set, and an unlabeled set. It is constructed to facilitate research on few-shot learning for natural language understanding tasks.
VQA-MHUG is a 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from JD.com, a mainstream Chinese E-commerce platform, containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is split into training, validation, and test sets in an 80%/10%/10% ratio.
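A minimal sketch of an 80%/10%/10% split over dialogue sessions is shown below (this is not JDDC 2.0's official partition; the session IDs and random seed are placeholders):

```python
import random

def split_sessions(session_ids, seed=0):
    """Shuffle session IDs and split them 80/10/10 into train/validation/test."""
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Placeholder session IDs, using the stated order of magnitude (~246k sessions).
train, val, test = split_sessions(range(246_000))
```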
SCIMAT is a large question-answer dataset of mathematics and science problems; such a dataset can have an impact on online education, intelligent tutoring, and automated grading.
This is a revised and extended second version of a Contextualised Polyseme Word Sense Dataset. The dataset contains two human annotated measures of word sense similarity for polysemic target words used in contexts invoking different sense interpretations. The first set contains graded similarity judgements for highlighted target words displayed in two different contexts. The second set contains co-predication acceptability judgements for sentence constructions combining the sentence pairs from the first set.
AraCovid19-SSD is a manually annotated Arabic COVID-19 sarcasm and sentiment detection dataset containing 5,162 tweets.
HowSumm is a large-scale query-focused multi-document summarization dataset. It focuses on summarizing multiple sources to create HowTo guides and is derived from wikiHow articles.
TBCOV is a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year. Several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities (e.g., mentions of persons, organizations, locations), user types, and gender information. A geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a myriad of data analysis tasks to understand real-world issues.
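As a rough illustration of the enrichment attributes described above, an enriched tweet record might look like the following; the field names and values are assumptions, not TBCOV's released schema.

```python
# Hypothetical enriched TBCOV record; keys and values are illustrative only.
enriched_tweet = {
    "tweet_id": "1234567890",
    "lang": "en",
    "sentiment": "negative",          # model-predicted sentiment label
    "named_entities": {
        "persons": ["..."],
        "organizations": ["..."],
        "locations": ["..."],
    },
    "user_type": "individual",        # e.g. individual vs. organization account
    "gender": "unknown",
    "geo": {"country": "...", "state": "...", "county": "...", "city": "..."},
}
```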
The STR-2021 dataset has 5,500 English sentence pairs manually annotated for semantic relatedness using a comparative annotation framework.