BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately, under the guidance of their healthcare professional. Each document contains six sections: 1) what the product is and what it is used for, 2) what you need to know before you take the product, 3) product usage instructions, 4) possible side effects, 5) product storage conditions, and 6) other information.
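As a rough illustration, each leaflet can be viewed as a document with six named sections. The sketch below is a minimal, hypothetical record layout; the key names are assumptions for illustration, not BioLeaflets' published schema.

```python
# Hypothetical shape of a single BioLeaflets record; the keys below are
# illustrative assumptions, not the dataset's actual field names.
leaflet = {
    "medicine_name": "ExampleProduct",
    "sections": {
        "what_it_is_and_what_it_is_used_for": "...",
        "what_you_need_to_know_before_you_take_it": "...",
        "usage_instructions": "...",
        "possible_side_effects": "...",
        "storage_conditions": "...",
        "other_information": "...",
    },
}

# In a Data2Text setup, structured facts about the medicine would serve as the
# input and a section's text as the generation target.
```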
PMPC (Persona Match on Persona-Chat) is a dataset for Speaker Persona Detection (SPD), which aims to detect speaker personas from plain conversational text.
TREx-2p is a dataset for probing whether a pretrained LM possesses “indirect” 2-hop knowledge. It is a 2-hop variant of the T-REx dataset, built by manually examining the 2-hop links in the knowledge graph of TREx-1p and selecting eight 2-hop relation types that make sense to humans.
ComSum is a dataset of 7 million commit messages for text summarization. When documenting commits (software code changes), developers post both a message and its summary. These messages are gathered and filtered to curate a dataset for summarizing developers' work.
A medical Wiki parallel corpus for medical text simplification.
E-Manual Corpus is a corpus of 307,957 e-manuals, used for pre-training models for question answering on e-manuals.
BLANCA (Benchmarks for LANguage models on Coding Artifacts) is a collection of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post, finding related forum posts, or predicting classes related in a hierarchy from class documentation.
The ELITR ECA corpus is a multilingual corpus derived from publications of the European Court of Auditors. We use automatic translation together with Bleualign to identify parallel sentence pairs in all 506 translation directions. The result is a corpus comprising 264k document pairs and 41.9M sentence pairs.
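As a quick sanity check on the number of directions, 506 corresponds to every ordered pair of distinct languages among 23 languages (the language count of 23 is an assumption here, but it is consistent with 23 × 22 = 506):

```python
# Number of ordered translation directions for n languages: n * (n - 1).
n_languages = 23          # assumed language count; 23 * 22 = 506
directions = n_languages * (n_languages - 1)
print(directions)         # 506
```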
The dataset consists of 53,189 wikiHow articles across various categories of everyday tasks, 155,265 methods, and 772,294 steps with corresponding images.
TVRecap is a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVRecap contains stories that are authored by professional screenwriters and that feature complex interactions among multiple characters. Generating stories in TVRecap requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, by swapping the input and output, TVRecap can serve as a challenging testbed for abstractive summarization.
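As a minimal sketch of the input/output swap mentioned above, a TVRecap-style example could be repurposed for summarization as follows; the field names are hypothetical, not the dataset's actual keys.

```python
# Hypothetical example record; keys are assumptions for illustration only.
example = {
    "summary": "A brief episode summary.",
    "character_docs": ["Document describing character A.", "Document describing character B."],
    "recap": "The full, detailed episode recap authored by screenwriters.",
}

def to_story_generation(ex):
    # Story generation: brief summary + character documents -> detailed recap.
    return {"source": (ex["summary"], ex["character_docs"]), "target": ex["recap"]}

def to_summarization(ex):
    # Abstractive summarization: detailed recap -> brief summary (input and output swapped).
    return {"source": ex["recap"], "target": ex["summary"]}
```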
CI-ToD is a dataset for Consistency Identification in Task-oriented Dialogue systems.
FewGLUE_64_labeled is a new version of the FewGLUE dataset. It contains a 64-sample training set, a development set (the original SuperGLUE development set), a test set, and an unlabeled set. It is constructed to facilitate research on few-shot learning for natural language understanding tasks.
VQA-MHUG is a 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from JD.com, a mainstream Chinese E-commerce platform, containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is split into training, validation, and test sets in an 80%/10%/10% ratio.
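A minimal sketch of an 80%/10%/10% split over dialogue sessions is shown below (this is not JDDC 2.0's official partition; the session IDs and random seed are placeholders):

```python
import random

def split_sessions(session_ids, seed=0):
    """Shuffle session IDs and split them 80/10/10 into train/validation/test."""
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Placeholder session IDs, using the stated order of magnitude (~246k sessions).
train, val, test = split_sessions(range(246_000))
```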
SCIMAT is a large question-answer dataset of mathematics and science problems; such a dataset can have an impact on online education, intelligent tutoring, and automated grading.
This is a revised and extended second version of a Contextualised Polyseme Word Sense Dataset. The dataset contains two human annotated measures of word sense similarity for polysemic target words used in contexts invoking different sense interpretations. The first set contains graded similarity judgements for highlighted target words displayed in two different contexts. The second set contains co-predication acceptability judgements for sentence constructions combining the sentence pairs from the first set.
AraCovid19-SSD is a manually annotated Arabic COVID-19 sarcasm and sentiment detection dataset containing 5,162 tweets.
HowSumm is a large-scale query-focused multi-document summarization dataset. It focuses on summarizing multiple sources to create HowTo guides and is derived from wikiHow articles.
TBCOV is a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year. Several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities (e.g., mentions of persons, organizations, locations), user types, and gender information. A geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a myriad of data analysis tasks to understand real-world issues.
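As a rough illustration of the enrichment attributes described above, an enriched tweet record might look like the following; the field names and values are assumptions, not TBCOV's released schema.

```python
# Hypothetical enriched TBCOV record; keys and values are illustrative only.
enriched_tweet = {
    "tweet_id": "1234567890",
    "lang": "en",
    "sentiment": "negative",          # model-predicted sentiment label
    "named_entities": {
        "persons": ["..."],
        "organizations": ["..."],
        "locations": ["..."],
    },
    "user_type": "individual",        # e.g. individual vs. organization account
    "gender": "unknown",
    "geo": {"country": "...", "state": "...", "county": "...", "city": "..."},
}
```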
The STR-2021 dataset has 5,500 English sentence pairs manually annotated for semantic relatedness using a comparative annotation framework.