The data consists of 3 task types and 4 question types, yielding 12 scenarios in total. Tasks are grouped into stories, denoted by the numbering at the start of each line.
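As a rough illustration of the line-numbered story format described above, here is a minimal parsing sketch in Python; the whitespace-separated layout and the file name are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch: group line-numbered task lines into stories.
# Assumes each line starts with an integer index that resets to 1
# at the start of a new story; "tasks.txt" is a hypothetical file name.

def read_stories(path):
    stories, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            idx, text = line.split(" ", 1)
            if int(idx) == 1 and current:  # numbering restarts -> new story
                stories.append(current)
                current = []
            current.append(text)
    if current:
        stories.append(current)
    return stories

stories = read_stories("tasks.txt")
print(len(stories), "stories")
```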
The Gutenberg Poem Dataset is used for the next-verse prediction component.
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:
The Part-Whole Relations dataset contains semantic relations between entities, covering the following subtypes:
- Component-Of
- Member-Of
- Portion-Of
- Stuff-Of
- Located-In
- Contained-In
- Phase-Of
- Participates-In
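For reference, the eight subtypes above can be represented as a simple label set; the enum below is only an illustration of that taxonomy, not an official API of the dataset.

```python
from enum import Enum

class PartWholeRelation(Enum):
    """The eight part-whole relation subtypes listed for the dataset."""
    COMPONENT_OF = "Component-Of"
    MEMBER_OF = "Member-Of"
    PORTION_OF = "Portion-Of"
    STUFF_OF = "Stuff-Of"
    LOCATED_IN = "Located-In"
    CONTAINED_IN = "Contained-In"
    PHASE_OF = "Phase-Of"
    PARTICIPATES_IN = "Participates-In"

# Example: map a raw label string back to its enum member.
label = PartWholeRelation("Member-Of")
print(label.name)  # MEMBER_OF
```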
Processed Twitter is a dataset used for Twitter topic recognition. It contains tweets from 6 different topics.
The OFEQ-10k dataset contains 12,548 detailed questions with corresponding math headlines from MathOverflow.
The Jejueo Single Speaker Speech (JSS) dataset consists of 10k high-quality audio files recorded by a native Jejueo speaker and a transcript file.
needadvice is a dataset for advice classification, extracted from Reddit. Posts are annotated for whether or not they contain advice. It contains 6,148 samples for training, 816 for validation, and 898 for testing.
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.
AuxAI is a distantly supervised dataset for acronym identification.
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
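A graph of this size is convenient to explore with networkx; the edge-list file name and two-column format below are assumptions for illustration, not the dataset's released layout.

```python
# Minimal sketch: load the PART-OF ontology graph and check its size.
# Assumes a tab-separated two-column edge list (part, whole);
# "part_of_edges.tsv" is a hypothetical file name.
import networkx as nx

g = nx.DiGraph()
with open("part_of_edges.tsv", encoding="utf-8") as f:
    for line in f:
        part, whole = line.rstrip("\n").split("\t")
        g.add_edge(part, whole)

print(g.number_of_nodes(), "nodes")  # expected around 16,894
print(g.number_of_edges(), "edges")  # expected around 19,436
```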
CLaRO is a new dataset of 234 Competency Questions that were processed automatically into 106 patterns. The coverage of CLaRO, with its 93 main templates and 41 linguistic variants, is about 90% for unseen questions.
EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets related to six global epidemic outbreaks, including 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola.
WIKIOG is a public collection of over 1.75 million document-outline pairs for research on the outline generation (OG) task.
The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.
The dataset contains gold-standard summary labels for 39 "CSI: Crime Scene Investigation" episodes from seasons 1-5. Each episode contains the full-length screenplay and human annotations for its summary. The annotations include:
This dataset contains all utterances from two episodes of South Park (Latin American voices) and two episodes of Archer (Spanish voices). The order of the utterances is shuffled. Each utterance has been annotated as sarcastic or not. Sarcastic expressions also carry further annotations based on different theories of sarcasm.
We address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and text passages, of different degrees of semantic correspondence, from cited patent documents. Each pair has been labeled by technically skilled patent examiners from the European Patent Office. Accordingly, the label indicates the degree of semantic correspondence (matching), i.e., whether the text passage is prejudicial to the novelty of the claimed invention or not.
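One way to picture a PatentMatch example is as a labeled claim-passage pair; the field names in this dataclass are illustrative only and do not reflect the released column names.

```python
from dataclasses import dataclass

@dataclass
class ClaimPassagePair:
    """Illustrative record for one PatentMatch example (field names are hypothetical)."""
    claim_text: str            # claim from the patent application
    passage_text: str          # cited text passage from the prior-art document
    novelty_prejudicial: bool  # examiner label: is the passage prejudicial to novelty?

example = ClaimPassagePair(
    claim_text="A device comprising ...",
    passage_text="The prior art discloses a device ...",
    novelty_prejudicial=True,
)
```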
We present a dataset of dialogs in which journalists of The Guardian replied to reader comments, and we identify the reasons why. Based on this data, we formulate the novel task of recommending reader comments to journalists that are worth reading or replying to, i.e., ranking comments in such a way that the top comments are most likely to require the journalists' reaction.
This dataset comprises four files of IDs of either strongly or weakly engaging online news comments (please see the paper for details). "Top comments" are 1) the top 10% of comments in the politics section of The Guardian with the largest relative number of replies received (3,111 samples), and 2) the top 10% of comments in the politics section with the largest relative number of upvotes received (11,081 samples).
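The "top comments" selection amounts to taking the top decile of comments by a relative engagement measure. A minimal pandas sketch follows; the column names and the normalisation by per-article comment count are assumptions for illustration, since the released files contain only comment IDs.

```python
# Minimal sketch of the "top 10%" selection described above.
# Column names ("comment_id", "replies", "article_comment_count") and the
# input file name are hypothetical.
import pandas as pd

comments = pd.read_csv("guardian_politics_comments.csv")

# Relative engagement: replies normalised by overall activity on the article.
comments["rel_replies"] = comments["replies"] / comments["article_comment_count"]

threshold = comments["rel_replies"].quantile(0.90)
top_by_replies = comments[comments["rel_replies"] >= threshold]

top_by_replies["comment_id"].to_csv("top_comments_by_replies.csv", index=False)
```

The same procedure, applied to an upvote count instead of a reply count, would yield the second "top comments" file.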