This dataset is a new knowledge base (KB) of hasPart relationships, extracted from a large corpus of generic statements. Complementing other available resources, it is the first to be all three of: accurate (90% precision), salient (covering relationships a person might naturally mention), and broad in its coverage of common terms (approximated as a 10-year-old's vocabulary), while containing several times more hasPart entries than the popular ontologies ConceptNet and WordNet. In addition, it records quantifiers and argument modifiers, and links the entities to appropriate concepts in Wikipedia and WordNet.
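As a rough illustration of the kind of record described above, here is a minimal Python sketch of a single hasPart entry; the field names and example values are assumptions for illustration, not the KB's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HasPartEntry:
    """Illustrative shape of one hasPart record (field names assumed)."""
    whole: str                             # e.g. "bird"
    part: str                              # e.g. "wing"
    quantifier: Optional[str] = None       # e.g. "all", "some"
    whole_modifiers: List[str] = field(default_factory=list)
    part_modifiers: List[str] = field(default_factory=list)
    wikipedia_page: Optional[str] = None   # grounding of the part entity
    wordnet_synset: Optional[str] = None   # e.g. "wing.n.01"

entry = HasPartEntry(whole="bird", part="wing", quantifier="all",
                     wikipedia_page="https://en.wikipedia.org/wiki/Wing",
                     wordnet_synset="wing.n.01")
```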
PHM2017 is a new dataset consisting of 7,192 English tweets across six diseases and conditions: Alzheimer's disease, heart attack (any severity), Parkinson's disease, cancer (any type), depression (any severity), and stroke. The Twitter search API was used to retrieve the data, with the colloquial disease names as search keywords, in the expectation of obtaining a high-recall, low-precision dataset. After removing retweets and replies, the tweets were manually annotated. The labels are:
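The retweet/reply filtering step described above might look roughly like the sketch below, assuming tweets in the Twitter API v1.1 JSON format; this is illustrative, not the authors' actual preprocessing code.

```python
# Keep a tweet only if it is neither a retweet nor a reply
# (fields follow the Twitter API v1.1 tweet object).
def keep_tweet(tweet: dict) -> bool:
    is_retweet = ("retweeted_status" in tweet
                  or tweet.get("text", "").startswith("RT @"))
    is_reply = tweet.get("in_reply_to_status_id") is not None
    return not (is_retweet or is_reply)

tweets = [
    {"text": "RT @user: heart attack signs everyone should know"},
    {"text": "My grandmother was diagnosed with Alzheimer's today."},
]
filtered = [t for t in tweets if keep_tweet(t)]  # keeps only the second tweet
```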
IQUAD is a dataset for Visual Question Answering in interactive environments. It is built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration.
OPOSUM is a dataset for training and evaluating opinion summarization models. It contains Amazon reviews from six product domains: Laptop Bags, Bluetooth Headsets, Boots, Keyboards, Televisions, and Vacuums. The six training collections were created by downsampling the Amazon Product Dataset introduced in McAuley et al. (2015) and contain reviews together with their respective ratings.
A large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels.
Hindi Visual Genome is a multimodal dataset of text and images suitable for the English-Hindi multimodal machine translation task and for multimodal research.
The Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
In MutualFriends, two agents, A and B, each have a private knowledge base containing a list of friends with multiple attributes (e.g., name, school, major). The agents must chat with each other to find their unique mutual friend.
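A minimal sketch of the task's success condition, under the assumption that each knowledge base is a list of attribute dictionaries and that, by construction, exactly one friend appears in both; the attribute values below are made up.

```python
kb_a = [
    {"name": "Alice", "school": "MIT", "major": "CS"},
    {"name": "Bob", "school": "Yale", "major": "Biology"},
]
kb_b = [
    {"name": "Bob", "school": "Yale", "major": "Biology"},
    {"name": "Carol", "school": "Brown", "major": "History"},
]

def mutual_friends(kb1, kb2):
    # Hash each friend as a sorted attribute tuple, then intersect the sets.
    as_key = lambda friend: tuple(sorted(friend.items()))
    shared = {as_key(f) for f in kb1} & {as_key(f) for f in kb2}
    return [dict(k) for k in shared]

shared = mutual_friends(kb_a, kb_b)
assert len(shared) == 1  # by construction, the mutual friend is unique
print(shared[0]["name"])  # Bob
```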
This dataset is a collection of spoken-language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset covers situations where the operator instructs the robot to pick up a specific object and move it to another location, for example, “Move the blue and white tissue box to the top right bin.” It consists of RGBD images, bounding-box annotations, destination-box annotations, and text instructions.
Reddit Corpus is part of a repository of conversational datasets comprising hundreds of millions of examples, together with a standardised evaluation procedure for conversational response selection models based on '1-of-100 accuracy'. The Reddit Corpus contains 726 million multi-turn dialogues drawn from Reddit.
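The '1-of-100 accuracy' metric is simple to compute: each context is paired with its true response plus 99 distractors, and the model is correct when it scores the true response highest. A sketch, assuming a score matrix whose first column holds the true response's score:

```python
import numpy as np

def one_of_100_accuracy(scores: np.ndarray) -> float:
    """scores: (n_examples, 100); column 0 is the true response's score."""
    return float(np.mean(scores.argmax(axis=1) == 0))

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 100))
print(one_of_100_accuracy(scores))  # a random scorer gets ~0.01
```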
UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection) is a dataset for constructive and toxic speech detection in Vietnamese. It consists of 10,000 human-annotated comments.
KUAKE Query-Title Relevance, a dataset for estimating how relevant a document title is to a query, is used for the KUAKE-QTR task. Given a query (e.g., “Symptoms of vitamin B deficiency”), the task aims to find the relevant title (e.g., “The main manifestations of vitamin B deficiency”).
KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for the KUAKE-QQR task. Similar to KUAKE-QTR, the task aims to estimate query-query relevance, which is an essential and challenging task in real-world search engines.
DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and unconstrained settings.
A Winograd schema is a pair of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways in the two sentences, requiring world knowledge and reasoning to resolve. The schema takes its name from a well-known example by Terry Winograd: in “The city councilmen refused the demonstrators a permit because they feared violence” versus “... because they advocated violence”, the pronoun “they” refers to the councilmen in the first sentence and to the demonstrators in the second.
Textual Entailment Recognition has been proposed as a generic task that captures major semantic inference needs across many NLP applications, such as Question Answering, Information Retrieval, Information Extraction, and Text Summarization. The task requires recognizing, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other.
Choice of Plausible Alternatives for Russian language (PARus) provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. Each question in PARus is composed of a premise and two alternatives, and the task is to select the alternative that more plausibly has a causal relation with the premise. The position of the correct alternative is randomized, so the expected performance of random guessing is 50%.
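An illustrative PARus-style item (translated to English for readability; the field names follow the COPA format but may differ in the released files), together with a check of the 50% chance baseline:

```python
import random

item = {
    "premise": "The man lost his balance on the ladder.",
    "choice1": "He fell off the ladder.",
    "choice2": "He climbed up the ladder.",
    "question": "effect",   # ask for the more plausible cause or effect
    "label": 0,             # choice1 is the correct alternative
}

# Random guessing hits the correct alternative about half the time.
guesses = [random.randint(0, 1) == item["label"] for _ in range(10_000)]
print(sum(guesses) / len(guesses))  # ~0.5
```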
SNARE, short for ShapeNet Annotated with Referring Expressions, is a benchmark that requires a model to choose which of two objects is being referenced by a natural language description.
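A minimal sketch of the decision rule, under the assumption that a model embeds the description and each candidate object into a shared space and picks the more similar object; this does not reflect any particular SNARE baseline.

```python
import numpy as np

def choose_object(text_emb: np.ndarray, obj_embs: list) -> int:
    """Return the index (0 or 1) of the object most similar to the text."""
    sims = [float(text_emb @ o /
                  (np.linalg.norm(text_emb) * np.linalg.norm(o)))
            for o in obj_embs]
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
text = rng.normal(size=128)                       # toy description embedding
objects = [rng.normal(size=128), rng.normal(size=128)]  # toy object embeddings
print(choose_object(text, objects))
```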
This dataset provides a new split of VQA v2 (similar in spirit to VQA-CP v2), built from questions that are hard for biased models to answer.
BiSECT is a dataset for sentence simplification: the task of taking a long, complex sentence and splitting it into shorter sentences, rephrasing as necessary. The BiSECT training data consist of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. These pairs were obtained by extracting 1-2 sentence alignments from bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language.
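A hedged sketch of that construction recipe: keep alignments where one English sentence lines up with two foreign-language sentences, then machine-translate the foreign side into English. The `translate_to_english` helper below is a toy stand-in for a real MT system, not part of any released BiSECT tooling.

```python
def translate_to_english(sentence: str) -> str:
    # Toy stand-in for a machine translation system.
    toy_mt = {
        "Il est arrivé en retard.": "He arrived late.",
        "Il avait raté le bus.": "He had missed the bus.",
    }
    return toy_mt[sentence]

# 1-2 alignments: one English sentence aligned to two foreign sentences.
alignments = [
    (["He arrived late because he had missed the bus."],
     ["Il est arrivé en retard.", "Il avait raté le bus."]),
]

pairs = [
    (en_span[0], [translate_to_english(s) for s in fr_span])
    for en_span, fr_span in alignments
    if len(en_span) == 1 and len(fr_span) == 2
]
print(pairs[0])
# ('He arrived late because he had missed the bus.',
#  ['He arrived late.', 'He had missed the bus.'])
```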