3,148 machine learning datasets
A dataset that models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs.
This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 messages of IRC.
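As a rough illustration of the task (not of this dataset's actual annotation format), one common framing is that each message is linked to the earlier message it responds to, and conversations are then recovered as connected components of that reply graph. The sketch below assumes hypothetical (message, reply-to) index pairs.

    # Illustrative sketch only; the dataset's real annotation format may differ.
    # Assume each message index is paired with the index of the message it replies to.
    from collections import defaultdict

    reply_links = [(1, 0), (2, 0), (3, 1), (5, 4)]  # hypothetical (message, reply_to) pairs

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for msg, ref in reply_links:
        union(msg, ref)

    conversations = defaultdict(list)
    for msg in parent:
        conversations[find(msg)].append(msg)

    print([sorted(c) for c in conversations.values()])  # [[0, 1, 2, 3], [4, 5]]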
The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.
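A minimal sketch of how such "is a" edges can be viewed as a directed graph, reusing the leukemia example from the description; the dataset's actual file layout is not assumed here.

    # Sketch only: represent "u is a v" relations as directed edges and query them transitively.
    import networkx as nx

    edges = [
        ("acute leukemia", "leukemia"),        # example relation from the description
        ("leukemia", "hematologic cancer"),    # hypothetical extra edge for illustration
    ]

    g = nx.DiGraph()
    g.add_edges_from(edges)  # edge (u, v) read as "u is a v"

    # Broader concepts of a node are everything reachable along "is a" edges.
    print(nx.descendants(g, "acute leukemia"))  # {'leukemia', 'hematologic cancer'}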
Pn-summary is a dataset for Persian abstractive text summarization.
A dataset that aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models. It contains diverse user queries, each annotated with one or more analytic tasks.
A dataset that collects dense per-video-shot concept annotations.
Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26,000 entities in ~5,000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper and cover several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition.
A challenging machine comprehension corpus with multiple-choice questions, intended for research on the machine comprehension of Vietnamese text. This corpus includes 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used to teach reading comprehension to 1st-to-5th graders. Answers may be extracted from the contents of single or multiple sentences in the corresponding reading text.
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
A large-scale dataset built on questions from TyDi QA that lack same-language answers.
SRL is the task of extracting semantic predicate-argument structures from sentences. X-SRL is a multilingual parallel Semantic Role Labelling (SRL) corpus for English (EN), German (DE), French (FR) and Spanish (ES) that is based on English gold annotations and shares the same labelling scheme across languages.
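For readers unfamiliar with predicate-argument structures, the sketch below shows a generic PropBank-style frame for one English sentence; the sentence, spans, and representation are illustrative and not taken from X-SRL's actual file format.

    # Illustrative predicate-argument structure; not the X-SRL file format.
    sentence = ["The", "committee", "approved", "the", "budget", "yesterday"]

    srl_frame = {
        "predicate": {"token_index": 2, "lemma": "approve"},
        "arguments": [
            {"role": "ARG0", "span": (0, 1)},      # agent: "The committee"
            {"role": "ARG1", "span": (3, 4)},      # theme: "the budget"
            {"role": "ARGM-TMP", "span": (5, 5)},  # temporal modifier: "yesterday"
        ],
    }

    for arg in srl_frame["arguments"]:
        start, end = arg["span"]
        print(arg["role"], " ".join(sentence[start:end + 1]))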
CC-DBP is a dataset for knowledge base population research using Common Crawl and DBpedia.
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
MuMu is a new dataset of more than 31k albums classified into 250 genre classes.
ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.
Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised which classes to take. These were collected at the University of Michigan with IRB approval. They were released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.
Finnish Paraphrase Corpus is a fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in the corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts.
WEC-eng is a cross-document event coreference resolution dataset extracted from English Wikipedia. Coreference links are not restricted to predefined topics. The training set includes 40,529 mentions distributed into 7,042 coreference clusters.
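One way to picture mentions grouped into coreference clusters is sketched below; the mention texts and cluster ids are hypothetical, and the real WEC-eng schema (documents, spans, metadata) is not reproduced.

    # Sketch only: group event mentions by a cluster id.
    from collections import defaultdict

    mentions = [
        ("2004 Indian Ocean earthquake", 17),  # hypothetical (mention, cluster_id) pairs
        ("the Boxing Day tsunami", 17),
        ("2011 Tōhoku earthquake", 42),
    ]

    clusters = defaultdict(list)
    for text, cluster_id in mentions:
        clusters[cluster_id].append(text)

    for cluster_id, cluster_mentions in sorted(clusters.items()):
        print(cluster_id, cluster_mentions)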
This is a dataset for intent detection and slot filling for the Vietnamese language. The dataset consists of 5,871 gold-annotated utterances with 28 intent labels and 82 slot types.
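To make the annotation scheme concrete, here is a minimal sketch of an intent label plus per-token BIO slot tags; the utterance, intent name, and slot names are made up and are not drawn from the dataset (which is in Vietnamese).

    # Hypothetical intent/slot annotation in BIO format; not actual dataset content.
    tokens = ["book", "a", "flight", "to", "Hanoi", "tomorrow"]
    intent = "flight_booking"
    slots  = ["O", "O", "O", "O", "B-destination", "B-date"]  # one tag per token

    assert len(tokens) == len(slots)
    for token, tag in zip(tokens, slots):
        print(f"{token:12s}{tag}")
    print("intent:", intent)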
MS^2 (Multi-Document Summarization of Medical Studies) is a dataset of over 470k documents and 20k summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.