3,148 machine learning datasets
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
Includes gold-standard labels for identifying statements of desire, textual evidence for desire fulfillment, and annotations for whether the stated desire is fulfilled given the evidence in the narrative context.
The Dialogue Fairness dataset is used to evaluate and understand fairness in dialogue models, focusing on gender and racial biases.
An open corpus of scientific research papers with a representative sample from across scientific disciplines. The corpus includes not only the full text of each article, but also the document metadata and the bibliographic information for each reference.
The Freebase Annotations of TREC KBA 2014 Stream Corpus with Timestamps (FAKBAT) is an extension of the FAKBA1 dataset that contains entity age and entity timestamp. It comprises roughly 1.2 billion timestamped documents from global public news wires, blogs, forums, and shortened links shared on social media. It spans 572 days (October 7, 2011–May 1, 2013).
This dataset enriches the benchmark Room-to-Room (R2R) dataset by dividing the instructions into sub-instructions and pairing each of those with their corresponding viewpoints in the path. The overall instruction and trajectory of each sample remains the same.
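The sub-instruction pairing described above can be pictured as a simple record structure. A minimal sketch follows; the field names and values are a hypothetical illustration, not the dataset's actual schema.

```python
# Hypothetical record showing how one R2R instruction might be split into
# sub-instructions, each aligned with the viewpoints it covers along the path.
# Field names and values are illustrative, not the dataset's actual schema.
sample = {
    "instruction": "Walk past the couch, then turn left and stop at the door.",
    "path": ["vp1", "vp2", "vp3", "vp4"],  # full trajectory, unchanged
    "sub_instructions": [
        {"text": "Walk past the couch,", "viewpoints": ["vp1", "vp2"]},
        {"text": "then turn left and stop at the door.",
         "viewpoints": ["vp2", "vp3", "vp4"]},
    ],
}

# Sanity check: the sub-instruction viewpoints jointly cover the full path.
covered = {vp for sub in sample["sub_instructions"] for vp in sub["viewpoints"]}
assert covered == set(sample["path"])
```

The key property is that splitting adds alignment information without altering the original instruction or trajectory.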
FinnSentiment introduces a 27,000 sentence dataset (in Finnish) annotated independently with sentiment polarity by three native annotators.
This is a dialog dataset collected in a Wizard-of-Oz fashion: two humans talked to each other via a chat interface, one playing the role of the user and the other playing the role of the conversational agent. The latter is called a wizard in reference to the Wizard of Oz, the man behind the curtain. The wizards had access to a database of 250+ packages, each composed of a hotel and round-trip flights, and the users were asked to find the best deal. This resulted in complex dialogues where a user would often consider different options, compare packages, and progressively build the description of their ideal trip.
The Horne 2017 Fake News Data contains two independent news datasets.
A large-scale evaluation dataset for headlines of three different lengths composed by professional editors.
This dataset contains Japanese word similarity judgments, including rare words, and was constructed following the Stanford Rare Word Similarity Dataset. Ten annotators rated each word pair on an 11-level similarity scale.
The Jejueo Interview Transcripts (JIT) dataset is a parallel corpus containing 170k+ Jejueo-Korean sentences.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
A question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service.
Persian-English parallel corpus with more than one million sentence pairs collected from masterpieces of literature.
MultiReQA is a cross-domain evaluation suite for retrieval question answering models. Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus. MultiReQA comprises eight ReQA tasks with sentence boundary annotations, drawn from publicly available QA datasets in the MRQA shared task: SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, and TextbookQA. Five of these (SearchQA, TriviaQA, HotpotQA, NaturalQuestions, and SQuAD) contain both training and test data; the remaining three (BioASQ, RelationExtraction, and TextbookQA) contain only test data.
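The ReQA task can be sketched as scoring every candidate sentence in a corpus against the question and returning the best match. The toy retriever below uses a plain bag-of-words cosine similarity; the corpus, question, and scoring scheme are illustrative assumptions, not MultiReQA's actual models or data.

```python
import math
from collections import Counter

def bow_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_answer(question: str, sentences: list[str]) -> str:
    """ReQA-style retrieval: return the sentence most similar to the question."""
    q = Counter(question.lower().split())
    scored = [(bow_cosine(q, Counter(s.lower().split())), s) for s in sentences]
    return max(scored)[1]

# Toy corpus of sentence-level answer candidates (illustrative only).
corpus = [
    "The Eiffel Tower is located in Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
    "Mount Everest is the highest mountain on Earth.",
]
answer = retrieve_answer("At what temperature does water boil?", corpus)
```

Real ReQA systems replace the bag-of-words scorer with learned sentence encoders, but the retrieval structure is the same: one question scored against every sentence in the corpus.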
NewsPH-NLI is a sentence entailment benchmark dataset in the low-resource Filipino language.
The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. The automatically generated corpus is generated from Wikipedia. The gold-standard set is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages.
PoKi is a corpus of 61,330 poems written by children from grades 1 to 12. PoKi is especially useful in studying child language because it comes with information about the age of the child authors (their grade).
The Pump and Dump dataset is an annotated set of messages for detecting cryptocurrency market manipulations. It consists of a list of pump-and-dump operations organized by groups on Telegram. All the pump and dumps in the dataset are on the trading pair SYM/BTC.