WordNet-feelings is an affective dataset that identifies 3,664 word senses as feelings and associates each of them with one of nine categories of feeling: Actions, Anger, Attention, Attraction, Hedonics, Other, Physiological, Social, and Wellbeing.
TurkQA consists of a selection of sentences from English Wikipedia articles, with questions and answers crowdsourced from workers on Amazon Mechanical Turk.
The Dialog-based Language Learning dataset is designed to measure how well models can learn as a student from a teacher's textual responses to the student's answers (as well as, potentially, from an external real-valued reward signal).
To collect WikiSuggest, the Google Suggest API is used to harvest natural language questions, which are then submitted to Google Search. Whenever Google Search returns a box with a short answer from Wikipedia, an example is created from the question, the answer, and the Wikipedia document. When the answer string is missing from the document, this often indicates a spurious question-answer pair, such as ('what time is half time in rugby', '80 minutes, 40 minutes'); question-answer pairs without the exact answer string are therefore pruned. Of fifty examples examined after filtering, 54% were well-formed question-answer pairs whose answers could be grounded in the document, 20% contained answers without textual evidence in the document (the answer string exists only in an irrelevant context), and 26% were incorrect QA pairs.
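The pruning step described above can be sketched as a simple string-containment filter. This is an illustrative reconstruction, not the authors' code; the field names are assumptions.

```python
def prune_pairs(examples):
    """Keep only (question, answer, document) triples whose answer
    string occurs verbatim in the document text."""
    return [ex for ex in examples if ex["answer"] in ex["document"]]

examples = [
    # Grounded pair: the answer string appears in the document.
    {"question": "what time is half time in rugby",
     "answer": "40 minutes",
     "document": "A rugby match lasts 80 minutes, with a half-time "
                 "break after the first 40 minutes."},
    # Spurious pair: the concatenated answer never appears verbatim.
    {"question": "what time is half time in rugby",
     "answer": "80 minutes, 40 minutes",
     "document": "A rugby match lasts 80 minutes."},
]

kept = prune_pairs(examples)
print(len(kept))  # → 1: the spurious second pair is pruned
```

Note that verbatim matching is a coarse heuristic, which is why the paper still finds 26% incorrect pairs after filtering.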
The AAVE/SAE Paired Dataset contains 2,019 intent-equivalent AAVE/SAE pairs. The AAVE (African-American Vernacular English) samples are drawn from Blodgett et al. (2016)'s TwitterAAE, with their corresponding SAE (Standard American English) samples annotated by Amazon MTurk workers.
The Advice-Seeking Questions (ASQ) dataset is a collection of personal narratives with advice-seeking questions, split into train, test, and heldout sets of 8,865, 2,500, and 10,000 instances respectively. The dataset is used to train and evaluate methods that infer the advice-seeking goal behind a personal narrative. This task is formulated as a cloze test, where the goal is to identify which of two advice-seeking questions was removed from a given narrative.
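The cloze formulation above can be sketched as follows. This is a hypothetical illustration of how such an evaluation instance might be constructed; the field names and example texts are assumptions, not taken from the ASQ release.

```python
import random

def make_cloze_instance(narrative, true_question, distractor, rng):
    """Pair a narrative with two candidate questions in random order;
    the label indexes the question that was actually removed from it."""
    candidates = [true_question, distractor]
    rng.shuffle(candidates)
    return {"narrative": narrative,
            "candidates": candidates,
            "label": candidates.index(true_question)}

rng = random.Random(0)
inst = make_cloze_instance(
    "My roommate keeps borrowing my things without asking...",
    "How do I set boundaries without ruining the friendship?",
    "Which laptop should I buy for college?",
    rng)

# A model must pick the candidate indexed by inst["label"].
print(inst["candidates"][inst["label"]])
```

Shuffling the candidates prevents a model from exploiting positional bias in the binary choice.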
The AGRR-2019 dataset consists of 7.5k sentences with gapping (as well as 15k relevant negative sentences) and comprises data from various genres: news, fiction, social media, and technical texts. It was prepared for the Automatic Gapping Resolution Shared Task for Russian (AGRR-2019), a competition aimed at stimulating the development of NLP tools and methods for processing ellipsis.
The Alexa Point of View dataset is a point-of-view conversion dataset: a parallel corpus of messages spoken to a virtual assistant (input column) and the converted messages for delivery (output column). Each input and POV-converted output pair is tab-separated; for example: tell @CN@ that i'll be late [\t] hi @CN@, @SCN@ would like you to know that they'll be late. The @CN@ tag is a placeholder for the contact name (receiver) and the @SCN@ tag is a placeholder for the source contact name (sender). The dataset contains 46,563 pairs in total, split into 6,985 test, 32,594 train, and 6,985 dev pairs.
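Reading one of the tab-separated pairs and filling the placeholder tags can be sketched as below. The function and contact names are illustrative assumptions, not part of the dataset release.

```python
def render(line, contact_name, source_contact_name):
    """Split one tab-separated line into (input, output) and substitute
    the @CN@ (receiver) and @SCN@ (sender) placeholders."""
    source, target = line.rstrip("\n").split("\t")
    def fill(s):
        return (s.replace("@CN@", contact_name)
                 .replace("@SCN@", source_contact_name))
    return fill(source), fill(target)

line = ("tell @CN@ that i'll be late\t"
        "hi @CN@, @SCN@ would like you to know that they'll be late")
src, tgt = render(line, "Alex", "Sam")
print(src)  # → tell Alex that i'll be late
print(tgt)  # → hi Alex, Sam would like you to know that they'll be late
```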
This dataset contains a large number of online videos and subtitles.
This dataset contains 2,360 paraphrases in Armenian that can be used for paraphrase detection. The dataset is constructed by back-translating sentences from Armenian to English twice, and manually filtering the result.
AskParents is a dataset for advice classification extracted from Reddit. In this dataset, posts are annotated for whether they contain advice or not. It contains 8,701 samples for training, 802 for validation and 1,091 for testing.
This dataset is used to evaluate a predictive consent model for users’ information shared in social media. In this task, the goal is to predict whether the users will give their consent to share that data with different hypothetical audiences within a medical context. The dataset is built from information the users posted on Facebook and their consent answers about each piece of information.
AuxAD is a distantly supervised dataset for acronym disambiguation.
A dataset for evaluating English-Chinese bilingual contextual word similarity. It consists of 2,091 English-Chinese word pairs with their sentential contexts and similarity scores annotated by human raters.
Bianet is a parallel news corpus in Turkish, Kurdish, and English. It contains 3,214 Turkish articles with their sentence-aligned Kurdish or English translations from the Bianet online newspaper.
The CECW dataset is a color-extended version of Cleanup World (CW), borrowed from the mobile-manipulation robot domain. CW is a simulation environment in which an agent acts on received instructions; it contains a movable object and four rooms in four colors: "blue," "green," "red," and "yellow." Commands in CW are parsed according to a particular Geometric Linear Temporal Logic (GLTL) grammar, yielding a total of 3,382 commands reflecting 39 GLTL expressions.
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times larger than the datasets used in existing work on judgment prediction.
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
A diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform. It enables the exploration of questions that, until now, have mainly been addressed in the context of spoken language.
The Composed Quora dataset consists of questions extracted from Quora that are grouped together if they ask the same thing. The dataset contains 60,400 groups of questions, each group containing at least 3 questions that ask the same thing.