The data consists of 3 task types and 4 question types, yielding 12 scenarios in total. Tasks are grouped into stories, denoted by the numbering at the start of each line.
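As a rough illustration of the line-numbered story format described above, here is a minimal parsing sketch in Python; the whitespace-separated layout and the file name are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch: group line-numbered task lines into stories.
# Assumes each line starts with an integer index that resets to 1
# at the start of a new story; "tasks.txt" is a hypothetical file name.

def read_stories(path):
    stories, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            idx, text = line.split(" ", 1)
            if int(idx) == 1 and current:  # numbering restarts -> new story
                stories.append(current)
                current = []
            current.append(text)
    if current:
        stories.append(current)
    return stories

stories = read_stories("tasks.txt")
print(len(stories), "stories")
```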
The Gutenberg Poem Dataset is used for the next-verse prediction component.
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:
The Part-Whole Relations dataset contains semantic relations between entities, covering the following subtypes:
- Component-Of
- Member-Of
- Portion-Of
- Stuff-Of
- Located-In
- Contained-In
- Phase-Of
- Participates-In
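For reference, the eight subtypes above can be represented as a simple label set; the enum below is only an illustration of that taxonomy, not an official API of the dataset.

```python
from enum import Enum

class PartWholeRelation(Enum):
    """The eight part-whole relation subtypes listed for the dataset."""
    COMPONENT_OF = "Component-Of"
    MEMBER_OF = "Member-Of"
    PORTION_OF = "Portion-Of"
    STUFF_OF = "Stuff-Of"
    LOCATED_IN = "Located-In"
    CONTAINED_IN = "Contained-In"
    PHASE_OF = "Phase-Of"
    PARTICIPATES_IN = "Participates-In"

# Example: map a raw label string back to its enum member.
label = PartWholeRelation("Member-Of")
print(label.name)  # MEMBER_OF
```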
Processed Twitter is a dataset used for Twitter topic recognition. It contains tweets from 6 different topics.
The OFEQ-10k dataset contains 12,548 detailed questions with corresponding math headlines from MathOverflow.
The Jejueo Single Speaker Speech (JSS) dataset consists of 10k high-quality audio files recorded by a native Jejueo speaker and a transcript file.
needadvice is a dataset for advice classification, extracted from Reddit. Posts are annotated for whether or not they contain advice. It contains 6,148 samples for training, 816 for validation, and 898 for testing.
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.
AuxAI is a distantly supervised dataset for acronym identification.
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
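A graph of this size is convenient to explore with networkx; the edge-list file name and two-column format below are assumptions for illustration, not the dataset's released layout.

```python
# Minimal sketch: load the PART-OF ontology graph and check its size.
# Assumes a tab-separated two-column edge list (part, whole);
# "part_of_edges.tsv" is a hypothetical file name.
import networkx as nx

g = nx.DiGraph()
with open("part_of_edges.tsv", encoding="utf-8") as f:
    for line in f:
        part, whole = line.rstrip("\n").split("\t")
        g.add_edge(part, whole)

print(g.number_of_nodes(), "nodes")  # expected around 16,894
print(g.number_of_edges(), "edges")  # expected around 19,436
```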
CLaRO is a new dataset of 234 Competency Questions that were processed automatically into 106 patterns. The coverage of CLaRO, with its 93 main templates and 41 linguistic variants, is about 90% for unseen questions.
EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets related to six global epidemic outbreaks, including 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola.
WIKIOG is a public collection of over 1.75 million document-outline pairs for research on the outline generation (OG) task.
The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.
The dataset contains gold-standard summary labels for 39 "CSI: Crime Scene Investigation" episodes from seasons 1-5. Each episode contains the full-length screenplay and human annotations for its summary. The annotations include:
This dataset contains all utterances from two episodes of South Park (Latin American voices) and two episodes of Archer (Spanish voices). The order of the utterances is shuffled. Each utterance has been annotated as sarcastic or not. Sarcastic expressions also carry further annotations based on different theories of sarcasm.
We address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and text passages, of different degrees of semantic correspondence, from cited patent documents. Each pair has been labeled by technically skilled patent examiners from the European Patent Office. Accordingly, the label indicates the degree of semantic correspondence (matching), i.e., whether the text passage is prejudicial to the novelty of the claimed invention or not.
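One way to picture a PatentMatch example is as a labeled claim-passage pair; the field names in this dataclass are illustrative only and do not reflect the released column names.

```python
from dataclasses import dataclass

@dataclass
class ClaimPassagePair:
    """Illustrative record for one PatentMatch example (field names are hypothetical)."""
    claim_text: str            # claim from the patent application
    passage_text: str          # cited text passage from the prior-art document
    novelty_prejudicial: bool  # examiner label: is the passage prejudicial to novelty?

example = ClaimPassagePair(
    claim_text="A device comprising ...",
    passage_text="The prior art discloses a device ...",
    novelty_prejudicial=True,
)
```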
We present a dataset of dialogs in which journalists of The Guardian replied to reader comments, and we identify the reasons why. Based on this data, we formulate the novel task of recommending reader comments to journalists that are worth reading or replying to, i.e., ranking comments in such a way that the top comments are most likely to require the journalists' reaction.
This dataset comprises four files of IDs of either strongly or weakly engaging online news comments (please see the paper for details). "Top comments" are 1) the top 10% of comments in the politics section of The Guardian with the largest relative number of replies received (3,111 samples), and 2) the top 10% of comments in the politics section with the largest relative number of upvotes received (11,081 samples).
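The "top comments" selection amounts to taking the top decile of comments by a relative engagement measure. A minimal pandas sketch follows; the column names and the normalisation by per-article comment count are assumptions for illustration, since the released files contain only comment IDs.

```python
# Minimal sketch of the "top 10%" selection described above.
# Column names ("comment_id", "replies", "article_comment_count") and the
# input file name are hypothetical.
import pandas as pd

comments = pd.read_csv("guardian_politics_comments.csv")

# Relative engagement: replies normalised by overall activity on the article.
comments["rel_replies"] = comments["replies"] / comments["article_comment_count"]

threshold = comments["rel_replies"].quantile(0.90)
top_by_replies = comments[comments["rel_replies"] >= threshold]

top_by_replies["comment_id"].to_csv("top_comments_by_replies.csv", index=False)
```

The same procedure, applied to an upvote count instead of a reply count, would yield the second "top comments" file.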