3,148 machine learning datasets
The EXEQ-300k dataset contains 290,479 detailed questions with corresponding math headlines from Mathematics Stack Exchange. The dataset can be used to generate concise math headlines from detailed math questions.
A high-quality dataset for machine translation evaluation that aims to be one of the first non-synthetic, gender-balanced test sets.
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
The IWSLT 2019 dataset contains source, machine-translated, reference, and post-edited text, which can be used to quantify and evaluate post-editing effort after automatic machine translation.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
A manually annotated dataset containing 4,779 posts from Twitter, each labeled as offensive or not offensive.
A corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality, author-extracted keyphrases, which are then filtered and cleaned to achieve higher-quality keyphrases.
A first-of-its-kind large dataset of sarcastic/non-sarcastic tweets with high-quality labels and extra features: (1) sarcasm perspective labels and (2) new contextual features. The dataset is expected to advance sarcasm detection research.
A dataset of single-sentence edits crawled from Wikipedia.
Wikipedia Title is a dataset for learning character-level compositionality from characters' visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese, or Korean, each labelled with the category to which the article belongs.
WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language that is.
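The classification task WiLI poses can be sketched with a classic character n-gram baseline. The snippet below is a minimal illustration using only the Python standard library; the toy training sentences and the `NgramLanguageID` class are invented for the example and are not part of the WiLI release.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams, a standard signal for language identification."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageID:
    """Tiny profile-based language identifier (Cavnar & Trenkle style)."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}  # language -> Counter of n-gram frequencies

    def fit(self, samples):
        # samples: iterable of (paragraph, language) pairs
        for text, lang in samples:
            profile = self.profiles.setdefault(lang, Counter())
            profile.update(char_ngrams(text, self.n))

    def predict(self, text):
        grams = Counter(char_ngrams(text, self.n))
        # Score each language by multiset overlap between the paragraph's
        # n-grams and that language's training profile.
        def overlap(lang):
            profile = self.profiles[lang]
            return sum(min(c, profile[g]) for g, c in grams.items())
        return max(self.profiles, key=overlap)

# Toy stand-in for WiLI-style (paragraph, language) training data.
train = [
    ("the quick brown fox jumps over the lazy dog", "eng"),
    ("she sells sea shells on the sea shore", "eng"),
    ("der schnelle braune fuchs springt über den faulen hund", "deu"),
    ("sie verkauft muscheln an der küste", "deu"),
]
clf = NgramLanguageID()
clf.fit(train)
print(clf.predict("the dog sleeps on the shore"))    # eng
print(clf.predict("der hund schläft an der küste"))  # deu
```

On the real benchmark one would train the profiles on the 235-language training split rather than toy sentences; the scoring rule stays the same.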
Youtubean is a dataset created from closed captions of YouTube product review videos. It can be used for aspect extraction and sentiment classification.
The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). Annotations were gathered at two levels of granularity.
The Multimodal Document Intent Dataset (MDID) is a dataset for computing author intent from multimodal data from Instagram. It contains 1,299 Instagram posts covering a variety of topics, annotated with labels from three taxonomies. The samples are labelled with seven intent labels: provocative, informative, advocative, entertainment, expositive, expressive, and promotive.
ADE-Affordance is a new dataset that builds upon ADE20k with annotations that enable rich visual reasoning about affordances.
This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.
The BuzzFeed-Webis Fake News Corpus 16 comprises the output of nine publishers in a week close to the 2016 US elections. Among the selected publishers are six prolific hyperpartisan ones (three left-wing and three right-wing) and three mainstream publishers (see Table 1). All publishers earned Facebook's blue checkmark, indicating authenticity and an elevated status within the network. For seven weekdays (September 19 to 23 and September 26 and 27), every post and linked news article of the nine publishers was fact-checked by professional journalists at BuzzFeed. In total, 1,627 articles were checked: 826 mainstream, 256 left-wing, and 545 right-wing. The imbalance between categories results from differing publication frequencies.
FakeNewsAMT & Celebrity include two novel datasets for the task of fake news detection, covering seven different news domains.
The Parsing Time Normalizations (PTN) corpus in SCATE format allows the representation of a wider variety of time expressions than previous approaches. The corpus was released with SemEval 2018 Task 6.