3,148 machine learning datasets
The EXEQ-300k dataset contains 290,479 detailed questions with corresponding math headlines from Mathematics Stack Exchange. The dataset can be used to generate concise math headlines from detailed math questions.
A high-quality dataset for machine translation evaluation that aims to be one of the first non-synthetic, gender-balanced test sets.
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
The IWSLT 2019 dataset contains source, machine-translated, reference, and post-edited text, which can be used to quantify and evaluate post-editing effort after automatic machine translation.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
A manually annotated dataset containing 4,779 posts from Twitter, each labeled as offensive or not offensive.
A corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality, author-extracted keyphrases, which are then filtered and cleaned to achieve higher-quality keyphrases.
A first-of-its-kind large dataset of sarcastic/non-sarcastic tweets with high-quality labels and extra features: (1) sarcasm perspective labels and (2) new contextual features. The dataset is expected to advance sarcasm detection research.
A dataset of single-sentence edits crawled from Wikipedia.
Wikipedia Title is a dataset for learning character-level compositionality from characters' visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese, or Korean, each labelled with the category to which the article belongs.
WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language that is.
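The classification task WiLI poses can be sketched with a classic character n-gram baseline. The snippet below is a minimal illustration using only the Python standard library; the toy training sentences and the `NgramLanguageID` class are invented for the example and are not part of the WiLI release.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams, a standard signal for language identification."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageID:
    """Tiny profile-based language identifier (Cavnar & Trenkle style)."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}  # language -> Counter of n-gram frequencies

    def fit(self, samples):
        # samples: iterable of (paragraph, language) pairs
        for text, lang in samples:
            profile = self.profiles.setdefault(lang, Counter())
            profile.update(char_ngrams(text, self.n))

    def predict(self, text):
        grams = Counter(char_ngrams(text, self.n))
        # Score each language by multiset overlap between the paragraph's
        # n-grams and that language's training profile.
        def overlap(lang):
            profile = self.profiles[lang]
            return sum(min(c, profile[g]) for g, c in grams.items())
        return max(self.profiles, key=overlap)

# Toy stand-in for WiLI-style (paragraph, language) training data.
train = [
    ("the quick brown fox jumps over the lazy dog", "eng"),
    ("she sells sea shells on the sea shore", "eng"),
    ("der schnelle braune fuchs springt über den faulen hund", "deu"),
    ("sie verkauft muscheln an der küste", "deu"),
]
clf = NgramLanguageID()
clf.fit(train)
print(clf.predict("the dog sleeps on the shore"))    # eng
print(clf.predict("der hund schläft an der küste"))  # deu
```

On the real benchmark one would train the profiles on the 235-language training split rather than toy sentences; the scoring rule stays the same.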
Youtubean is a dataset created from closed captions of YouTube product review videos. It can be used for aspect extraction and sentiment classification.
The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). Annotations were gathered at two levels of granularity.
The Multimodal Document Intent Dataset (MDID) is a dataset for computing author intent from multimodal data from Instagram. It contains 1,299 Instagram posts covering a variety of topics, annotated with labels from three taxonomies. The samples are labelled with seven intent labels: provocative, informative, advocative, entertainment, expositive, expressive, and promotive.
ADE-Affordance is a new dataset that builds upon ADE20k with annotations that enable rich visual reasoning about affordances.
This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.
The BuzzFeed-Webis Fake News Corpus 16 comprises the output of nine publishers in a week close to the 2016 US elections. Among the selected publishers are six prolific hyperpartisan ones (three left-wing and three right-wing) and three mainstream publishers (see Table 1). All publishers earned Facebook's blue checkmark, indicating authenticity and an elevated status within the network. For seven weekdays (September 19 to 23 and September 26 and 27), every post and linked news article of the nine publishers was fact-checked by professional journalists at BuzzFeed. In total, 1,627 articles were checked: 826 mainstream, 256 left-wing, and 545 right-wing. The imbalance between categories results from differing publication frequencies.
FakeNewsAMT & Celebrity include two novel datasets for the task of fake news detection, covering seven different news domains.
The Parsing Time Normalizations (PTN) corpus in SCATE format allows the representation of a wider variety of time expressions than previous approaches. The corpus was released with SemEval 2018 Task 6.