3,148 machine learning datasets
Corpus of controversial news articles extracted from Twitter. It contains news on three topics: Beef Ban, the controversy over the slaughter and sale of beef on religious grounds (1,543 articles), which is localised to a particular region, mainly the Indian subcontinent; Gun Control, restrictions on carrying, using, or purchasing firearms (6,494 articles); and Capital Punishment, use of the death penalty (7,905 articles), the latter two being topical in various regions around the world.
COSTRA 1.0 is a dataset of complex sentence transformations, intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The first version of the dataset is limited to sentences in Czech, but the construction method is universal and the authors plan to apply it to other languages as well. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The Covid19-CountryImage dataset is a Twitter dataset of COVID-19-related tweets.
CUHK-QA is a dataset for natural language-based person search using iterative questioning.
CzEng 2.0 is a Czech-English parallel corpus consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise.
A large benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation. This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments.
The dataset provides the content of all articles for 128 Wikipedia languages. The dataset has been further enriched with about 25% more links and selected partitions published as Linked Data.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles labelled at the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled at the article level.
The ConcoDisco Corpus is an English-French parallel corpus with discourse relations (DRs) and discourse connectives (DCs) annotations.
FFR Dataset is an ongoing project to collect, clean, and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an indigenous African language spoken mostly in Benin by about 1.7 million people. As training data is crucial to the performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.
FSOCO is a collaborative dataset for vision-based cone detection systems in Formula Student Driverless competitions. It contains human-annotated ground truth labels for both bounding boxes and instance-wise segmentation masks. The data buy-in philosophy of FSOCO asks student teams to contribute to the database before being granted access, ensuring continuous growth. Clear labeling guidelines and tools for sophisticated raw image selection guarantee that new annotations meet the desired quality.
GameWikiSum is a domain-specific (video game) dataset for multi-document summarization, one hundred times larger than commonly used datasets and in a domain other than news. Input documents consist of long professional video game reviews as well as references to their gameplay sections in Wikipedia pages.
GASP is a dataset composed of lists of cited abstracts associated with the corresponding source abstract. It comprises a training set of 100,000 elements and test and validation sets of 10,000 elements each. The goal is to generate a paper's abstract given the abstracts of its cited papers, modeling the human creativity behind the process.
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
This is a high-quality dataset consisting of 14.8M English utterances, extracted from processed dialogues in publicly available online books.
This dataset is used for predicting house prices from both images and textual information. It is composed of 535 sample houses from California, USA.
Ice Hockey News Dataset is a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods and for demonstrating generated text that journalists judged to be relatively close to a viable product.
IgboNLP is a standard machine translation benchmark dataset for Igbo. It consists of 10,000 human-quality English-Igbo sentence pairs, mostly from the news domain.
Simulates unanticipated user needs in the deployment stage.
A dataset of sentence pairs annotated following the formalization.