3,148 machine learning datasets
Corpus of controversial news articles extracted from Twitter. It contains news on three topics: Beef Ban, the controversy over the slaughter and sale of beef on religious grounds (1,543 articles), which is localised to a particular region, mainly the Indian subcontinent; Gun Control, restrictions on carrying, using, or purchasing firearms (6,494 articles); and Capital Punishment, use of the death penalty (7,905 articles), the latter two being topical in various regions around the world.
COSTRA 1.0 is a dataset of complex sentence transformations, intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The first version of the dataset is limited to sentences in Czech, but the construction method is universal and the authors plan to apply it to other languages as well. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The Covid19-CountryImage dataset is a Twitter dataset of COVID-19-related tweets.
CUHK-QA is a dataset for natural language-based person search using iterative questioning.
CzEng 2.0 is a Czech-English parallel corpus consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise.
A large benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation. This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments.
The dataset provides the content of all articles for 128 Wikipedia languages. The dataset has been further enriched with about 25% more links and selected partitions published as Linked Data.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles labelled at the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled at the article level.
The ConcoDisco Corpus is an English-French parallel corpus with discourse relations (DRs) and discourse connectives (DCs) annotations.
FFR Dataset is an ongoing project to collect, clean, and store corpora of Fon and French sentences for Fon-French machine translation. Fon (also called Fongbe) is an indigenous African language spoken mostly in Benin by about 1.7 million people. As training data is crucial to the performance of a machine learning model, the aim of the project is to compile the largest set of training corpora for the research and design of translation and NLP models involving Fon. There are 117,029 parallel Fon-French sentences at the moment.
FSOCO is a collaborative dataset for vision-based cone detection systems in Formula Student Driverless competitions. It contains human-annotated ground truth labels for both bounding boxes and instance-wise segmentation masks. The data buy-in philosophy of FSOCO asks student teams to contribute to the database before being granted access, ensuring continuous growth. Clear labeling guidelines and tools for sophisticated raw image selection guarantee that new annotations meet the desired quality.
GameWikiSum is a domain-specific (video game) dataset for multi-document summarization, one hundred times larger than commonly used datasets and in a domain other than news. Input documents consist of long professional video game reviews as well as references to their gameplay sections in Wikipedia pages.
GASP is a dataset composed of lists of cited abstracts associated with the corresponding source abstract. It comprises a training set of 100,000 elements and test and validation sets of 10,000 elements each. The goal is to generate a paper's abstract given the abstracts of its cited papers, modeling the human creativity behind the process.
The Gigaword Entailment dataset is a dataset for entailment prediction between an article and its headline. It is built from the Gigaword dataset.
This is a high-quality dataset consisting of 14.8M English utterances, extracted from processed dialogues in publicly available online books.
This dataset is used for predicting house prices from both images and textual information. It is composed of 535 sample houses from California, USA.
Ice Hockey News Dataset is a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods and for demonstrating generated text that journalists judged to be relatively close to a viable product.
IgboNLP is a standard machine translation benchmark dataset for Igbo. It consists of 10,000 human-quality English-Igbo sentence pairs, mostly from the news domain.
Simulates unanticipated user needs in the deployment stage.
A dataset of sentence pairs annotated following the formalization.