19,997 machine learning datasets
The Abstraction and Reasoning Corpus (ARC) is a dataset created by François Chollet in 2019. It’s designed to measure the gap between machine and human learning. The dataset consists of 1000 image-based reasoning tasks. Each task provides an input image and asks for an output image. The goal is to solve these tasks using a system that can understand and learn abstract concepts, and apply reasoning skills to generate the correct output. This dataset poses a significant challenge for AI systems and is used to advance research in artificial intelligence and machine learning.
Recent applications of LLMs in Machine Reading Comprehension (MRC) systems have shown impressive results, but the use of shortcuts, mechanisms triggered by features spuriously correlated to the true label, has emerged as a potential threat to their reliability. We analyze the problem from two angles: LLMs as editors, guided to edit text to mislead LLMs; and LLMs as readers, who answer questions based on the edited text. We introduce a framework that guides an editor to add potential shortcut triggers to samples. Using GPT4 as the editor, we find it can successfully edit shortcut triggers into samples so that they fool LLMs. Analysing LLMs as readers, we observe that even capable LLMs can be deceived using shortcut knowledge. Strikingly, we discover that GPT4 can be deceived by its own edits (15% drop in F1). Our findings highlight inherent vulnerabilities of LLMs to shortcut manipulations. We publish ShortcutQA, a curated dataset generated by our framework for future research.
GQA-OOD is a new dataset and benchmark for the evaluation of VQA models in OOD (out of distribution) settings.
KGRC-RDF-star is an RDF-star dataset converted from KGRC-RDF, which is a Knowledge graph dataset of novel stories.
Action with RAre Scene is a small-scale dataset collected from YouTube. It contains video clips of human actions, drawn from the Kinetics-400 action classes, that take place in rare scenes or backgrounds.
The MegaNegRaising dataset, also known as MegaNeRd, is a collection of data that captures patterns of neg-raising inferences and acceptability judgments for 925 clause-embedding verbs of English in various syntactic structures. It is part of a larger project that investigates lexically triggered inferences across clause-embedding verbs in English.
Online social platforms serve a critical role for individuals as they seek to fill informational and emotional needs, from informational support like advice to emotional support like expressions of sympathy, frequently by interacting with others. The supportive replies of others help promote personal well-being, yet unsupportive replies can not only lead to distress but discourage online engagement altogether. In this work, we aim to study support in general - everyday interactions - drawing upon theories of how support is expressed in language. Our work is motivated by an agenda of promoting supportive online platforms where people can participate equally.
The SweRec dataset in ScandEval is a Swedish language dataset used for text classification tasks. It contains strings of text, each associated with a label indicating the sentiment of the text. The labels are "positive", "negative", or "neutral", representing the sentiment expressed in the text.
The Stack Exchange dataset is a collection of data from various Stack Exchange sites, including Stack Overflow, Mathematics, Super User, and many others. It includes questions, answers, comments, tags, and other related data from these sites.
The LanguageNet (English) is a collection of sentence-level paraphrases from Twitter, built by linking tweets through shared URLs. With 51,524 human-annotated sentence pairs (42,200 for training and 9,324 for testing), it is the largest such corpus to date. It can grow by 30,000 new sentential paraphrases per month at ~70% precision. One year of data is now available: 2,869,657 candidate pairs.
The Polish Paraphrase Corpus (PPC) is a dataset consisting of 7000 manually labeled sentence pairs in Polish. The purpose of creating this dataset was to verify how machine learning models perform in the challenging problem of paraphrase identification, where most records contain semantically overlapping parts. The dataset was divided into training, validation, and test splits, and each record was assigned to one of three categories: exact paraphrases, close paraphrases, or non-paraphrases. The corpus was created by automatically generating candidate pairs and then manually labeling them. The extracted sentence pairs were drawn from different data sources, including Tatoeba, Polish news articles, Wikipedia, and the Polish version of the SICK dataset.
This is a challenge set for machine translation that contains 32G translation units in 2,539 bitexts. The whole data set covers 487 languages linked to each other in 4,024 language pairs. The package includes a release of 657 test sets derived from Tatoeba.org that cover 138 languages. Training data is compiled from various sources collected within the OPUS project.
The dataset from the paper "Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media" by Hana Matatov, Mor Naaman, and Ofra Amir.
The EmoTag1200 dataset is a collection of resources for analyzing the emotion and sentiment of emojis as well as tweets written in English. The name EmoTag indicates its usefulness in exploiting emojis for emotional tagging.
Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada".
Wastewater catchment area data are essential for wastewater treatment capacity planning and have recently become critical for operationalising wastewater-based epidemiology (WBE) for COVID-19. Owing to the privatised nature of the water industry in the United Kingdom, the required catchment area datasets are not readily available to researchers. Here, we present a consolidated dataset of 7,537 catchment areas from ten sewerage service providers in Great Britain, covering more than 96% of the population of England and Wales.
A high-quality indoor monocular depth estimation dataset with a focus on performance variation across space types.
This dataset is made up of forward-looking sonar images containing ten classes of underwater debris. The dataset can be used for segmentation or object detection. Applications include training computer vision models for underwater robotics applications.
The Wyze Rule Recommendation Dataset is a large dataset covering 300,000 users. Please cite [1] if you use the dataset and [2] if you reference the algorithm.