Datasets

19,997 machine learning datasets

Atari 100k

Atari games with a budget of only 100k environment steps (400k frames with frame-skip=4).

CreditRisk (TransUnion TransRisks Scores)

Dataset containing credit scores and loan repayment rate (90-day default rate) for individuals, separated by race (white, black, Hispanic, Asian).

62 papers | 0 benchmarks

TinyStories

Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary.

62 papers | 0 benchmarks | Texts

SFEW (Static Facial Expression in the Wild)

The Static Facial Expressions in the Wild (SFEW) dataset is a dataset for facial expression recognition. It was created by selecting static frames from the AFEW database by computing key frames based on facial point clustering. The most commonly used version, SFEW 2.0, was the benchmarking data for the SReco sub-challenge in EmotiW 2015. SFEW 2.0 has been divided into three sets: Train (958 samples), Val (436 samples) and Test (372 samples). Each of the images is assigned to one of seven expression categories, i.e., anger, disgust, fear, neutral, happiness, sadness, and surprise. The expression labels of the training and validation sets are publicly available, whereas those of the testing set are held back by the challenge organizer.

61 papers | 6 benchmarks | Images, Videos

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description (MSVD) dataset, which consists of about 120K sentences describing more than 2,000 video snippets. In the MSVD-QA dataset, Question-Answer (QA) pairs are generated from these descriptions. The dataset is mainly used in video captioning experiments, but due to its large data size, it is also used for VideoQA. It contains 1,970 video clips and approximately 50.5K QA pairs.

61 papers | 7 benchmarks

SIP (Salient Person)

The Salient Person dataset (SIP) contains 929 salient person samples with different poses and illumination conditions.

61 papers | 20 benchmarks | Images

LIP (Look into Person)

The LIP (Look into Person) dataset is a large-scale dataset focusing on semantic understanding of a person. It contains 50,000 images with elaborated pixel-wise annotations of 19 semantic human part labels and 2D human poses with 16 key points. The images are collected from real-world scenarios, and the subjects appear with challenging poses and views, heavy occlusions, various appearances and low resolution.

61 papers | 0 benchmarks | Images

FigureQA

FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.

61 papers | 0 benchmarks | Images, Texts

PanNuke

PanNuke is a semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total, the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask.

61 papers | 11 benchmarks | Images, Medical

BTAD (beanTech Anomaly Detection)

The BTAD (beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2,830 real-world images of 3 industrial products showcasing body and surface defects.

61 papers | 4 benchmarks | Images

WebQuestionsSP (WebQuestions Semantic Parses Dataset)

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base Question Answering” [Yih, Richardson, Meek, Chang & Suh, 2016], in which we evaluated the value of gathering semantic parses, vs. answers, for a set of questions that originally comes from WebQuestions [Berant et al., 2013]. The WebQuestionsSP dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. This release also includes an evaluation script and the output of the STAGG semantic parsing system when trained using the full semantic parses. More detail can be found in the document and labeling instructions included in this release, as well as the paper.

61 papers | 4 benchmarks | Texts

CIRR (Compose Image Retrieval on Real-life images)

Composed Image Retrieval (or Image Retrieval conditioned on Language Feedback) is a relatively new retrieval task, where an input query consists of an image and a short textual description of how to modify the image.

61 papers | 12 benchmarks | Images, Texts

CIFAR-10H

CIFAR-10H is a new dataset of soft labels reflecting human perceptual uncertainty for the 10,000-image CIFAR-10 test set. This contains 1,000 images for each of the 10 categories in the original CIFAR-10 dataset.

61 papers | 0 benchmarks

Game of 24

Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (+-*/) to obtain 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”. We scrape data from 4nums.com, which has 1,362 games that are sorted from easy to hard by human solving time, and use a subset of relatively hard games indexed 901-1,000 for testing. For each task, we consider the output as success if it is a valid equation that equals 24 and uses the input numbers each exactly once. We report the success rate across 100 games as the metric.
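
The success criterion described above (a valid equation that equals 24 and uses each input number exactly once) can be sketched as a small checker. This is only an illustration, not the authors' evaluation script: the function name `is_valid_solution` and the choice to read only the left-hand side of the `=` sign are assumptions made here. Exact fractions are used so that solutions relying on division are not rejected by floating-point rounding.

```python
import ast
import re
from collections import Counter
from fractions import Fraction


def _evaluate(node):
    """Evaluate an AST limited to non-negative integers, + - * / and parentheses."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        return Fraction(node.value)
    if isinstance(node, ast.BinOp):
        left, right = _evaluate(node.left), _evaluate(node.right)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right  # raises ZeroDivisionError on division by zero
    raise ValueError("unsupported expression")


def is_valid_solution(puzzle: str, output: str) -> bool:
    """Hypothetical checker: True if `output` solves the space-separated `puzzle`."""
    # Assumption: only the part before the first "=" is evaluated.
    expression = output.split("=")[0].strip()
    used = Counter(int(tok) for tok in re.findall(r"\d+", expression))
    given = Counter(int(tok) for tok in puzzle.split())
    if used != given:  # every input number must appear exactly once
        return False
    try:
        value = _evaluate(ast.parse(expression, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return value == 24


if __name__ == "__main__":
    print(is_valid_solution("4 9 10 13", "(10 - 4) * (13 - 9) = 24"))
```

Under these assumptions, the success rate over a test set would be the fraction of games for which the checker returns `True`.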

61 papers | 1 benchmark | Texts

GAIA (a benchmark for general AI assistants)

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions. Using GAIA’s methodology, we devise 466 questions and their answers.

61 papers | 0 benchmarks

GAP (GAP Benchmark Suite)

GAP is a graph processing benchmark suite with the goal of helping to standardize graph processing evaluations. Fewer differences between graph processing evaluations will make it easier to compare different research efforts and quantify improvements. The benchmark not only specifies graph kernels, input graphs, and evaluation methodologies, but it also provides optimized baseline implementations. These baseline implementations are representative of state-of-the-art performance, and thus new contributions should outperform them to demonstrate an improvement. The input graphs are sized appropriately for shared memory platforms, but any implementation on any platform that conforms to the benchmark's specifications could be compared. This benchmark suite can be used in a variety of settings. Graph framework developers can demonstrate the generality of their programming model by implementing all of the benchmark's kernels and delivering competitive performance on all of the benchmark's gra

60 papers | 5 benchmarks | Graphs

ToTTo

ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

60 papers | 6 benchmarks | Texts

TVQA+

TVQA+ contains 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers.

60 papers | 0 benchmarks | Texts, Videos

VOCASET

VOCASET is a 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio. The dataset has 12 subjects and 480 sequences of about 3-4 seconds each with sentences chosen from an array of standard protocols that maximize phonetic diversity.

60 papers | 7 benchmarks | 3D, Speech

PF-PASCAL

60 papers | 4 benchmarks
