Datasets

19,997 machine learning datasets

MNIST-M

MNIST-M is created by combining MNIST digits with patches randomly extracted from color photos in BSDS500, which serve as backgrounds. It contains 59,001 training and 9,001 test images.

193 papers | 0 benchmarks | Images
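The MNIST-M construction described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' generation script: the blending rule (per-channel absolute difference), the function name `blend_digit_with_patch`, and the toy inputs are all assumptions made for the example. No real MNIST or BSDS500 data is loaded.

```python
import random


def blend_digit_with_patch(digit, photo, rng):
    """Blend a grayscale digit with a random crop of a color photo.

    digit: list of rows of ints in [0, 255] (e.g. 28x28).
    photo: list of rows of (r, g, b) tuples, at least as large as digit.
    The absolute-difference rule is an assumed blending choice.
    """
    h, w = len(digit), len(digit[0])
    # Pick the top-left corner of a random patch that fits inside the photo.
    top = rng.randint(0, len(photo) - h)
    left = rng.randint(0, len(photo[0]) - w)
    return [
        [
            tuple(abs(channel - digit[r][c]) for channel in photo[top + r][left + c])
            for c in range(w)
        ]
        for r in range(h)
    ]


rng = random.Random(0)
# Toy stand-ins: a vertical stroke as the "digit", noise as the "photo".
digit = [[255 if 12 <= c < 16 and 8 <= r < 20 else 0 for c in range(28)] for r in range(28)]
photo = [[tuple(rng.randrange(256) for _ in range(3)) for _ in range(64)] for _ in range(64)]
image = blend_digit_with_patch(digit, photo, rng)
print(len(image), len(image[0]), len(image[0][0]))
```

Where the digit is black (0), the background patch passes through unchanged; where the digit is bright, the patch colors are inverted, so the stroke stays visible against any background.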

BioASQ (Biomedical Semantic Indexing and Question Answering)

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).

192 papers | 1 benchmark | Texts

MOTChallenge

The MOTChallenge datasets are designed for the task of multiple object tracking. There are several variants of the dataset released each year, such as MOT15, MOT17, MOT20.

192 papers | 0 benchmarks | Images, Videos

Exchange (Exchange Rate Multivariate)

Daily exchange rates of eight countries’ currencies against the US dollar, spanning from 1990 to 2016, with 7588 timesteps in total.

192 papers | 0 benchmarks | Time series

BC5CDR (BioCreative V CDR corpus)

BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.

191 papers | 2 benchmarks | Texts

FlyingChairs

"Flying Chairs" is a synthetic dataset with optical flow ground truth. It consists of 22872 image pairs and corresponding flow fields. Images show renderings of 3D chair models moving in front of random backgrounds from Flickr. Motions of both the chairs and the background are purely planar.

191 papers | 0 benchmarks

CMU-MOSEI

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence-level sentiment analysis and emotion recognition in online videos. CMU-MOSEI contains over 65 hours of annotated video from over 1000 speakers and 250 topics.

190 papers | 13 benchmarks | Audio, Images, Texts, Videos

XQuAD

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages (English plus the ten translations).

190 papers | 3 benchmarks | Texts

FewRel (Few-Shot Relation Classification Dataset)

The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three subsets: training set (64 relations), validation set (16 relations) and test set (20 relations).

189 papers | 13 benchmarks | Texts
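The FewRel split above is defined over relations rather than over individual instances, so no relation appears in more than one subset. A minimal sketch of such a relation-disjoint 64/16/20 partition is shown below. The relation identifiers, the function name `split_relations`, and the random assignment are illustrative assumptions; the official assignment of relations to subsets is fixed by the dataset release and is not reproduced here.

```python
import random


def split_relations(relations, sizes=(64, 16, 20), seed=0):
    """Partition relation labels into disjoint train/validation/test groups."""
    if sum(sizes) != len(relations):
        raise ValueError("split sizes must add up to the number of relations")
    shuffled = list(relations)
    # Random assignment is an assumption made for this sketch only.
    random.Random(seed).shuffle(shuffled)
    train_end = sizes[0]
    val_end = train_end + sizes[1]
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Hypothetical placeholder labels standing in for the 100 relations.
relations = [f"rel_{i:03d}" for i in range(100)]
train, val, test = split_relations(relations)
print(len(train), len(val), len(test))  # 64 16 20
```

Because the subsets share no relations, a model evaluated on the test relations has never seen them during training, which is what makes the setting few-shot.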

IHDP (Infant Health and Development Program)

The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visits from specialist doctors on the cognitive test scores of premature infants. The dataset was first used for benchmarking treatment effect estimation algorithms in Hill [35], where selection bias is induced by removing non-random subsets of the treated individuals to create an observational dataset, and the outcomes are generated using the original covariates and treatments. It contains 747 subjects and 25 variables.

189 papers | 2 benchmarks
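The way selection bias is induced in the IHDP benchmark, by removing a non-random subset of the treated individuals, can be illustrated with a toy sketch. The record fields (`id`, `treated`, `group`), the function name `induce_selection_bias`, and the removal rule are hypothetical and chosen only for illustration; they are not the covariates or the rule used in the benchmark itself.

```python
def induce_selection_bias(records, should_drop):
    """Drop treated records that match a predicate; keep all controls."""
    return [r for r in records if not (r["treated"] and should_drop(r))]


# Toy records, not real IHDP data.
records = [
    {"id": 1, "treated": True, "group": "a"},
    {"id": 2, "treated": True, "group": "b"},
    {"id": 3, "treated": False, "group": "a"},
    {"id": 4, "treated": False, "group": "b"},
]

# Remove treated individuals from group "b" only (a non-random subset).
biased = induce_selection_bias(records, lambda r: r["group"] == "b")
print([r["id"] for r in biased])  # [1, 3, 4]
```

After the removal, the treated and control groups no longer have the same covariate distribution, which is the property an observational benchmark needs.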

LRW (Lip Reading in the Wild)

The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, whose boundary is centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class, while the validation and test sets each contain 50 utterances per class.

188 papers | 63 benchmarks | Audio, Texts, Videos

RESISC45

RESISC45 is a dataset for Remote Sensing Image Scene Classification (RESISC). It contains 31,500 RGB images of size 256×256 divided into 45 scene classes, each class containing 700 images. Among its notable features, RESISC45 covers spatial resolutions ranging from about 20 cm to more than 30 m per pixel.

187 papers | 5 benchmarks | Images

CoNLL

The CoNLL datasets are widely used resources in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Computational Natural Language Learning. The datasets originate from a series of shared tasks organized at that conference.

187 papers | 0 benchmarks

Celeb-DF

Celeb-DF is a large-scale challenging dataset for deepfake forensics. It includes 590 original videos collected from YouTube with subjects of different ages, ethnic groups and genders, and 5639 corresponding DeepFake videos.

187 papers | 0 benchmarks | Images

SUNCG

SUNCG is a large-scale dataset of synthetic 3D scenes with dense volumetric annotations.

186 papers | 0 benchmarks | 3D, Images, RGB-D

SGD (Schema-Guided Dialogue)

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks in large-scale virtual assistants. In addition, the evaluation set includes unseen domains and services to quantify performance in zero-shot or few-shot settings.

186 papers | 3 benchmarks | Texts

Extended Yale B

The Extended Yale B database contains 2414 frontal-face images with size 192×168 over 38 subjects and about 64 images per subject. The images were captured under different lighting conditions and various facial expressions.

185 papers | 0 benchmarks | Images

Argoverse 2

Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion

185 papers | 7 benchmarks | LiDAR

Open Graph Benchmark

184 papers | 0 benchmarks

SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

183 papers | 1 benchmark | Texts
