Datasets

19,997 machine learning datasets

ImageNet-C

ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test-set.

602 papers · 9 benchmarks · Images
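
As a rough illustration of what an algorithmically generated corruption looks like, the sketch below applies additive Gaussian noise and a horizontal box blur to a tiny grayscale image stored as nested lists. The function names, the `sigma` and `radius` parameters, and the pure-Python image representation are assumptions made for this example; this is not the benchmark's own corruption code.

```python
import random

def gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to each pixel, clipping to [0, 255]."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [
        [min(255, max(0, round(px + rng.gauss(0, sigma)))) for px in row]
        for row in image
    ]

def box_blur_horizontal(image, radius=1):
    """Replace each pixel with the mean of its horizontal neighbourhood."""
    blurred = []
    for row in image:
        new_row = []
        for x in range(len(row)):
            lo, hi = max(0, x - radius), min(len(row), x + radius + 1)
            window = row[lo:hi]
            new_row.append(round(sum(window) / len(window)))
        blurred.append(new_row)
    return blurred

# Toy 2x5 image with a single bright column (hypothetical data).
clean = [[0, 0, 255, 0, 0],
         [0, 0, 255, 0, 0]]
noisy = gaussian_noise(clean, sigma=10)
blurred = box_blur_horizontal(clean)
```

A larger `sigma` or `radius` gives a stronger corruption, which is one simple way to think about graded severity levels.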

DBpedia

DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.

597 papers · 5 benchmarks · Graphs, Texts

Urban100

The Urban100 dataset contains 100 images of urban scenes. It is commonly used as a test set to evaluate the performance of super-resolution models. Image Source: http://vllab.ucmerced.edu/wlai24/LapSRN/

591 papers · 0 benchmarks · Images

VoxCeleb2

VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.

564 papers · 3 benchmarks · Audio, Images, Texts, Videos

WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

560 papers · 3 benchmarks · Texts

LVIS

LVIS is a dataset for long-tail instance segmentation. It has annotations for over 1,000 object categories in 164k images.

551 papers · 0 benchmarks · Images

VGGFace2 (Vggface2: A dataset for recognising faces across pose and age)

VGGFace2 is a large-scale face recognition dataset. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession. The identities span a wide range of ethnicities, accents, professions and ages. All face images are captured "in the wild", with pose and emotion variations and different lighting and occlusion conditions. The number of images per identity varies from 87 to 843, with an average of 362.

539 papers · 2 benchmarks · Images

SYNTHIA (SYNTHetic Collection of Imagery and Annotations)

The SYNTHIA dataset is a synthetic dataset that consists of 9,400 multi-viewpoint photo-realistic frames rendered from a virtual city, with pixel-level semantic annotations for 13 classes. Each frame has a resolution of 1280 × 960.

538 papers · 2 benchmarks · Images

D4RL

D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankaKitchen and CARLA.

538 papers · 3 benchmarks · Environment

VGG-Face2 (Vggface2: A dataset for recognising faces across pose and age)

533 papers · 0 benchmarks · Images

Set14

The Set14 dataset consists of 14 images commonly used for testing the performance of image super-resolution models. Image Source: https://www.ece.rice.edu/~wakin/images/

532 papers · 4 benchmarks · Images

MT-Bench

This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions.

531 papers · 0 benchmarks · Texts
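
Pairwise preference data of this kind is often summarized as a per-model win rate. The sketch below assumes a hypothetical record layout of `(model_a, model_b, winner)`, where `winner` is `"model_a"`, `"model_b"`, or `"tie"`; the field layout, the model names, and the sample votes are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def win_rates(votes):
    """Return each model's fraction of wins over the comparisons it took part in.

    Ties count as a comparison for both models but as a win for neither.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, winner in votes:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "model_a":
            wins[model_a] += 1
        elif winner == "model_b":
            wins[model_b] += 1
    return {model: wins[model] / games[model] for model in games}

# Invented votes, for illustration only.
sample_votes = [
    ("model-x", "model-y", "model_a"),
    ("model-x", "model-z", "tie"),
    ("model-y", "model-z", "model_b"),
    ("model-x", "model-y", "model_a"),
]
rates = win_rates(sample_votes)
```

Counting a tie as a non-win for both sides is one design choice among several; treating ties as half a win, or dropping them, would change the numbers.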

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human-written abstractive summary bullets from news stories on the CNN and Daily Mail websites serve as questions (with one of the entities hidden), and the stories serve as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract and generate pairs of passages and questions from these websites.

530 papers · 7 benchmarks · Texts
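
The question construction described above, hiding one entity in a summary bullet, can be sketched as follows. The helper name `make_cloze` and the `@placeholder` token are assumptions for this example, and the sample bullet is invented; the authors' released scripts may differ in detail.

```python
def make_cloze(summary_bullet, entity, placeholder="@placeholder"):
    """Turn a summary bullet into a fill-in-the-blank question.

    Returns (question, answer), or None if the entity does not occur.
    Only the first occurrence of the entity is hidden.
    """
    if entity not in summary_bullet:
        return None
    question = summary_bullet.replace(entity, placeholder, 1)
    return question, entity

# Invented example, not taken from the dataset.
bullet = "Acme Corp announces a new plant in Springfield"
question, answer = make_cloze(bullet, "Springfield")
```

The story text then plays the role of the passage, and a system is scored on whether it recovers `answer` for the hidden slot.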

Places205

The Places205 dataset is a large-scale scene-centric dataset with 205 common scene categories. The training dataset contains around 2,500,000 images from these categories. In the training set, each scene category has between 5,000 and 15,000 images. The validation set contains 100 images per category (a total of 20,500 images), and the testing set includes 200 images per category (a total of 41,000 images).

525 papers · 1 benchmark · Images
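
The split sizes quoted above follow directly from the per-category counts. The short check below reproduces the arithmetic; the constant names are chosen for this example only.

```python
NUM_CATEGORIES = 205
VAL_PER_CATEGORY = 100
TEST_PER_CATEGORY = 200

val_total = NUM_CATEGORIES * VAL_PER_CATEGORY    # 20,500 validation images
test_total = NUM_CATEGORIES * TEST_PER_CATEGORY  # 41,000 test images

# The per-category training bounds imply bounds on the training total.
train_min = NUM_CATEGORIES * 5_000    # 1,025,000
train_max = NUM_CATEGORIES * 15_000   # 3,075,000
approx_train_total = 2_500_000
within_bounds = train_min <= approx_train_total <= train_max
```

The quoted training total of about 2.5 million images sits inside the range implied by the 5,000 to 15,000 per-category limits.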

DreamBooth

The DreamBooth dataset is a collection of images used for fine-tuning text-to-image diffusion models for subject-driven generation.

523 papers · 3 benchmarks

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7, released in 2020, consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.

520 papers · 4 benchmarks · Audio, Texts
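
UD treebanks are distributed in the CoNLL-U format, where each token is one line of 10 tab-separated columns. The sketch below parses a single token line into the annotation layers listed above. The function name and the sample line are illustrative assumptions, and multiword-token ranges (such as `1-2`) and empty nodes are deliberately not handled.

```python
def parse_conllu_token(line):
    """Parse one CoNLL-U token line (10 tab-separated columns) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    token_id, form, lemma, upos, xpos, feats, head, deprel, _deps, _misc = cols
    # FEATS is "_" when empty, otherwise "Key=Value" pairs joined by "|".
    if feats == "_":
        features = {}
    else:
        features = dict(pair.split("=", 1) for pair in feats.split("|"))
    return {
        "id": int(token_id),   # fails on ranges like "1-2" (not handled here)
        "form": form,
        "lemma": lemma,
        "upos": upos,
        "xpos": xpos,
        "feats": features,
        "head": int(head),     # 0 means the token is the sentence root
        "deprel": deprel,
    }

# Invented sample line, for illustration only.
sample = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"
token = parse_conllu_token(sample)
```

For real treebank files, comment lines starting with `#` and the blank lines between sentences would need to be skipped before calling this function.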

FGVC-Aircraft

FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The (main) aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a four-level hierarchy, from finer to coarser.

520 papers · 5 benchmarks · Images

FEVER (Fact Extraction and VERification)

FEVER is a publicly available dataset for fact extraction and verification against textual sources.

498 papers · 4 benchmarks · Texts

MPII (MPII Human Pose)

The MPII Human Pose Dataset for single-person pose estimation is composed of about 25K images, of which 15K are training samples, 3K are validation samples and 7K are testing samples (whose labels are withheld by the authors). The images are taken from YouTube videos covering 410 different human activities, and the poses are manually annotated with up to 16 body joints.

495 papers · 3 benchmarks · Images

CIFAR-10C

A common-corruptions dataset for CIFAR-10.

494 papers · 3 benchmarks · Images
