Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

OST (Egocentric Dataset)

One of the largest egocentric datasets for the object search task with eye-tracking information available.

3 papers · 0 benchmarks

DMQA (DeepMind Q&A)

The DeepMind Q&A Dataset consists of two datasets for Question Answering, CNN and DailyMail. Each dataset contains many documents (90k and 197k, respectively), and each document is accompanied by approximately 4 questions on average. Each question is a sentence with one missing word or phrase, which can be found in the accompanying document/context.

3 papers · 0 benchmarks · Texts
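
The question format described for DMQA (a sentence with one missing word or phrase that can be recovered from the document) can be sketched as follows. This is only an illustration: the `ClozeExample` class, the `make_cloze` and `fill` helpers, and the `@placeholder` marker are assumptions made for this example, not the dataset's actual file format or loader.

```python
from dataclasses import dataclass

# Hypothetical marker for the missing span; chosen for illustration only.
PLACEHOLDER = "@placeholder"


@dataclass
class ClozeExample:
    document: str  # the accompanying context
    question: str  # a sentence with one span replaced by PLACEHOLDER
    answer: str    # the missing word or phrase


def make_cloze(document: str, sentence: str, answer: str) -> ClozeExample:
    """Build a cloze question by masking the first occurrence of `answer`."""
    if answer not in sentence:
        raise ValueError("answer must appear in the sentence")
    if answer not in document:
        raise ValueError("answer must be recoverable from the document")
    question = sentence.replace(answer, PLACEHOLDER, 1)
    return ClozeExample(document=document, question=question, answer=answer)


def fill(example: ClozeExample) -> str:
    """Restore the full sentence by substituting the answer back in."""
    return example.question.replace(PLACEHOLDER, example.answer, 1)


example = make_cloze(
    document="The rover landed on Mars after a seven-month journey.",
    sentence="A rover landed on Mars.",
    answer="Mars",
)
print(example.question)
print(fill(example))
```

A model for this task would receive `document` and `question` and be asked to predict `answer`.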

DOGC

Intended to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies.

3 papers · 0 benchmarks

ECUSTFD (ECUST Food Dataset)

The ECUST Food Dataset is a food recognition dataset that contains 2,978 images.

3 papers · 0 benchmarks · Images

esXNLI

esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.

3 papers · 0 benchmarks · Texts

ETH Py150 Open

A massive, deduplicated corpus of 7.4M Python files from GitHub.

3 papers · 0 benchmarks

EXEQ-300k

The EXEQ-300k dataset contains 290,479 detailed questions with corresponding math headlines from Mathematics Stack Exchange. The dataset can be used to generate concise math headlines from detailed math questions.

3 papers · 0 benchmarks · Texts

FB15k-237-low

The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.

3 papers · 0 benchmarks · Graphs
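
The idea behind FB15k-237-low, keeping only relations that have few triplets, can be illustrated with a short sketch. The function name `keep_low_frequency_relations` and the `max_count` threshold are hypothetical; the listing does not state the cutoff or procedure actually used to build the dataset.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def keep_low_frequency_relations(triples: List[Triple], max_count: int) -> List[Triple]:
    """Keep triples whose relation occurs at most `max_count` times.

    `max_count` is an assumed, illustrative threshold.
    """
    relation_counts = Counter(relation for _, relation, _ in triples)
    return [t for t in triples if relation_counts[t[1]] <= max_count]


toy_graph = [
    ("a", "born_in", "b"),
    ("c", "born_in", "d"),
    ("e", "born_in", "f"),
    ("a", "award_won", "g"),
]
print(keep_low_frequency_relations(toy_graph, max_count=1))
```

On the toy graph, only the single `award_won` triple survives, since `born_in` occurs three times.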

FCDB (Fashion Culture DataBase)

Consists of 76 million geo-tagged images in 16 cosmopolitan cities.

3 papers · 0 benchmarks

Fraxtil

Fraxtil is an audio dataset for the task of producing a choreography step chart from a raw audio track, similar to the charts used in the Dance Dance Revolution video game. It contains 90 songs choreographed by a single author, with 450 charts in total.

3 papers · 0 benchmarks · Audio

GeBioCorpus

A high-quality dataset for machine translation evaluation that aims at being one of the first non-synthetic gender-balanced test datasets.

3 papers · 0 benchmarks · Texts

Goldfinch (GOogLe image-search Dataset)

Goldfinch is a dataset for fine-grained recognition challenges. It contains a list of bird, butterfly, aircraft, and dog categories with relevant Google image search and Flickr search URLs. In addition, it also includes a set of active learning annotations on dog categories.

3 papers · 0 benchmarks · Images

HASY

HASY is a dataset of single symbols similar to MNIST. It contains 168,233 instances of 369 classes. HASY contains two challenges: a classification challenge with 10 pre-defined folds for 10-fold cross-validation, and a verification challenge.

3 papers · 0 benchmarks · Images
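
The 10-fold cross-validation protocol mentioned for HASY's classification challenge can be sketched generically: each pre-defined fold serves once as the test set while the remaining folds form the training set. The function `cross_validation_splits` and the toy folds below are assumptions for illustration and do not reflect how HASY's fold files are actually stored.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cross_validation_splits(
    folds: Sequence[Sequence[T]],
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs, holding out each pre-defined fold once."""
    for held_out, test_fold in enumerate(folds):
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, list(test_fold)


# Toy stand-in for 10 pre-defined folds: two sample ids per fold.
toy_folds = [[2 * i, 2 * i + 1] for i in range(10)]
for train, test in cross_validation_splits(toy_folds):
    print(len(train), len(test))
```

Reported accuracy under this protocol is typically the average over the ten held-out folds.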

HCU400

The dataset consists of the features associated with 402 5-second sound samples. The 402 sounds range from easily identifiable everyday sounds to intentionally obscured artificial ones. The dataset aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings.

3 papers · 0 benchmarks · Audio

Hollywood 3D dataset

A dataset for benchmarking action recognition algorithms in natural environments while making use of 3D information. The dataset contains around 650 video clips across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies that extend to the 3D data are also proposed.

3 papers · 0 benchmarks

HotelRec

The largest publicly available dataset in the hotel domain (50M versus 0.9M) and, additionally, the largest recommendation dataset in a single domain with textual reviews (50M versus 22M).

3 papers · 0 benchmarks

HSD (Honda Scenes Dataset)

An annotated dataset released to enable dynamic scene classification. It includes 80 hours of diverse, high-quality driving video clips collected in the San Francisco Bay Area, with temporal annotations for road places, road types, weather, and road surface conditions.

3 papers · 0 benchmarks · Images

iFakeFaceDB

iFakeFaceDB is a face image dataset for the study of synthetic face manipulation detection, comprising about 87,000 synthetic face images generated by the StyleGAN model and transformed with the GANprintR approach. All images were aligned and resized to 224 × 224.

3 papers · 0 benchmarks · Images

Image and Video Advertisements

The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and the symbolic references ads make (e.g. a dove symbolizes peace).

3 papers · 0 benchmarks

IndicNLP Corpus

The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.

3 papers · 0 benchmarks · Texts