Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

19,997 dataset results

GRB (Graph Robustness Benchmark)

The Graph Robustness Benchmark (GRB) provides scalable, unified, modular, and reproducible evaluation of the adversarial robustness of graph machine learning models. GRB offers curated datasets, a unified evaluation pipeline, a modular coding framework, and reproducible leaderboards, which facilitate the development of graph adversarial learning by summarizing existing progress and generating insights for future research.

6 papers · 0 benchmarks · Graphs

3D Lane Synthetic Dataset

This is a synthetic dataset constructed to stimulate the development and evaluation of 3D lane detection methods.

6 papers · 0 benchmarks

WikiNEuRal

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

6 papers · 0 benchmarks · Texts

MMPTRACK (Multi-camera Multiple People Tracking Dataset)

The Multi-camera Multiple People Tracking (MMPTRACK) dataset contains about 9.6 hours of video, with over half a million frame-wise annotations. The dataset is densely annotated: per-frame bounding boxes and person identities are available, as well as camera calibration parameters. The videos were recorded at 15 frames per second (FPS) in five diverse and challenging environments: retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset.

6 papers · 1 benchmark · Tracking, Videos

MDBD (Multicue Dataset for Edge Detection)

To study the interaction of several early visual cues (luminance, color, stereo, motion) during boundary detection in challenging natural scenes, the authors built a multi-cue video dataset composed of short binocular video sequences of natural scenes, captured with a consumer-grade Fujifilm stereo camera (Mély, Kim, McGill, Guo and Serre, 2016). A variety of places (from university campuses to street scenes and parks) and seasons were considered to minimize possible biases, and more challenging scenes for boundary detection were captured by framing a few dominant objects in each shot under a variety of appearances. The dataset contains 100 scenes, each consisting of a short (10-frame) color sequence for the left and right views. Each sequence was sampled at 30 frames per second, and each frame has a resolution of 1280 by 720 pixels.

6 papers · 4 benchmarks

NASA C-MAPSS-2 (Turbofan Engine Degradation Simulation Data Set-2)

The generation of data-driven prognostics models requires datasets with run-to-failure trajectories. To contribute to the development of such methods, this dataset provides new, realistic run-to-failure trajectories for a small fleet of aircraft engines under realistic flight conditions. The damage propagation modelling used to generate this synthetic dataset builds on the modeling strategy from previous work. The dataset was generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model and is provided by the Prognostics CoE at NASA Ames in collaboration with ETH Zurich and PARC.

6 papers · 1 benchmark · Time series

Concepticon (Concepticon. A Resource for the Linking of Concept Lists)

Concepticon links concept labels from different concept lists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining relations between concepts, for example the relations between concept sets linked to the concept set SIBLING. The resource can be used for various purposes: serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it gives researchers quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.

6 papers · 0 benchmarks · Tabular

PRONOSTIA Bearing Dataset

The PRONOSTIA (also called FEMTO) bearing dataset consists of 17 accelerated run-to-failure experiments on a small bearing test rig. Both acceleration and temperature data were collected for each experiment.

6 papers · 0 benchmarks · Time series

LSA64 (LSA64: A Dataset for Argentinian Sign Language)

This sign database for Argentinian Sign Language, created with the goal of producing a dictionary for LSA and training an automatic sign recognizer, includes 3,200 videos in which 10 non-expert subjects executed 5 repetitions of 64 different types of signs. Signs were selected from among the most commonly used in the LSA lexicon, including both verbs and nouns.

6 papers · 1 benchmark

CeyMo

CeyMo is a novel benchmark dataset for road marking detection that covers a wide variety of challenging urban, suburban, and rural road scenarios. The dataset consists of 2,887 images of 1920 × 1080 resolution with 4,706 road marking instances belonging to 11 classes. The test set is divided into six categories: normal, crowded, dazzle light, night, rain, and shadow.

6 papers · 1 benchmark · Images

ASCEND

ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality corpus of spontaneous multi-turn conversational Chinese-English code-switching dialogue collected in Hong Kong. ASCEND includes 23 bilingual speakers fluent in both Chinese and English and comprises 10.62 hours of clean speech.

6 papers · 0 benchmarks · Audio, Speech

Common Phone

Common Phone is a gender-balanced, multilingual corpus recorded from more than 76,000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.

6 papers · 0 benchmarks · Speech

MARIDA (Marine Debris Archive)

MARIDA (Marine Debris Archive) is the first dataset based on multispectral Sentinel-2 (S2) satellite data that distinguishes marine debris from various co-existing marine features, including Sargassum macroalgae, ships, natural organic material, waves, wakes, foam, dissimilar water types (i.e., clear, turbid, sediment-laden, and shallow water), and clouds. MARIDA is an open-access dataset that enables the research community to explore the spectral behaviour of certain floating materials, sea-state features, and water types; to develop and evaluate marine debris detection solutions based on artificial intelligence and deep learning architectures; and to build satellite pre-processing pipelines. Although it is designed to be beneficial for several machine learning tasks, it primarily aims to benchmark weakly supervised pixel-level semantic segmentation methods.

6 papers · 6 benchmarks · Images

Colored-MNIST (with spurious correlation)

This is a dataset with spurious correlations that can be used to evaluate machine learning methods for out-of-distribution generalization, causal inference, and related fields.

6 papers · 1 benchmark
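The spurious-correlation construction behind Colored-MNIST variants can be sketched as follows: a color channel is assigned so that it agrees with a (binarized) label with high probability in training and low probability at test time, so a model that shortcuts on color fails out of distribution. This is a minimal NumPy sketch, not the dataset's exact recipe; the binarization rule, correlation strengths, and use of random arrays in place of real MNIST images are illustrative assumptions.

```python
import numpy as np

def colorize(images, labels, corr, rng):
    """Embed grayscale images into a 2-channel tensor whose active
    channel ("color") agrees with the binarized label with
    probability `corr`."""
    n = len(images)
    y = (labels >= 5).astype(int)          # binarize: digits 0-4 -> 0, 5-9 -> 1
    flip = rng.random(n) > corr            # flip the color with prob. 1 - corr
    color = np.where(flip, 1 - y, y)
    out = np.zeros((n, 2) + images.shape[1:], dtype=images.dtype)
    out[np.arange(n), color] = images      # place each digit in its channel
    return out, y

rng = np.random.default_rng(0)
# Stand-in for MNIST: random 28x28 "digits" with random labels.
fake_imgs = rng.integers(1, 255, size=(1000, 28, 28)).astype(np.uint8)
fake_lbls = rng.integers(0, 10, size=1000)
# Training env: color is ~90% predictive of the label; test env: ~10%.
train_x, train_y = colorize(fake_imgs, fake_lbls, 0.9, rng)
test_x, test_y = colorize(fake_imgs, fake_lbls, 0.1, rng)
```

Because the color-label correlation reverses between environments, a classifier that relies on the color channel will score well on `train_x` but poorly on `test_x`, which is exactly the failure mode such benchmarks probe.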

SemEval-2020 Task-8

A multimodal dataset for sentiment analysis on internet memes.

6 papers · 0 benchmarks · Images, Texts

DSC (10 tasks) (Task Incremental Document Sentiment Classification)

A set of 10 DSC datasets (reviews of 10 products) used to produce sequences of tasks. The products are Sports, Toys, Tools, Video, Pet, Musical, Movies, Garden, Offices, and Kindle. Each task has 2,500 positive and 2,500 negative training reviews; the validation and test sets each contain 250 positive and 250 negative reviews. Detailed statistics are available at https://github.com/ZixuanKe/PyContinual

6 papers · 1 benchmark

F-CelebA (10 tasks) (Federated-CelebA (10 tasks))

F-CelebA is adapted from federated learning, an emerging machine learning paradigm with an emphasis on data privacy: models are trained through model aggregation rather than conventional data aggregation, so local data stays on the local device. The dataset naturally consists of similar tasks; each of the 10 tasks contains images of a celebrity labeled by whether he/she is smiling or not. For more details, see https://github.com/ZixuanKe/CAT

6 papers · 1 benchmark
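The "model aggregation rather than data aggregation" idea described above can be sketched as weighted parameter averaging in the style of FedAvg: each client trains locally, and only parameter arrays (never raw examples) are combined on the server. This is a minimal sketch under that assumption; the function name and data layout are illustrative, not part of the dataset's tooling.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Combine per-client model parameters into a global model by
    averaging each parameter array, weighted by the number of local
    examples. No raw data leaves the clients; only weights do."""
    total = sum(client_sizes)
    # Start from zeros shaped like the first client's parameter list.
    avg = [np.zeros_like(w, dtype=float) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for acc, w in zip(avg, weights):
            acc += (n / total) * w         # weight by local dataset size
    return avg

# Toy round: two clients, one parameter array each, sizes 1 and 3.
global_weights = fedavg(
    [[np.array([0.0, 0.0])], [np.array([4.0, 8.0])]],
    [1, 3],
)
```

The server then broadcasts the averaged weights back to clients for the next round; repeating this local-train/aggregate cycle is what lets tasks like F-CelebA's per-celebrity splits be trained without pooling images centrally.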

DLR-ACD

The DLR-ACD dataset is a collection of aerial images for crowd counting and density estimation, as well as for person localization at mass events. It contains 33 large aerial images acquired through 16 different flight campaigns at various mass events and over urban scenes involving crowds, such as sports events, city centers, open-air fairs, and festivals.

6 papers · 6 benchmarks

GLips (German Lips)

The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset: each H264-compressed MPEG-4 video encodes one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between the two datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging from 4 to 18 characters in length, each have 500 instances and separate MPEG-4 audio and text metadata files, originating from 1,018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also provided.

6 papers · 0 benchmarks · Audio, Texts, Videos

Reddit Conversation Corpus

The Reddit Conversation Corpus (RCC) consists of conversations scraped from Reddit over a 20-month period from November 2016 until August 2018. To ensure the quality and diversity of topics, conversations are collected from 95 selected subreddits. In total, RCC contains 9.2 million 3-turn conversations.

6 papers · 0 benchmarks · Texts

Page 201 of 1000