Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

19,997 dataset results

CLINC-Single-Domain-OOS

A dataset with two separate domains, the "Banking" domain and the "Credit cards" domain, each with both general Out-of-Scope (OOD-OOS) queries and In-Domain but Out-of-Scope (ID-OOS) queries, where ID-OOS queries are semantically similar to in-scope intents. Each domain in CLINC150 originally includes 15 intents. In this dataset, each domain includes ten in-scope intents, and the ID-OOS queries are built from the five held-out in-scope intents.

3 papers | 0 benchmarks

Multilingual TOP

Multilingual TOP is a dataset for multilingual semantic parsing with human-written sentences rather than machine-translated ones. The sentences are in English, Italian, and Japanese, and the dataset is based on the Facebook Task Oriented Parsing (TOP) dataset.

3 papers | 0 benchmarks | Texts

EMOTyDA (Emotion aware Dialogue Act)

EMOTyDA is a multimodal Emotion aware Dialogue Act dataset collected from open-sourced dialogue datasets.

3 papers | 1 benchmark | Texts

Dirty-MNIST

DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. AmbiguousMNIST contains additional ambiguous digits with varying ambiguity. The AmbiguousMNIST test set contains 60k ambiguous samples as well.

3 papers | 0 benchmarks | Images

HuRDL (Human-Robot Dialogue Learning Corpus)

The Human-Robot Dialogue Learning (HuRDL) Corpus is a dataset about asking questions in situated task-based interactions. It is a dialogue corpus collected in an online interactive virtual environment in which human participants play the role of a robot performing a collaborative tool-organization task.

3 papers | 0 benchmarks | Texts

FetReg

Fetoscopic Placental Vessel Segmentation and Registration (FetReg) is a large-scale multi-centre dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms for the fetal environment with a focus on creating drift-free mosaics from long duration fetoscopy videos.

3 papers | 0 benchmarks | Images, Medical

BBBC039

This image set is part of a high-throughput chemical screen on U2OS cells, with examples of 200 bioactive compounds. The effect of the treatments was originally imaged using the Cell Painting assay (fluorescence microscopy). This data set only includes the DNA channel of a single field of view per compound. These images present a variety of nuclear phenotypes, representative of high-throughput chemical perturbations. The main use of this data set is the study of segmentation algorithms that can separate individual nucleus instances in an accurate way, regardless of their shape and cell density. The collection has around 23,000 single nuclei manually annotated to establish a ground truth collection for segmentation evaluation.

3 papers | 0 benchmarks | Images

LARC (Language-annotated Abstraction and Reasoning)

LARC is a dataset built from ARC (Abstraction and Reasoning Corpus). ARC is a set of tasks that tests an agent's ability to flexibly solve novel problems. While most ARC tasks are easy for humans, they are challenging for state-of-the-art AI.

3 papers | 0 benchmarks | Texts

EuroCrops

EuroCrops is a dataset for automatic vegetation classification from multi-spectral and multi-temporal satellite data, annotated with official LPIS reporting data from countries of the European Union, curated by the Technical University of Munich and GAF AG. The project is managed by the DLR Space Administration and funded by BMWi (Federal Ministry for Economic Affairs and Energy). The dataset is publicly available for research purposes, with the aim of assisting the subsidy control of agricultural self-declarations.

3 papers | 0 benchmarks | Hyperspectral images

Imgur5K

Imgur5K is a large-scale handwritten in-the-wild dataset containing challenging real-world handwritten samples from nearly 5K writers. It consists of ~135K handwritten English words from 5K different images. Unlike existing OCR datasets, which have limited variability in their images, the images in Imgur5K contain a diverse set of styles.

3 papers | 0 benchmarks | Images

EMOVIE

EMOVIE is a Mandarin emotional speech dataset of 9,724 samples, each with an audio file and a human-labeled emotion annotation.

3 papers | 0 benchmarks | Speech

Fishnet Open Images

Fishnet Open Images Database is a large dataset of electronic monitoring (EM) imagery for fish detection and fine-grained categorisation onboard commercial fishing vessels. The dataset consists of 86,029 images containing 34 object classes, making it the largest and most diverse public dataset of fisheries EM imagery to date. It includes many of the characteristic challenges of EM data: visual similarity between species, skewed class distributions, harsh weather conditions, and chaotic crew activity.

3 papers | 0 benchmarks | Images

Text-to-3D House Model

The dataset contains 2,000 houses, 13,478 rooms, and 873 texture images with corresponding natural language descriptions. There are fewer texture images than rooms because some rooms share the same texture. The descriptions are first generated from pre-defined templates and then refined by human workers. The average length of a description is 173.73 and there are 193 unique words. In our experiments, we use 1,600 pairs for training and 400 for testing in building layout generation. For texture synthesis, we use 503 samples for training and 370 for testing.

3 papers | 0 benchmarks

RuShiftEval

RuShiftEval is a manually annotated lexical semantic change dataset for Russian. Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods, while the previous work either used only two time periods, or different sets of target words.

3 papers | 0 benchmarks

rSoccer

rSoccer is an open-source simulator for the IEEE Very Small Size Soccer and the Small Size League optimized for reinforcement learning experiments.

3 papers | 0 benchmarks | Environment

VideoMatting108

VideoMatting108 is a large-scale video matting and trimap generation dataset with 80 training and 28 validation foreground video clips with ground-truth alpha mattes.

3 papers | 0 benchmarks | Videos

BarkNet 1.0

23,000 cropped images of tree bark from 23 species of trees around Quebec City, Canada. The images were captured at a distance of 20-60 cm from the trunk. Labels include the individual tree ID, its species, and its DBH (diameter at breast height). Pictures were taken with four different devices: Nexus 5, Samsung Galaxy S5, Samsung Galaxy S7, and a Panasonic Lumix DMC-TS5 camera. The dataset is large enough to train a deep network such as ResNet for species recognition.

3 papers | 0 benchmarks | Images

Epilepsy seizure prediction

The original dataset from the reference consists of 5 folders, each with 100 files, with each file representing a single subject. Each file is a recording of brain activity lasting 23.6 seconds. The corresponding time series is sampled into 4097 data points, each of which is the value of the EEG recording at a different point in time. In total, there are 500 individuals, each with 4097 data points covering 23.6 seconds.
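
The counts in this description imply a total recording count and an approximate sampling rate. The following is a minimal sketch of that arithmetic only; the constant and function names are hypothetical, and it assumes the 23.6-second duration stated for each file. It does not load or depend on the actual data files.

```python
# Figures taken from the dataset description above.
N_FOLDERS = 5            # folders in the original dataset
FILES_PER_FOLDER = 100   # one file per subject
POINTS_PER_FILE = 4097   # samples in each recording
DURATION_S = 23.6        # assumed recording length in seconds


def summarize_eeg_layout():
    """Return (recordings, total data points, implied sampling rate in Hz)."""
    n_recordings = N_FOLDERS * FILES_PER_FOLDER
    total_points = n_recordings * POINTS_PER_FILE
    sampling_rate_hz = POINTS_PER_FILE / DURATION_S
    return n_recordings, total_points, sampling_rate_hz


if __name__ == "__main__":
    recordings, points, rate = summarize_eeg_layout()
    print(f"{recordings} recordings, {points} data points, ~{rate:.1f} Hz")
```

Under these assumptions the implied sampling rate is roughly 173.6 Hz.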

3 papers | 1 benchmark

TrajAir: A General Aviation Trajectory Dataset

This dataset contains aircraft trajectories in an untowered terminal airspace collected over 8 months surrounding the Pittsburgh-Butler Regional Airport [ICAO:KBTP], a single runway GA airport, 10 miles North of the city of Pittsburgh, Pennsylvania. The trajectory data is recorded using an on-site setup that includes an ADS-B receiver. The trajectory data provided spans days from 18 Sept 2020 till 23 Apr 2021 and includes a total of 111 days of data discounting downtime, repairs, and bad weather days with no traffic. Data is collected starting at 1:00 AM local time to 11:00 PM local time. The dataset uses an Automatic Dependent Surveillance-Broadcast (ADS-B) receiver placed within the airport premises to capture the trajectory data. The receiver uses both the 1090 MHz and 978 MHz frequencies to listen to these broadcasts. The ADS-B uses satellite navigation to produce accurate location and timestamp for the targets which is recorded on-site using our custom setup. Weather data during t

3 papers | 2 benchmarks | Tabular, Time series

PDEs (Some PDE solutions)

In this dataset, you will find solutions of the following partial differential equations:

  • Burgers
  • Korteweg-de Vries
  • Newell-Whitehead
  • Kuramoto-Sivashinsky

More information about how these were generated is available in the supplementary material of the paper: https://arxiv.org/abs/2106.11936
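
For orientation, commonly cited one-dimensional textbook forms of these four equations are given below. This is an assumption for illustration only: the description does not state the coefficients, scalings, or boundary conditions used to generate the dataset, so the forms in the paper's supplementary material may differ. Here $u(x,t)$ is the solution and $\nu$ is a viscosity parameter.

```latex
\begin{align}
  \text{Burgers:} \quad
    & u_t + u\,u_x = \nu\,u_{xx} \\
  \text{Korteweg-de Vries:} \quad
    & u_t + 6\,u\,u_x + u_{xxx} = 0 \\
  \text{Newell-Whitehead:} \quad
    & u_t = u_{xx} + u - u^3 \\
  \text{Kuramoto-Sivashinsky:} \quad
    & u_t + u\,u_x + u_{xx} + u_{xxxx} = 0
\end{align}
```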

3 papers | 0 benchmarks