Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

19,997 dataset results

Extended Agriculture-Vision

Extended Agriculture-Vision dataset comprises two parts:

2 papers | 0 benchmarks

Large COVID-19 CT scan slice dataset

We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the acknowledgements. These datasets have been publicly used in the COVID-19 diagnosis literature and have proven effective in deep learning applications. Therefore, the merged dataset is expected to improve the generalization ability of deep learning methods by learning from all these resources together.

2 papers | 21 benchmarks | Images

PKU (License Plate Detection)

The PKU dataset has almost 4,000 images categorized into five groups (G1-G5) that show different situations. For example, G1 has images of highways during the day with only one car in them. On the other hand, G5 has images of crosswalks during the day or at night with multiple cars and license plates (LPs).

2 papers | 0 benchmarks | Images

RADIOML 2018.01A

RADIOML 2018.01A is a validated dataset that includes synthetic simulated channel effects for 24 digital and analog modulation types.

2 papers | 0 benchmarks

SuHiFiMask (Surveillance High-Fidelity Mask)

SuHiFiMask (Surveillance High-Fidelity Mask) extends face anti-spoofing (FAS) to real surveillance scenes rather than mimicking low-resolution images and surveillance environments. It contains 10,195 videos from 101 subjects of different age groups, collected by 7 mainstream surveillance cameras.

2 papers | 0 benchmarks | Videos

Wikipedia Math Essentials

Contains Wikipedia pages about popular mathematics topics; edges describe the links from one page to another. Features describe the number of daily visits between 2019 and March 2021.

2 papers | 0 benchmarks

Code comments in Java, Python, and Pharo

Contains class comments extracted from various projects in three programming languages: Java, Pharo, and Python.

2 papers | 0 benchmarks

New Plant Diseases Dataset (Image dataset containing different healthy and unhealthy crop leaves.)

This dataset was recreated using offline augmentation from the original dataset. The original dataset can be found on this GitHub repo. The dataset consists of about 87K RGB images of healthy and diseased crop leaves, categorized into 38 different classes. The full dataset is split 80/20 into training and validation sets, preserving the directory structure. A new directory containing 33 test images was created later for prediction purposes.

2 papers | 1 benchmark | Images
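
The 80/20 split described for the New Plant Diseases Dataset might be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the seed, and the class names are hypothetical, and the split is applied per class so that each class directory appears in both sets.

```python
import random


def split_by_class(files_by_class, train_ratio=0.8, seed=0):
    """Split each class's file list into train/valid parts, keyed by class name.

    files_by_class maps a class directory name to its image file names.
    All names here are illustrative assumptions, not the dataset's real layout.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = {}, {}
    for class_name, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so input order does not matter
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train[class_name] = shuffled[:cut]
        valid[class_name] = shuffled[cut:]
    return train, valid


if __name__ == "__main__":
    demo = {"class_a": [f"a_{i}.jpg" for i in range(10)]}
    train, valid = split_by_class(demo)
    print(len(train["class_a"]), len(valid["class_a"]))
```

Keeping the class name as the key in both outputs mirrors the "preserving the directory structure" remark: each class would map to one subdirectory under the training and validation roots.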

vReLoc

A total of 18 sequences of various lengths were collected. Since the Velodyne LiDAR, RealSense camera, and Vicon motion tracker system run at different frequencies, we synchronized these systems so that the image and LiDAR data at each timestamp have the same 6-DoF pose. For the static scenario, there are no moving objects in the scene. For the other scenarios, there are people randomly walking in the scene. Sequences 01-10 come from the static environment, sequences 11-15 are the one-person moving scenario, and sequences 16-18 are the two-person moving scenario.

2 papers | 0 benchmarks
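
The vReLoc sequence-to-scenario assignment can be written as a small lookup. This is a sketch only: the function name and the scenario labels are hypothetical, and only the sequence ranges come from the description.

```python
def scenario_for_sequence(sequence_number):
    """Return the scenario label for a vReLoc sequence number (1 to 18).

    The ranges follow the dataset description; the label strings are
    illustrative assumptions.
    """
    if 1 <= sequence_number <= 10:
        return "static"
    if 11 <= sequence_number <= 15:
        return "one-person moving"
    if 16 <= sequence_number <= 18:
        return "two-person moving"
    raise ValueError(f"vReLoc has sequences 01-18, got {sequence_number}")


if __name__ == "__main__":
    for number in (1, 12, 18):
        print(f"{number:02d}", scenario_for_sequence(number))
```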

SIMARA (SIMARA: a database for key-value information extraction from full-page handwritten documents)

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.

2 papers | 5 benchmarks | Images, Texts

Industry Biscuit (Cookie) dataset (Industrial style dataset for the anomaly detection)

The Industrial Biscuits (Cookie) dataset is our internal dataset designed for the anomaly detection task, which captures Tarallini biscuits. It contains 1225 samples in four classes with the following structure:

2 papers | 0 benchmarks

Reverb-WSJ0

Noiseless reverberant dataset built from the public WSJ0 corpus and room impulse responses simulated with the PyRoomAcoustics library. Used in:

  • Speech Enhancement and Dereverberation with Diffusion-based Generative Models, Richter et al., arXiv 2022
  • StoRM: A Stochastic Regeneration Model for Speech Enhancement and Dereverberation, Lemercier et al., arXiv 2022
  • Analysing Discriminative versus Diffusion-based Generative Models for Speech Restoration, Lemercier et al., ICASSP 2023

2 papers | 0 benchmarks

ContactArt

ContactArt is a dataset for learning hand-object interaction priors for hand and articulated object pose estimation. The dataset is created using visual teleoperation, where the human operator can directly play within a physical simulator to manipulate the articulated objects. All the object models are from the PartNet dataset, which makes scaling up convenient. ContactArt provides accurate annotation, rich hand-object interaction, and contact information.

2 papers | 0 benchmarks | 3D, RGB-D

Indigo Mobile

Indigo Mobile is a public dataset of copy detection patterns (CDP) based on DataMatrix modulation.

2 papers | 0 benchmarks

QDAT Quran Recitation

The QDAT dataset contains 1,500 WAV files along with sound files stored in Excel CSV format. The sound file contains links to the WAV files together with other features: age, gender, the correctness of each of the three recitation rules, and a final label giving the correctness of the whole reading.

2 papers | 0 benchmarks

MVSep

MVSep is a synthetic dataset for the vocal separation task, created by combining random vocal and instrumental samples that are publicly available on the internet. The sourced samples were separated into two sets (vocal-only and instrumental-only) and then randomly mixed together. The mixtures may not always sound like a real melody, but they allow for testing audio separation methods. The Synth MVSep dataset consists of 100 tracks, each with a duration of exactly one minute and a sample rate of 44.1 kHz.

2 papers | 0 benchmarks | Audio
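
The random mixing that the MVSep description outlines might look like the sketch below. It is an assumption-laden illustration, not the dataset's actual pipeline: the function name is hypothetical, audio is represented as plain lists of mono float samples, and mixing is taken to be a sample-wise sum with no gain adjustment.

```python
import random

SAMPLE_RATE_HZ = 44_100  # 44.1 kHz, from the description
TRACK_SECONDS = 60       # each track lasts exactly one minute
SAMPLES_PER_TRACK = SAMPLE_RATE_HZ * TRACK_SECONDS  # samples in one mono track


def mix_random_pair(vocals, instrumentals, rng):
    """Pick one vocal and one instrumental clip at random and sum them.

    Each clip is a list of float samples. zip() stops at the shorter
    clip, so the mixture is as long as the shorter input.
    """
    vocal = rng.choice(vocals)
    instrumental = rng.choice(instrumentals)
    return [v + i for v, i in zip(vocal, instrumental)]


if __name__ == "__main__":
    rng = random.Random(0)
    vocals = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]
    instrumentals = [[0.3, 0.4, 0.5]]
    print(mix_random_pair(vocals, instrumentals, rng))
    print(SAMPLES_PER_TRACK)
```

Because the vocal and the instrumental are drawn independently, the mixture has known ground-truth stems, which is what makes it usable for testing separation methods even when it does not sound like a real song.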

Pothole Mix (Pothole Mix Semantic Segmentation Dataset for Road Damage Detection and Segmentation)

This dataset for the semantic segmentation of potholes and cracks on the road surface was assembled from 5 other datasets already publicly available, plus a very small addition of segmented images on our part. To speed up the labeling operations, we started working with depth cameras to try to automate, to some extent, this extremely time-consuming phase.

2 papers | 8 benchmarks

pinkeggs

We introduce a novel dataset consisting of images depicting pink eggs that have been identified as Pomacea canaliculata eggs, accompanied by corresponding bounding box annotations. The purpose of this dataset is to aid researchers in the analysis of the spread of Pomacea canaliculata species by utilizing deep learning techniques, as well as supporting other investigative pursuits that require visual data pertaining to the eggs of Pomacea canaliculata. It is worth noting, however, that the identity of the eggs in question is not definitively established, as other species within the same taxonomic family have been observed to lay similar-looking eggs in regions of the Americas. Therefore, a crucial prerequisite to any decision regarding the elimination of these eggs would be to establish with certainty whether they are exclusively attributable to invasive Pomacea canaliculata or if other species are also involved. The dataset is available at https://www.kaggle.com/datasets/deeshenzhen/pi

2 papers | 0 benchmarks

CWD30 (Crop Weed Dataset 30 species)

CWD30 comprises over 219,770 high-resolution images of 20 weed species and 10 crop species, encompassing various growth stages, multiple viewing angles, and environmental conditions. The images were collected from diverse agricultural fields across different geographic locations and seasons, ensuring a representative dataset.

2 papers | 0 benchmarks | Images

ATMs fault prediction

The collected dataset consists of multivariate time series (MTS) data from several bank ATMs, along with the annotations that operators made when they performed a maintenance task on any of the machines.

2 papers | 0 benchmarks | Tabular, Time series