Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

395 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

395 dataset results

MIMIC PERform Testing Dataset

The MIMIC PERform Testing dataset contains the following physiological signals recorded from 200 critically-ill patients during routine clinical care:

2 papers · 10 benchmarks · Biomedical, Medical, Time series

Norwegian Endurance Athlete ECG Database

The Norwegian Endurance Athlete ECG Database contains 12-lead ECG recordings from 28 elite athletes from various sports in Norway. All recordings are 10-second resting ECGs recorded with a General Electric (GE) MAC VUE 360 electrocardiograph. All ECGs are interpreted by both the GE Marquette SL12 algorithm (version 23 (v243)) and one cardiologist trained in interpreting athletes' ECGs. The data was collected at the University of Oslo in February and March 2020.

2 papers · 0 benchmarks · Biomedical, Medical, Time series

ImDrug

ImDrug is a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It features modularized components including formulation of learning setting and tasks, dataset curation, standardized evaluation, and baseline algorithms. It also provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis.

2 papers · 0 benchmarks · Medical

NSCLC-Radiomics

This collection contains images from 422 non-small cell lung cancer (NSCLC) patients. For these patients, pretreatment CT scans, manual delineation of the 3D gross tumor volume by a radiation oncologist, and clinical outcome data are available.

2 papers · 0 benchmarks · Images, Medical

MCSCSet

MCSCSet is a large-scale specialist-annotated dataset, designed for the task of Medical-domain Chinese Spelling Correction that contains about 200k samples. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, ii) corresponding misspelled sentences manually annotated by medical specialists.

2 papers · 0 benchmarks · Medical, Texts

PulseImpute

PulseImpute is a benchmark for Pulsative Physiological Signal Imputation which includes realistic mHealth missingness models, an extensive set of baselines, and clinically-relevant downstream tasks. It contains 440,953 5-minute ECG waveforms sampled at 100 Hz from 32,930 patients.

2 papers · 0 benchmarks · Medical

PAX-Ray++ (Projected Anatomy in X-Ray Dataset ++)

The PAX-Ray++ dataset uses pseudo-labeled thorax CTs to enable the segmentation of anatomy in chest X-rays. By projecting the CTs to a 2D plane, we gather fine-grained annotated images resembling radiographs. It contains 7,377 frontal and lateral view images each with 157 anatomy classes and over 2 million annotated instances.

2 papers · 0 benchmarks · Biomedical, Medical

Drunkard's Dataset

Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside deforming ones in order to establish an anchoring reference. However, this assumption does not hold in certain relevant applications such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard’s Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.

2 papers · 0 benchmarks · 3D, Medical, RGB-D, Videos

MeDAL Retina Dataset

Our primary objective in creating this dataset is to support researchers in advancing algorithms for keypoint detection and in pretraining large models on retinal images using a self-supervised approach. The keypoints in the dataset have been carefully annotated by students from our lab.

2 papers · 0 benchmarks · Images, Medical

BioVid (BioVid Heat Pain Database)

To advance methods for pain assessment, in particular automatic assessment methods, the BioVid Heat Pain Database was collected in a collaboration of the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. In our study, 90 participants were subjected to experimentally induced heat pain in four intensities. To compensate for varying heat pain sensitivities, the stimulation temperatures were adjusted based on the subject-specific pain threshold and pain tolerance. Each of the four pain levels was stimulated 20 times in randomized order. For each stimulus, the maximum temperature was held for 4 seconds. The pauses between the stimuli were randomized between 8-12 seconds. The pain stimulation experiment was conducted twice: once with un-occluded face and once with facial EMG sensors.

2 papers · 0 benchmarks · Biomedical, Medical, Videos

PanCancer Multimodal (HoneyBee)

Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset

2 papers · 0 benchmarks · Images, Medical, Tabular, Texts

The ULS23 Challenge Test Set

The ULS23 test set contains 725 lesions from 284 patients of the Radboudumc and JBZ hospitals in the Netherlands. It is intended to be used to measure the performance of 3D universal lesion segmentation models for Computed Tomography (CT). To prepare the data, radiological reports from both participating institutions were searched using NLP tools to identify patients with measurable target lesions, which indicates that these lesions were clinically relevant. A random sample of patients was selected, 56.3% of whom were male, with scans from diverse scanner manufacturers. The lesions were annotated in 3D by expert radiologists with over 10 years of experience in reading oncological scans. ULS23 is an open benchmark, and we invite ongoing submissions to advance the development of future ULS models.

2 papers · 8 benchmarks · 3D, Images, Medical

National Lung Screening Trial (NLST)

The National Lung Screening Trial (NLST) was a randomized controlled trial conducted by the Lung Screening Study group (LSS) and the American College of Radiology Imaging Network (ACRIN) to determine whether screening for lung cancer with low-dose helical computed tomography (CT) reduces mortality from lung cancer in high-risk individuals relative to screening with chest radiography. Approximately 54,000 participants were enrolled between August 2002 and April 2004. Data collection has ended, and information is complete through December 31, 2009. NLST has the ClinicalTrials.gov registration number NCT00047385.

2 papers · 1 benchmark · 3D, Medical

Duke Lung Nodule Dataset 2024

Background: Lung cancer risk classification is an increasingly important area of research as low-dose thoracic CT screening programs have become standard of care for patients at high risk for lung cancer. There is limited availability of large, annotated public databases for the training and testing of algorithms for lung nodule classification.

2 papers · 1 benchmark · 3D, Biomedical, Images, Medical

MedMNIST-C

MedMNIST-C is an open-source data set collection comprising algorithmically generated corruptions applied to the test sets of the MedMNIST collection following the concept of ImageNet-C. To maintain the integrity of the medical data, we have excluded any weather-dependent corruptions (“Snow”, “Frost”, “Fog”). Hence, each data set in the MedMNIST-C collection comprises 16 different corruptions (12 test corruptions and 4 validation corruptions) spanning 5 severity levels. For further information on the corruptions visit the original GitHub repository of ImageNet-C.

2 papers · 0 benchmarks · Biomedical, Images, Medical

HCP Aging (Lifespan Human Connectome Project Aging)

Lifespan HCP Release 2.0 includes cross-sectional visit 1 (V1) preprocessed structural and functional imaging data, unprocessed V1 imaging data for all included modalities (structural, high-res hippocampal T2, resting state fMRI, task fMRI, diffusion, and ASL), and non-imaging demographic and behavioral assessment data from 725 HCP-Aging (HCP-A, ages 36-100+) healthy participants (22+ TB of data).

2 papers · 2 benchmarks · 3D, Images, Medical, Time series

LeukemiaAttri

The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological information. This dataset comprises a total of 28.9K images (2.4K × 2 × 3 × 2), which were captured using both low-cost and high-cost microscopes at three different resolutions: 10x, 40x, and 100x, utilizing various cameras. In addition to providing location annotations for each white blood cell (WBC), the dataset includes comprehensive morphological attributes for every WBC, enhancing its utility for research and analysis in the field.

2 papers · 6 benchmarks · Biology, Biomedical, Images, Medical

Chest wall lung sound dataset

Annotated audio files (with a separate combined annotation file) of lung sounds recorded from various vantage points on the chest wall. The annotation includes the sound type (Inspiratory: I, Expiratory: E, Wheezes: W, Crackles: C, Normal: N), the diagnosis as decided by a specialist (Asthma, COPD, BRON, heart failure, lung fibrosis, etc.), and the location on the chest wall from which the recording was taken (Posterior: P, Lower: L, Left: L, Right: R, Upper: U, Anterior: A, Middle: M). The audio file names are coded: 1. Filter type; B: Bell 20-200 Hz, Diaphragm 100-500 Hz, Extended range 50-500 Hz. 2. Patient number: P1-P112.

2 papers · 1 benchmark · Audio, Medical, Time series

WMT 2016 Biomedical (WMT 2016 Biomedical Translation Task)

The Biomedical Translation Shared Task was first introduced at the First Conference on Machine Translation. The task aims to evaluate systems for the translation of biomedical titles and abstracts from scientific publications. The data includes three language pairs (English ↔ Portuguese, English ↔ Spanish, English ↔ French) and two sub-domains, biological sciences and health sciences.

1 paper · 0 benchmarks · Medical, Texts

WMT 2014 Medical (WMT 2014 Medical Translation Task)

The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.

1 paper · 0 benchmarks · Medical, Texts