395 machine learning datasets
395 dataset results
The MIMIC PERform Testing dataset contains the following physiological signals recorded from 200 critically-ill patients during routine clinical care:
Abstract The Norwegian Endurance Athlete ECG Database contains 12-lead ECG recordings from 28 elite athletes from various sports in Norway. All recordings are 10 seconds resting ECGs recorded with a General Electric (GE) MAC VUE 360 electrocardiograph. All ECGs are interpreted with both the GE Marquette SL12 algorithm (version 23 (v243)) and one cardiologist with training in interpretation of athlete's ECG. The data was collected at the University of Oslo in February and March 2020.
ImDrug is a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It features modularized components including formulation of learning setting and tasks, dataset curation, standardized evaluation, and baseline algorithms. It also provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis.
This collection contains images from 422 non-small cell lung cancer (NSCLC) patients. For these patients pretreatment CT scans, manual delineation by a radiation oncologist of the 3D volume of the gross tumor volume and clinical outcome data are available.
MCSCSet is a large-scale specialist-annotated dataset, designed for the task of Medical-domain Chinese Spelling Correction that contains about 200k samples. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, ii) corresponding misspelled sentences manually annotated by medical specialists.
PulseImpute is a benchmark for Pulsative Physiological Signal Imputation which includes realistic mHealth missingness models, an extensive set of baselines, and clinically-relevant downstream tasks. It contains 440,953 100 Hz 5-minute ECG waveforms from 32,930 patients
The PAX-Ray++ dataset uses pseudo-labeled thorax CTs to enable the segmentation of anatomy in Chest X-Rays. By projecting the CTs to a 2D plane, we gather fine-grained annotated imaages resembling radiographs. It contains 7,377 frontal and lateral view images each with 157 anatomy classes and over 2 million annotated instances.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard’s Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings lets us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
Our primary objective in creating this dataset is to support researchers in the advancement of algorithms for keypoints detection and the pretraining of large models on retinal images using a self-supervised approach. The keypoints in the dataset have been carefully annotated by students from our lab, ensuring meticulous accuracy.
To advance methods for pain assessment, in particular automatic assessment methods, the BioVid Heat Pain Database was collected in a collaboration of the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. In our study, 90 participants were subjected to experimentally induced heat pain in four intensities. To compensate for varying heat pain sensitivities, the stimulation temperatures were adjusted based on the subject-specific pain threshold and pain tolerance. Each of the four pain levels was stimulated 20 times in randomized order. For each stimulus, the maximum temperature was held for 4 seconds. The pauses between the stimuli were randomized between 8-12 seconds. The pain stimulation experiment was conducted twice: once with un-occluded face and once with facial EMG sensors.
Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset <!-- Provide a quick summary of the dataset. -->
The ULS23 test set contains 725 lesions from 284 patients of the Radboudumc and JBZ hospitals in the Netherlands. It is intended to be used to measure the performance of 3D universal lesion segmentation models for Computed Tomography (CT). To prepare the data, radiological reports from both participating institutions where searched using NLP tools identifying patients with measurable target lesions, indicating that these lesions were clinically relevant. A random sample of patients was selected, 56.3% of which were male and with diverse scanner manufacturers. The lesions were annotated in 3D by expert radiologists with over 10 years of experience in reading oncological scans. ULS23 is an open benchmark, and we invite ongoing submissions to advance the development of future ULS models.
The National Lung Screening Trial (NLST) was a randomized controlled trial conducted by the Lung Screening Study group (LSS) and the American College of Radiology Imaging Network (ACRIN) to determine whether screening for lung cancer with low-dose helical computed tomography (CT) reduces mortality from lung cancer in high-risk individuals relative to screening with chest radiography. Approximately 54,000 participants were enrolled between August 2002 and April 2004. Data collection has ended, and information is complete through December 31, 2009. NLST has the ClinicalTrials.gov registration number NCT00047385.
Background: Lung cancer risk classification is an increasingly important area of research as low-dose thoracic CT screening programs have become standard of care for patients at high risk for lung cancer. There is limited availability of large, annotated public databases for the training and testing of algorithms for lung nodule classification.
MedMNIST-C is an open-source data set collection comprising algorithmically generated corruptions applied to the test sets of the MedMNIST collection following the concept of ImageNet-C. To maintain the integrity of the medical data, we have excluded any weather-dependent corruptions (“Snow”, “Frost”, “Fog”). Hence, each data set in the MedMNIST-C collection comprises 16 different corruptions (12 test corruptions and 4 validation corruptions) spanning 5 severity levels. For further information on the corruptions visit the original GitHub repository of ImageNet-C.
Lifespan HCP Release 2.0 includes cross-sectional visit 1 (V1) preprocessed structural and functional imaging data, unprocessed V1 imaging data for all included modalities (structural, high-res hippocampal T2, resting state fMRI, task fMRI, diffusion, and ASL), and non-imaging demographic and behavioral assessment data from 725 HCP-Aging (HCP-A, ages 36-100+) healthy participants (22+ TB of data).
The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological information. This dataset comprises a total of 28.9K images (2.4K × 2 × 3 × 2), which were captured using both low-cost and high-cost microscopes at three different resolutions: 10x, 40x, and 100x, utilizing various cameras. In addition to providing location annotations for each white blood cell (WBC), the dataset includes comprehensive morphological attributes for every WBC, enhancing its utility for research and analysis in the field.
Annotated audio files (separate combined annotation file) of lung sounds as recorded from various vantage points of the chest wall. The annotation includes the sound type (Insipratory: I, Experiatory: E, Wheezes: W, Crackles: C , N:Normal), the diagnosis as decided by a specialist (Asthma, COPD, BRON, heart failure, lung fibrosis, etc.), and the location on the chest wall from which the recording was taken (Posterior: P Lower: L Left: L Right R, UPPER: U, ANTERIOR: A, MIDDLE: M). The audio file names are coded: 1. Filter type; B: BELL 20-200Hz, Diaphragm 100-500 Hz, Extended range 50-500 Hz. 2. Patient number: P1-P112.
The Biomedical Translation Shared Task was first introduced at the First Conference of Machine Translation. The task aims to evaluate systems for the translation of biomedical titles and abstracts from scientific publications. The data includes three language pairs (English ↔ Portuguese, English ↔ Spanish, English ↔ French) and two sub-domains of biological sciences and health sciences.
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.