The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset, comprising 11,276 images from 564 patients and 1,043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).
OVQA contains 19,020 medical visual question and answer pairs generated from 2,001 medical images collected from 2,212 EMRs in Orthopedics.
For each dataset we provide a short description as well as several characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average imbalance ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is cardinality divided by the number of labels. Diversity is the proportion of possible labelsets that actually appear in the dataset. The avgIR measures the average degree of imbalance across all labels; the greater the avgIR, the more imbalanced the dataset. Finally, rDep measures the proportion of label pairs that are dependent at 99% confidence. A broader description of all the characterization metrics and the partition methods used is given in
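The metrics above can be illustrated with a minimal NumPy sketch, assuming the labels are given as a binary indicator matrix of shape (m, q). Note that the avgIR here uses a common definition (the ratio of the most frequent label's count to each label's count, averaged over labels), which is an assumption; the exact formula is the one in the referenced paper.

```python
import numpy as np

def multilabel_stats(Y):
    """Characterization metrics for a binary label matrix Y of shape (m, q)."""
    m, q = Y.shape
    card = Y.sum(axis=1).mean()   # Card: average number of labels per instance
    dens = card / q               # Dens: cardinality normalized by label count
    # Div: distinct labelsets observed divided by possible labelsets (2**q)
    labelsets = {tuple(row) for row in Y.astype(int)}
    div = len(labelsets) / (2 ** q)
    # avgIR (assumed definition): most frequent label's count over each
    # label's count, averaged across labels
    freq = Y.sum(axis=0)
    avg_ir = (freq.max() / freq).mean()
    return card, dens, div, avg_ir
```

For example, a toy matrix with m = 4 instances and q = 2 labels yields Card = 1.25, Dens = 0.625, Div = 0.75 (three of the four possible labelsets occur), and avgIR = 1.25.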
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16 hours of labeled medical speech, 1,000 hours of unlabeled medical speech, and 1,200 hours of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms, and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
A whole-body FDG-PET/CT dataset with manually annotated tumor lesions (FDG-PET-CT-Lesions), comprising 1,014 studies from 900 patients.
We designed a baseline wander (BLW) removal benchmark to evaluate various methods using a consistent test set and uniform conditions. The data preprocessing pipeline closely follows [1]. All 105 signals from the QT Database were resampled from 250 Hz to 360 Hz to align with the NSTDB sampling frequency. Heartbeats were extracted using the annotations provided by specialists. During this process, we identified a small number of incorrect annotations for beat start/end points, leading to cases where two consecutive beats were erroneously merged into one. To address this issue, we discarded beats exceeding 512 samples (1422.22 ms) in length. We designated heartbeats from 14 signals, accounting for 13% of the total signals, as the test set. These signals were selected to include two signals from each of the seven datasets comprising the QT Database, ensuring a diverse representation of pathologies in the test set. This setup provides a more robust evaluation.
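The resampling and beat-length filtering steps above can be sketched as follows. This is a minimal illustration, assuming beat boundaries are supplied as (start, end) sample indices; it uses simple linear interpolation as a stand-in for whatever resampling method the actual pipeline employs.

```python
import numpy as np

FS_ORIG, FS_TARGET = 250, 360   # QT Database -> NSTDB sampling rate
MAX_BEAT_SAMPLES = 512          # 512 samples at 360 Hz ~= 1422.22 ms

def resample_signal(sig, fs_in=FS_ORIG, fs_out=FS_TARGET):
    """Resample a 1-D signal via linear interpolation (a simple stand-in
    for Fourier or polyphase resampling)."""
    n_out = int(round(len(sig) * fs_out / fs_in))
    t_in = np.arange(len(sig)) / fs_in
    t_out = np.arange(n_out) / fs_out
    return np.interp(t_out, t_in, sig)

def extract_valid_beats(sig, boundaries, max_len=MAX_BEAT_SAMPLES):
    """Slice beats from annotated (start, end) sample indices, discarding
    beats longer than max_len, which likely stem from merged annotations."""
    return [sig[s:e] for s, e in boundaries if e - s <= max_len]
```

For instance, resampling a 1,000-sample signal from 250 Hz to 360 Hz yields 1,440 samples, and a beat spanning 600 samples would be discarded as a likely annotation error.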
The ORVS dataset has been newly established as a collaboration between the computer science and visual-science departments at the University of Calgary.
The OCTAGON dataset is a set of Optical Coherence Tomography Angiography (OCT-A) images used for segmentation of the Foveal Avascular Zone (FAZ). The dataset includes 144 healthy and 69 diabetic OCT-A images, divided into four groups of 36 and roughly 17 images, respectively: 3x3 superficial, 3x3 deep, 6x6 superficial, and 6x6 deep, where 3x3 and 6x6 denote the scan area and superficial/deep the depth level of the extracted image. The healthy images come from people in six age ranges: 10-19, 20-29, 30-39, 40-49, 50-59, and 60-69 years. Each age range includes 3 different patients, with information on both the left and right eyes of each. Finally, for each eye there are four different images: one 3x3 superficial, one 3x3 deep, one 6x6 superficial, and one 6x6 deep. Each image has two manual labellings of the FAZ by expert clinicians, along with their quantification.
LKS is a dataset of 684 Liver-Kidney-Stomach immunofluorescence whole slide images (WSIs) used in the investigation of autoimmune liver disease.
This prostate MRI segmentation dataset is collected from six different data sources.
The US-4 is a dataset of Ultrasound (US) images. It is a video-based image dataset that contains over 23,000 high-resolution images from four US video sub-datasets, where two sub-datasets are newly collected by experienced doctors for this dataset.
The second Ninapro database includes 40 intact subjects and is thoroughly described in the paper: "Manfredo Atzori, Arjan Gijsberts, Claudio Castellini, Barbara Caputo, Anne-Gabrielle Mittaz Hager, Simone Elsig, Giorgio Giatsidis, Franco Bassetto & Henning Müller. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Scientific Data, 2014" (http://www.nature.com/articles/sdata201453). Please cite this paper for any work related to the Ninapro database. For more information about the database, please also see the paper by Gijsberts et al., 2014 (http://publications.hevs.ch/index.php/publications/show/1629).
The Individual Brain Charting (IBC) project aims to provide a new generation of functional-brain atlases. To map cognitive mechanisms at a fine scale, high-spatial-resolution task-fMRI data are being acquired from a fixed cohort of 12 participants performing many different tasks. These data, free from both inter-subject and inter-site variability, are publicly available as a means to support the investigation of functional segregation and connectivity as well as individual variability, with a view to establishing a better link between brain systems and behavior.
This database includes 25 long-term ECG recordings of human subjects with atrial fibrillation (mostly paroxysmal).
This database of simulated arterial pulse waves is designed to be representative of a sample of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects, aged 25 to 75 years (in 10-year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) representative of healthy subjects at each age. It also contains 728 further virtual subjects in each age group, in which each of the cardiovascular properties is varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and of the performance of pulse wave analysis algorithms.
The “Medico automatic polyp segmentation challenge” aims to develop computer-aided diagnosis systems for automatic polyp segmentation that detect all types of polyps (for example, irregular, small, or flat polyps) with high efficiency and accuracy. The main goal of the challenge is to benchmark semantic segmentation algorithms on a publicly available dataset, emphasizing robustness, speed, and generalization.
This is a video capsule endoscopy dataset for polyp segmentation.
Fetoscopic Placental Vessel Segmentation and Registration (FetReg) is a large-scale multi-centre dataset for developing generalized and robust semantic segmentation and video mosaicking algorithms for the fetal environment, with a focus on creating drift-free mosaics from long-duration fetoscopy videos.
The AxonEM dataset consists of two 30x30x30 um^3 EM image volumes from the human and mouse cortex, respectively. It is used for 3D axon instance segmentation of brain cortical regions. The authors proofread over 18,000 axon instances to provide dense 3D axon instance segmentation, enabling large-scale evaluation of axon reconstruction methods. In addition, the authors densely annotated nine ground-truth subvolumes for training in each data volume.