Datasets

395 machine learning datasets

MoNuSeg

The dataset for this challenge was obtained by carefully annotating tissue images of several patients who had tumors of different organs and were diagnosed at multiple hospitals. It was created by downloading H&E stained tissue images captured at 40x magnification from the TCGA archive. H&E staining is a routine protocol to enhance the contrast of a tissue section and is commonly used for tumor assessment (grading, staging, etc.). Given the diversity of nuclei appearances across multiple organs and patients, and the richness of staining protocols adopted at multiple hospitals, the training dataset is intended to enable the development of robust and generalizable nuclei segmentation techniques that work right out of the box.

17 papers | 7 benchmarks | Images, Medical

SUN-SEG-Easy (Unseen)

The SUN-SEG dataset is a high-quality, per-frame annotated video polyp segmentation (VPS) dataset that includes 158,690 frames from the SUN dataset. It extends the labels with diverse types: object mask, boundary, scribble, polygon, and visual attribute. It also carries over the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.

17 papers | 7 benchmarks | Medical, RGB Video, Videos

SUN-SEG-Hard (Unseen)

17 papers | 7 benchmarks | Medical, RGB Video, Videos

QUILT-1M

Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dat

17 papers | 0 benchmarks | Images, Medical, Texts

IDRiD (Indian Diabetic Retinopathy Image Dataset)

The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at the pixel level. It also provides information on the severity of diabetic retinopathy and diabetic macular edema for each image. The dataset is well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.

16 papers | 6 benchmarks | Biomedical, Images, Medical

SKM-TEA (Stanford Knee MRI with Multi-Task Evaluation)

The SKM-TEA dataset pairs raw quantitative knee MRI (qMRI) data, image data, and dense labels of tissues and pathology for end-to-end exploration and evaluation of the MR imaging pipeline. This 1.6TB dataset consists of raw-data measurements of ~25,000 slices (155 patients) of anonymized patient knee MRI scans, the corresponding scanner-generated DICOM images, manual segmentations of four tissues, and bounding box annotations for sixteen clinically relevant pathologies.

16 papers | 0 benchmarks | Images, MRI, Medical

ISIC 2017 Task 1

The ISIC 2017 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The Task 1 challenge dataset for lesion segmentation contains 2,000 training images with ground truth segmentations (2,000 binary mask images).

15 papers | 0 benchmarks | Images, Medical

MSK

The MSK dataset is a dataset for lesion recognition from the Memorial Sloan-Kettering Cancer Center. It is used as part of the ISIC lesion recognition challenges.

15 papers | 0 benchmarks | Images, Medical

MedVidQA (Medical Video Question Answering)

The MedVidQA dataset contains a collection of 3,010 manually created health-related questions, with timestamps as visual answers to those questions, drawn from trusted video sources such as accredited medical schools with an established reputation, health institutes, health education, and medical practitioners.

15 papers | 0 benchmarks | Medical, Texts, Videos

REFUGE Challenge (Retinal Fundus Glaucoma Challenge)

The REFUGE Challenge provides a dataset of 1,200 fundus images with ground truth segmentations and clinical glaucoma labels, which its description presents as the largest existing dataset of its kind.

14 papers | 4 benchmarks | Images, Medical

SynthRAD2023

Purpose: Medical imaging has become increasingly important in diagnosing and treating oncological patients, particularly in radiotherapy. Recent advances in synthetic computed tomography (sCT) generation have increased interest in public challenges to provide data and evaluation metrics for comparing different approaches openly. This paper describes a dataset of brain and pelvis computed tomography (CT) images with rigidly registered cone-beam CT (CBCT) and magnetic resonance imaging (MRI) images to facilitate the development and evaluation of sCT generation for radiotherapy planning.

14 papers | 0 benchmarks | 3D, Images, Medical

MMPD (Multi-Domain Mobile Video Physiology Dataset)

The Multi-domain Mobile Video Physiology Dataset (MMPD) comprises 11 hours (1,152K frames) of mobile phone recordings of 33 subjects. The dataset was designed to capture videos with greater representation across skin tone, body motion, and lighting conditions. MMPD includes eight descriptive labels and can be used in conjunction with the rPPG-toolbox and PhysBench. It is widely used for rPPG tasks and remote heart rate estimation. To access the dataset, download the data release agreement and request access by email.

14 papers | 0 benchmarks | Images, Medical, Time series, Videos

BraTS 2016

BraTS 2016 is a brain tumor segmentation dataset. It shares the same training set as BraTS 2015, which consists of 220 high-grade glioma (HGG) and 54 low-grade glioma (LGG) cases. Its testing dataset consists of 191 cases with unknown grades. Image Source: https://sites.google.com/site/braintumorsegmentation/home/brats_2016

13 papers | 0 benchmarks | Images, MRI, Medical

BIOMRC

A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018).

13 papers | 2 benchmarks | Medical, Texts

BrixIA (BrixIA Covid-19)

BrixIA Covid-19 is a large dataset of chest X-ray (CXR) images comprising all images taken for both triage and patient monitoring in sub-intensive and intensive care units during one month of the pandemic peak (March 4 to April 4, 2020) at the ASST Spedali Civili di Brescia. It contains all the variability originating from a real clinical scenario. It includes 4,707 CXR images of COVID-19 subjects, acquired with both CR and DX modalities, in AP or PA projection, and retrieved from the facility RIS-PACS system.

13 papers | 0 benchmarks | Images, Medical

CBIS-DDSM (Curated Breast Imaging Subset of Digital Database for Screening Mammography)

CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database, along with ground truth validation, makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data, are also included. A manuscript describing how to use this dataset in detail is available at https://www.nature.com/articles/sdata2017177.

13 papers | 1 benchmark | Medical

GUE (Genome Understanding Evaluation)

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. The tasks are promoter prediction, core promoter prediction, splice site prediction, COVID variant classification, epigenetic marks prediction, and transcription factor binding site prediction on human and on mouse.

13 papers | 6 benchmarks | Medical, Texts

