Datasets

71 machine learning datasets

PROTEINS

PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart.

371 papers | 6 benchmarks | Biology, Graphs
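
The edge rule in the PROTEINS description (connect two nodes when they are less than 6 Angstroms apart) can be illustrated with a minimal sketch. The coordinates below are made up, and the function name `contact_edges` is hypothetical; how the dataset authors assign a position to each amino acid is not stated here, so this only shows the thresholding step.

```python
import math
from itertools import combinations


def contact_edges(coords, cutoff=6.0):
    """Return index pairs (i, j) with i < j whose points are strictly
    less than `cutoff` Angstroms apart (Euclidean distance)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) < cutoff
    ]


# Illustrative 3D positions (in Angstroms), not taken from the dataset.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
edges = contact_edges(coords)
print(edges)
```

Only the first two points are within the cutoff (they are 5 Angstroms apart), so a single edge is produced.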

COVID-19 Image Data Collection

The collection contains hundreds of frontal-view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource for developing and evaluating tools to aid in the treatment of COVID-19.

35 papers | 1 benchmark | Biology, Biomedical, Images, Medical

MHIST (Minimalist Histopathology image analysis dataset)

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image’s annotator agreement level. As a minimalist dataset, MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using approximately 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, the authors use MHIST to study natural questions that arise in histopathology image classification, such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance.

28 papers | 1 benchmark | Biology, Images

Yeast

The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.

18 papers | 0 benchmarks | Biology, Graphs

LIVECell (Label-free In Vitro image Examples of Cells)

The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.

18 papers | 10 benchmarks | Biology, Biomedical, Images

FLIP (Fitness Landscape Inference for Proteins)

FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict a protein's fitness from its sequence. Different representations of protein sequences (e.g. learned embeddings from large language models) may prove helpful here.

11 papers | 0 benchmarks | Biology

MIPE (Improving Paratope and Epitope Prediction by Multi-Modal Contrastive Learning and Interaction Informativeness Estimation)

From the publicly accessible Structural Antibody Database (SAbDab), we collected a total of 7571 antibody-antigen complexes, with the sequence data in FASTA format and structural data in PDB format. Following previous studies [Pittala and Bailey-Kellogg, 2020], we used CD-HIT [Li and Godzik, 2006] to remove high-homology antibody and antigen sequences with thresholds of 95% and 90% sequence identity, respectively. Subsequently, we excluded antibodies and antigens containing any residue type other than the 20 naturally occurring types. Finally, we compiled a dataset consisting of 626 binding antibody-antigen pairs, including their sequences, structures, and corresponding interaction maps. Notably, antibodies primarily bind to antigens through their CDR regions. Most researchers use Euclidean distance to define paratopes and epitopes, and we follow the usual way in our dataset: within the CDR regions/antigen, a residue is labeled as a paratope/epitope if the Euclidean distance bet

7 papers | 4 benchmarks | Biology, Biomedical

2D HeLa

2D HeLa is a dataset of fluorescence microscopy images of HeLa cells stained with various organelle-specific fluorescent dyes. The images cover 10 organelles: DNA (Nuclei), ER (Endoplasmic reticulum), Giantin (cis/medial Golgi), GPP130 (cis Golgi), Lamp2 (Lysosomes), Mitochondria, Nucleolin (Nucleoli), Actin, TfR (Endosomes), and Tubulin. The purpose of the dataset is to train a computer program to automatically identify sub-cellular organelles.

6 papers | 0 benchmarks | Biology, Images

RxRx1

RxRx1 is a biological dataset designed specifically for the systematic study of batch effect correction methods. The dataset consists of 125,510 high-resolution fluorescence microscopy images of human cells under 1,138 genetic perturbations in 51 experimental batches across 4 cell types.

6 papers | 0 benchmarks | Biology, Images

CBC (Complete Blood Count)

The complete blood count (CBC) dataset contains 360 blood smear images along with their annotation files, split into training, testing, and validation sets. The training folder contains 300 images with annotations. The testing and validation folders each contain 60 images with annotations. We modified the original dataset to prepare this CBC dataset: some of the original annotation files list far fewer red blood cells (RBCs) than the image actually contains, and one annotation file does not include any RBCs at all even though the smear image contains them. We therefore cleaned up all the faulty files and split the dataset into three parts. Of the 360 smear images, 300 with annotations are used as the training set, and the remaining 60 with annotations are used as the testing set. Because data is scarce, the validation set is a 60-image subset of the training set, with annotations.

5 papers | 0 benchmarks | Biology, Biomedical, Images, Medical
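
The CBC split described above (300 training images, 60 testing images, and a 60-image validation set drawn from the training set) can be sketched as follows. This is an illustration under assumptions: the description does not say how images were assigned to each part, so the random shuffle, the seed, the integer image IDs, and the function name `split_cbc` are all hypothetical.

```python
import random


def split_cbc(image_ids, n_train=300, n_val=60, seed=0):
    """Split IDs into train/test, then draw the validation set
    from the training set (so validation overlaps with training)."""
    rng = random.Random(seed)  # assumed: random assignment, fixed seed
    ids = list(image_ids)
    rng.shuffle(ids)
    train, test = ids[:n_train], ids[n_train:]
    val = rng.sample(train, n_val)  # subset of train, no duplicates
    return train, val, test


# Placeholder IDs standing in for the 360 smear images.
train, val, test = split_cbc(range(360))
print(len(train), len(val), len(test))
```

Because the validation images are also training images, validation scores from such a split are not an independent estimate of performance; only the 60 testing images are held out.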

NucMM

NucMM is a dataset for segmenting 3D cell nuclei from microscopy image volumes that pushes the task forward to the sub-cubic millimeter scale. It consists of two fully annotated volumes: one electron microscopy (EM) volume containing nearly the entire zebrafish brain with around 170,000 nuclei; and one micro-CT (uCT) volume containing part of a mouse visual cortex with about 7,000 nuclei.

5 papers | 0 benchmarks | Biology

BB-norm-habitat (Bacteria Biotope - entity normalization - bacterial habitat)

In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for habitats. See BB-dataset for more information.

5 papers | 0 benchmarks | Biology, Texts

BB-norm-phenotype (Bacteria Biotope - entity normalization - phenotype)

In the BB-norm modality of this task, participant systems had to normalize textual entity mentions according to the OntoBiotope ontology for phenotypes. See BB-dataset for more information.

5 papers | 0 benchmarks | Biology, Texts

PECAN (Paratope-Epitope Complexes for Antibody Networks)

The PECAN dataset provides structural data for antibody-antigen interactions, specifically curated for paratope and epitope binding site prediction. It includes a diverse set of antibody-antigen complexes, ensuring a well-balanced and representative dataset for training and evaluating deep learning models in protein-protein interaction (PPI) tasks.

5 papers | 4 benchmarks | Biology

CausalBench

CausalBench is a comprehensive benchmark suite for evaluating network inference methods on large-scale perturbational single-cell gene expression data. CausalBench introduces several biologically meaningful performance metrics and operates on two large, curated, and openly available benchmark datasets for evaluating methods that infer gene regulatory networks from single-cell data generated under perturbations. The datasets consist of over 200,000 training samples under interventions.

4 papers | 0 benchmarks | Biology

Multi-Label Classification Dataset Repository

For each dataset we provide a short description as well as some characterization metrics. These include the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels: the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the partition methods used are described in

4 papers | 0 benchmarks | Audio, Biology, Images, Medical, Music, Texts, Videos
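
Some of the characterization metrics above follow directly from their definitions and can be sketched on a toy label matrix. The sketch covers only cardinality, density, and complexity (m × q × d), plus a count of distinct labelsets as the raw ingredient of diversity; avgIR and rDep are omitted because their exact formulas are not given here. The function name `label_stats`, the toy matrix, and the attribute count are assumptions for illustration.

```python
def label_stats(Y, n_attributes):
    """Compute simple multi-label statistics for a binary label
    matrix Y (one row per instance, one column per label)."""
    m = len(Y)      # number of instances
    q = len(Y[0])   # number of labels
    cardinality = sum(sum(row) for row in Y) / m  # mean labels per instance
    density = cardinality / q                     # cardinality / number of labels
    distinct_labelsets = len({tuple(row) for row in Y})
    complexity = m * q * n_attributes             # m x q x d
    return {
        "m": m,
        "q": q,
        "cardinality": cardinality,
        "density": density,
        "distinct_labelsets": distinct_labelsets,
        "complexity": complexity,
    }


# Toy data: 4 instances, 3 labels, and an assumed d = 5 attributes.
Y = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 1]]
stats = label_stats(Y, n_attributes=5)
print(stats)
```

For this toy matrix there are 7 positive labels over 4 instances, so the cardinality is 1.75 and the density is 1.75 / 3.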

BIOSCAN-5M

As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, we present the BIOSCAN-5M Insect dataset to the machine learning community. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and specimen size.

4 papers | 0 benchmarks | Biology, Images

MassSpecGym (MassSpecGym: A benchmark for the discovery and identification of molecules)

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

4 papers | 28 benchmarks | Biology

FOBIE (Focused Open Biological Information Extraction)

The Focused Open Biological Information Extraction (FOBIE) dataset aims to support information extraction (IE) from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

3 papers | 0 benchmarks | Biology, Texts

PWDB (Pulse Wave Database)

This database of simulated arterial pulse waves is designed to be representative of a sample of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects, aged 25 to 75 years (in 10-year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) that are representative of healthy subjects in each age group. It also contains 728 further virtual subjects in each age group, in which each of the cardiovascular properties is varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and the performance of pulse wave analysis algorithms.

3 papers | 0 benchmarks | Biology, Biomedical, Medical, Time series
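
The subject counts in the PWDB description can be cross-checked with a little arithmetic. The sketch below assumes that the baseline set contributes exactly one virtual subject per age group; the description does not say so explicitly, but that reading reproduces the stated total.

```python
# Ages 25 to 75 in 10-year increments give the six age groups.
age_groups = list(range(25, 76, 10))

# Assumption: one baseline subject per age group, plus the 728
# further subjects with varied cardiovascular properties.
subjects_per_group = 1 + 728

total_subjects = len(age_groups) * subjects_per_group
print(age_groups, total_subjects)
```

Under that assumption, 6 × 729 = 4,374, which matches the number of virtual subjects quoted above.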
