71 machine learning datasets
By releasing this dataset, we aim to provide a new testbed for deep-learning-based computer vision techniques. Its main peculiarity is the shift from the domain of "natural images" typical of common benchmark datasets to biological imaging. We anticipate a two-fold advantage: i) fostering research in biomedical fields, for which popular pre-trained models typically perform poorly, and ii) promoting methodological research in deep learning by addressing the peculiar requirements of these images. Possible applications include, but are not limited to, semantic segmentation, object detection, and object counting. The data consist of 283 high-resolution pictures (1600x1200 pixels) of mouse brain slices acquired through a fluorescence microscope. The final goal is to identify and count the neurons highlighted in the pictures by means of a marker, so as to assess the result of a biological experiment. The corresponding ground-truth labels were generated through a hy
CREMP is a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST).
The dataset is designed specifically to solve a range of computer vision problems (2D-3D tracking, posture) faced by biologists while designing behavior studies with animals.
The platelet-em dataset contains two 3D scanning electron microscope (EM) images of human platelets, as well as instance and semantic segmentations of those two image volumes. These data have been reviewed by NIBIB, contain no PII or PHI, and are cleared for public release. All files use a multipage uint16 TIF format. A 3D image with size [Z, X, Y] is saved as Z pages of size [X, Y]. Image voxels are approximately 40x10x10 nm.
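As a rough illustration of what the voxel size implies, the sketch below converts a volume shape into an approximate physical extent. It assumes that the 40 nm figure applies to the Z axis and the 10 nm figures to X and Y, matching the [Z, X, Y] ordering; the example shape is hypothetical and is not taken from the dataset.

```python
# Assumed voxel size in nanometres, ordered (Z, X, Y).
# The mapping of 40 nm to Z is an assumption based on the [Z, X, Y] layout.
VOXEL_NM = (40, 10, 10)


def physical_extent_um(shape_zxy, voxel_nm=VOXEL_NM):
    """Return the approximate physical extent (Z, X, Y) in micrometres."""
    if len(shape_zxy) != len(voxel_nm):
        raise ValueError("shape and voxel size must have the same length")
    return tuple(n * v / 1000 for n, v in zip(shape_zxy, voxel_nm))


# Hypothetical volume: 50 pages of 800 x 800 pixels (not a real dataset shape).
example_shape = (50, 800, 800)
extent = physical_extent_um(example_shape)
print(extent)
```

Because the voxels are anisotropic, any 3D processing that assumes cubic voxels would need to take the axis-dependent scale into account.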
The H01 dataset is a 1.4 petabyte rendering of a small sample of human brain tissue, released by a collaboration between the Lichtman Laboratory at Harvard University and Google. The H01 sample was imaged at 4 nm resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex.
This work was undertaken by members of the Lincoln Centre for Autonomous Systems, University of Lincoln, UK. The four data collection sessions were conducted at three different sites in Lincolnshire, UK and one in Murcia, Spain (see Fig. 1). The sessions were conducted at the beginning and towards the end of the harvesting season in the UK and at the end of the harvest in Spain. The variety of broccoli plants grown in the UK is called Iron Man, whilst the variety grown in Spain is called Titanium. The weather during the UK data capture included a mixture of conditions (sunny, overcast, and rainy), with broccoli varying in maturity from small to large to already harvested, while the conditions for data capture in Spain included strong sunlight and mature plants at the very end of the harvesting season. The tractor was driven through the broccoli field at a slow walking speed, with two rows of broccoli plants being imaged by the RGB-D sensor.
The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and the scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotide differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), and number of singleton haplotypes (C7).
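The column layout described above can be sketched as follows. This is a minimal illustration that assumes the nine columns appear in the order listed; the function names and the example row are hypothetical, and the numeric values are made up.

```python
# Column order as described in the text: two parameters, then seven summaries.
COLUMNS = ["theta", "rho", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]
N_PARAMS = 2


def split_row(row):
    """Split one 9-value row into (parameters, summaries) dictionaries."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    named = dict(zip(COLUMNS, row))
    params = {name: named[name] for name in COLUMNS[:N_PARAMS]}
    summaries = {name: named[name] for name in COLUMNS[N_PARAMS:]}
    return params, summaries


def drop_distractor(summaries):
    """Remove C2, which the description identifies as pure uniform noise."""
    return {name: value for name, value in summaries.items() if name != "C2"}


# Made-up row purely for illustration; not real data.
example_row = [5.0, 2.0, 12, 0.37, 3.1, 0.08, 9, 0.25, 4]
params, summaries = split_row(example_row)
informative = drop_distractor(summaries)
```

Keeping C2 in the inputs can serve as a sanity check: a method that assigns importance to the distractor is probably overfitting.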
ATUE is an antibody study benchmark with four real-world supervised tasks covering therapeutic antibody engineering, B cell analysis, and antibody discovery.
Single cortical neurons as deep artificial neural networks. This dataset contains training and testing subsets of the input/output relationship of a single cortical layer 5 pyramidal cell (L5PC) neuron at 1 ms single-spike temporal resolution. The data are obtained via a simulation that contains all of the currently (2021) known and well-modeled "messy biological details" that relate to the operation of single neurons in the brain.
ProteinGym is a collection of benchmarks for comparing the ability of models to predict the effects of protein mutations. The benchmarks in ProteinGym are divided according to mutation type (substitutions vs. indels), ground truth source (DMS assay vs. clinical annotation), and training regime (zero-shot vs. supervised).
The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological information. This dataset comprises a total of 28.9K images (2.4K × 2 × 3 × 2), which were captured using both low-cost and high-cost microscopes at three different resolutions: 10x, 40x, and 100x, utilizing various cameras. In addition to providing location annotations for each white blood cell (WBC), the dataset includes comprehensive morphological attributes for every WBC, enhancing its utility for research and analysis in the field.
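The image count can be checked with simple arithmetic. The sketch below assumes that "2.4K" means exactly 2,400 base images and that the factors correspond to two microscope types, three resolutions, and a final factor of two that the description does not spell out; these mappings are assumptions.

```python
base_images = 2_400   # "2.4K" taken literally; the true figure is presumably rounded
microscopes = 2       # low-cost and high-cost
resolutions = 3       # 10x, 40x, 100x
other_factor = 2      # not explained in the description (assumption: e.g. cameras)

total = base_images * microscopes * resolutions * other_factor
print(total)
```

With exactly 2,400 base images the product is 28,800, slightly below the stated 28.9K, which suggests the 2.4K figure is rounded down.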
ADORE is a benchmark dataset for machine learning in ecotoxicology, covering acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species, as well as chemical properties and molecular representations. Apart from challenging other researchers to achieve the best model performance across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splits corresponding to each of these challenges, as well as an in-depth characterization and discussion of train-test splitting approaches.
To take advantage of the ever-increasing amount of structural data now available, we also trained Paragraph on a larger dataset. This new dataset was extracted from the Structural Antibody Database (SAbDab, Schneider et al., 2022) on March 31, 2022 and includes 1086 complexes which we divide into train, validation and test sets using a 60-20-20 split. Full details of both datasets are given in the Supplementary Information.
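To make the 60-20-20 split concrete, here is a minimal sketch of one way to partition 1086 items. The source does not say how the split was performed (for example, randomly or by sequence similarity), so the seeded random shuffle and the rounding rule below are assumptions, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1086)
print(len(train), len(val), len(test))
```

Since 1086 is not divisible by five, the set sizes can only approximate the 60-20-20 proportions; with this rounding rule they come out as 652, 217, and 217.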
Results of a high-throughput biological assay measuring the stability of proteins (https://github.com/Rocklin-Lab/cdna-display-proteolysis-pipeline). From the paper: "Here we present cDNA display proteolysis, a method for measuring thermodynamic folding stability for up to 900,000 protein domains in a one-week experiment. From 1.8 million measurements in total, we curated a set of around 776,000 high-quality folding stabilities covering all single amino acid variants and selected double mutants of 331 natural and 148 de novo designed protein domains 40–72 amino acids in length." This assay has some particular limitations, such as the small size of the protein domains (40–72 amino acids) and the particular way stability is measured, which is expected to be correlated with typical thermostability assays. Its advantage is that it contains orders of magnitude more mutants than existing thermostability datasets (typically hundreds to thousands of mutants for tens of different proteins).
TPM values together with cell type annotations that were obtained from Alex Pollen on 15/10/15
The ARPA-E funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets.
VesselGraph is a dataset of whole-brain vessel graphs based on specific imaging protocols. Specifically, vascular graphs are extracted using a refined graph extraction scheme leveraging the volume rendering engine Voreen and provided in an accessible and adaptable form through the OGB and PyTorch Geometric dataloaders.
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89,127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as the ancestor-descendant prediction task proposed in the paper.
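To illustrate what ancestor-descendant reasoning over triples looks like, the sketch below collects all ancestors of an entity by following a hierarchy relation. The triples, entity names, and the relation name `is_a` are hypothetical placeholders; they are not taken from GO21, and this is not the method proposed in the paper.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples; not actual GO21 content.
triples = [
    ("process_c", "is_a", "process_b"),
    ("process_b", "is_a", "process_a"),
    ("gene_x", "participates_in", "process_c"),
]


def ancestors(entity, triples, relation="is_a"):
    """Return every entity reachable from `entity` via `relation` edges."""
    parents = defaultdict(set)
    for head, rel, tail in triples:
        if rel == relation:
            parents[head].add(tail)

    seen = set()
    stack = [entity]
    while stack:  # iterative depth-first traversal up the hierarchy
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(ancestors("process_c", triples))
```

An ancestor-descendant prediction task then asks a model to decide whether a pair such as (`process_c`, `process_a`) is connected in this way, without the explicit traversal.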
This publicly available dataset contains 1613 RGB-D images of field-grown broccoli plants. The dataset also includes the polygon and circle annotations of the broccoli heads.
This data set contains weekly scans of cauliflower and broccoli covering a ten-week growth cycle from transplant to harvest. The data set includes ground-truth physical characteristics of the crop; environmental data collected by a weather station and a soil-sensor network; and scans of the crop performed by an autonomous agricultural robot, which include stereo colour, thermal, and hyperspectral imagery. The crops were planted at Lansdowne Farm, a University of Sydney agricultural research and teaching facility. Lansdowne Farm is located in Cobbitty, a suburb 70 km south-west of Sydney in New South Wales (NSW), Australia. Four 80-metre raised crop beds were prepared with a North-South orientation. Approximately 144 Brassica were planted in each bed. Cauliflower were planted in the first and third beds (from west to east). Broccoli were planted in the second and fourth beds.