Datasets

71 machine learning datasets

Drosophila Immunity Time-Course Data

The data used for all results in this paper can be found here. This directory contains:

1 paper, 0 benchmarks | Biology, Tabular, Time series

Extended heartSeg

The dataset X of this work is an extension of the heartSeg dataset. Each sample x ∈ X is an RGB image capturing the heart region of Medaka (Oryzias latipes) hatchlings from a constant ventral view. Since the body of Medaka is see-through, noninvasive studies of the internal organs and the whole circulatory system are practicable. A Medaka’s heart contains three parts: the atrium, the ventricle, and the bulbus. The atrium receives deoxygenated blood from the circulatory system and delivers it to the ventricle, which forwards it into the bulbus. The bulbus is the heart’s exit chamber and provides the gill arches with a constant blood flow. The blood flow through these three chambers was captured in 63 short recordings in total (each around 11 seconds at 24 frames per second), from which the single image samples x ∈ X are extracted. The dataset is split into training and test data following the heartSeg dataset with n_train = 565 samples in the training set X_train and n_test = 165

1 paper, 1 benchmark | Biology, Biomedical, Medical, Videos

Marine Microalgae Detection in Microscopy Images

The Marine Microalgae Detection in Microscopy Images dataset contains 937 images, and all objects in these images are annotated. The total number of annotated objects is 4201. The training set contains 537 images and the testing set contains 430 images.

1 paper, 0 benchmarks | Biology, Images

Datasets for automatic acoustic identification of insects (Orthoptera and Cicadidae)

This dataset contains recordings of 32 sound-producing insect species, with a total of 335 files and a length of 57 minutes. The dataset was compiled for training neural networks to automatically identify insect species while comparing adaptive, waveform-based frontends to conventional mel-spectrogram frontends for audio feature extraction. This work will be submitted for publication in the future, and this dataset can be used to replicate the results as well as for other purposes. The scripts for audio processing and the machine learning implementations will be published on GitHub.

1 paper, 0 benchmarks | Audio, Biology, Environment

VISEM-Tracking

VISEM-Tracking is a dataset consisting of 20 video recordings of spermatozoa, each 30 seconds long, with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by domain experts. It is an extension of the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy access and analysis of the data.

1 paper, 0 benchmarks | Biology, Medical, Videos

PS4

A dataset of 18,731 proteins with their PDB code, index of the first residue in their respective DSSP file, their residue sequence and 9-category secondary structure sequence (including polyproline helices).

1 paper, 1 benchmark | Biology

PubChem18 (PubChem 2018)

An open, large-scale dataset for zero-shot drug discovery derived from PubChem. We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database and the largest collection of readily available chemical data. We take assays dated from 2004 to 2018-05. The dataset initially comprises 224,290,250 records of molecule-bioassay activity, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. We find that some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by applying majority voting. Molecule-bioassay pairs with ties are discarded. This step yields our final bioactivity dataset, which features 223,219,241 records of molecule-bioassay activity, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecules range up to CID 132472079. The dataset has 3 di

1 paper, 0 benchmarks | Biology, Texts
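
The majority-voting step described for PubChem18 can be illustrated with a minimal sketch. This is not the authors' code: the record layout `(molecule_id, assay_id, label)`, the label strings `"active"` and `"inactive"`, and the function name `majority_vote` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Reduce (molecule_id, assay_id, label) records to one label per pair.

    Pairs whose two most frequent labels have equal counts (ties) are dropped.
    The record layout and label values are hypothetical.
    """
    votes = defaultdict(Counter)
    for molecule_id, assay_id, label in records:
        votes[(molecule_id, assay_id)][label] += 1

    resolved = {}
    for pair, counts in votes.items():
        ranked = counts.most_common(2)
        # A tie exists when the top two labels share the same count.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        resolved[pair] = ranked[0][0]
    return resolved


# Toy records: pair (1, 10) has a clear majority, pair (2, 10) is tied,
# pair (3, 11) has a single record.
records = [
    (1, 10, "active"), (1, 10, "active"), (1, 10, "inactive"),
    (2, 10, "active"), (2, 10, "inactive"),
    (3, 11, "inactive"),
]
result = majority_vote(records)
```

In this toy input the tied pair `(2, 10)` is discarded, which mirrors how the record count can shrink between the initial and final dataset.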

16s rDNA sequencing of feces from C9orf72 loss of function mice

In one round of sequencing, 5 fecal pellets from 2 pro-inflammatory environments (Harvard BRI/Johns Hopkins) and 2 pro-survival environments (Broad Institute/Jackson Labs) were sequenced at the 16S rDNA locus. In a second round of sequencing, 9 fecal pellets from Harvard BRI, 9 fecal pellets from Broad Institute, 6 fecal pellets from Harvard BRI mice transplanted with Harvard BRI feces, and 6 pellets from Harvard BRI mice transplanted with Broad feces were sequenced at the 16S rDNA locus.

1 paper, 0 benchmarks | Biology

YIM Dataset (Yeast Cells in Microstructures Dataset)

An instance segmentation dataset of yeast cells in microstructures. The dataset includes 493 densely annotated microscopy images. For more information see the paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures".

1 paper, 0 benchmarks | Biology, Images, Medical

Stained mice brain blood vessels. Confocal-LFM

3D confocal stacks with corresponding 2D Light-field microscope images

1 paper, 0 benchmarks | 3D, Biology, Images

ACCT Data Repository (ACCT is a fast and accessible automatic cell counting tool using machine learning for 2D image segmentation)

This dataset is a collection of fluorescent images from mice, compiled to test an automatic cell counting tool that we developed. 62 images viewed from 2 or 3 different fields of view are shown. In brief, the dataset was derived from brain sections of a model for HIV-induced brain injury (HIVgp120tg), which expresses soluble gp120 envelope protein in astrocytes under the control of a modified GFAP promoter. The mice were in a mixed C57BL/6.129/SJL genetic background, and two genotypes of 9-month-old male mice were selected: wild type controls (Resting, n = 3) and transgenic littermates (HIVgp120tg, Activated, n = 3). No randomization was performed. HIVgp120tg mice show, among other hallmarks of human HIV neuropathology, an increase in microglia numbers, which indicates activation of the cells compared to non-transgenic littermate controls.

1 paper, 0 benchmarks | Biology, Biomedical, Images, Medical

FLIP -- AAV, Designed vs mutant (adeno-associated virus)

FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict the fitness of a given protein from its sequence. Different representations of protein sequences (e.g. learned embeddings from large language models) may prove helpful here.

1 paper, 0 benchmarks | Biology
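
As a point of comparison for the sequence representations mentioned in the FLIP description, the sketch below shows a plain one-hot encoding, one of the simplest ways to turn a protein sequence into numeric features. It is an illustrative baseline only, not part of FLIP itself: the 20-letter alphabet ordering and the function name `one_hot_encode` are assumptions.

```python
# The 20 standard amino acids in an arbitrary but fixed order (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 row per residue of the sequence."""
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Raises KeyError for residues outside the 20 standard letters.
        row[AA_INDEX[residue]] = 1
        encoded.append(row)
    return encoded


# Hypothetical three-residue sequence used only to show the output shape.
encoded = one_hot_encode("ACD")
```

Each row has exactly one non-zero entry, so a sequence of length L becomes an L × 20 matrix that could be fed to a regression model predicting the fitness label.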

AI-ready multiplex IHC-IF dataset (AI-ready restained and co-registered multiplex dataset for head-and-neck squamous cell carcinoma)

We introduce a new AI-ready computational pathology dataset containing restained and co-registered digitized images from eight head-and-neck squamous cell carcinoma patients. Specifically, the same tumor sections were stained with the expensive multiplex immunofluorescence (mIF) assay first and then restained with cheaper multiplex immunohistochemistry (mIHC). This is the first public dataset that demonstrates the equivalence of these two staining methods, which in turn allows several use cases; due to the equivalence, our cheaper mIHC staining protocol can offset the need for expensive mIF staining/scanning, which requires highly skilled lab technicians. As opposed to subjective and error-prone immune cell annotations from individual pathologists (disagreement > 50%) to drive SOTA deep learning approaches, this dataset provides objective immune and tumor cell annotations via mIF/mIHC restaining for more reproducible and accurate characterization of tumor immune microenvironment (e.g. for

1 paper, 0 benchmarks | Biology, Images, Medical

Facial Skeletal angles (Facial Skeletal Angles (Glabella and Maxilla Angle and Length and Width of Piriformis))

Facial Skeletal Angles (Glabella and Maxilla Angle and Length and Width of Piriformis)

1 paper, 0 benchmarks | Biology, Medical

CATH 4.2

The CATH (Class, Architecture, Topology, Homology) [65] database is a comprehensive resource for protein structure classification that hierarchically groups proteins based on their structural features. The database defines classes based on topological similarities, architectures based on the arrangement of secondary structure elements, topologies based on the connectivity of secondary structure elements, and homologous domains based on sequence similarity.

1 paper, 2 benchmarks | Biology

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubMedBERT, two transformer-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.

1 paper, 4 benchmarks | Biology, Biomedical, Texts

AUR & UMB dataset (Anticancer Efficacy of Auraptene & Umbelliprenin: In Vitro Viability Dataset)

This dataset contains quantitative data on the anticancer effects of the natural coumarins Auraptene (AUR) and Umbelliprenin (UMB) across 27 studies. The data were collected from published literature reporting the impacts of AUR and UMB treatment on the viability of diverse human cancer cell lines.

1 paper, 1 benchmark | Biology

TCB-DS (Toxigenic Cyanobacteria Dataset)

The TCB-DS dataset is a specialized collection of microscopic images focusing on the automatic recognition of cyanobacteria genera. This dataset was meticulously compiled to address the challenges associated with the varying image qualities due to differences in contrast, resolution, size, lighting, and the presence of noise in the original images. It includes 2,591 images with varying dimensions, ranging from a minimum of 11 × 41 pixels to a maximum of 5184 × 3456 pixels.

1 paper, 0 benchmarks | Biology, Images

uBench (MicroBench)

Microscopy is a cornerstone of biomedical research, enabling detailed study of biological structures at multiple scales. Advances in cryo-electron microscopy, high-throughput fluorescence microscopy, and whole-slide imaging allow the rapid generation of terabytes of image data, which are essential for fields such as cell biology, biomedical research, and pathology. These data span multiple scales, allowing researchers to examine atomic/molecular, subcellular/cellular, and cell/tissue-level structures with high precision. A crucial first step in microscopy analysis is interpreting and reasoning about the significance of image findings. This requires domain expertise and comprehensive knowledge of biology, normal/abnormal states, and the capabilities and limitations of microscopy techniques. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers’ efficiency, identifying new image biomarkers, and accelerating hypothesis ge

1 paper, 0 benchmarks | Biology, Biomedical, Images, Texts

ENSeg

The ENSeg dataset is an enhanced subset of the ENS dataset. The ENS dataset comprises image samples extracted from the enteric nervous system (ENS) of male adult Wistar rats (Rattus norvegicus, albinus variety), specifically from the jejunum, the second segment of the small intestine.

1 paper, 1 benchmark | Biology, Images, Medical
