19,997 machine learning datasets
19,997 dataset results
The SmartLights benchmark from Snipstests the capability of controlling lights in different rooms. It consists of 1660 requests which are split into five partitions for a 5-fold evaluation. A sample command could be: “please change the [bedroom] lights to [red]” or “i’d like the [living room] lights to be at [twelve] percent”
smac+ defense armored scenario with parallel episodic buffer
smac+ defense outnumbered scenario with parallel episodic buffer
smac+ offensive hard scenario with 20 parallel episodic buffer.
smac+ offensive scenario with 20 parallel episodic buffer.
Nearly 10,000 km² of free high-resolution and paired multi-temporal low-resolution satellite imagery of unique locations which ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities.
A dataset for fine-grained entity typing of knowledge graph entities built from Freebase. It can be used to evaluate entity representations and also mention-level entity typing.
LightStage is a multi-view dataset, which is proposed in NeuralBody. This dataset captures multiple dynamic human videos using a multi-camera system that has 20+ synchronized cameras. The humans perform complex motions, including twirling, Taichi, arm swings, warmup, punching, and kicking. We provide the SMPL-X parameters recovered with EasyMocap, which contain the motions of body, hand, and face.
HANNA, a large annotated dataset of Human-ANnotated NArratives for Automatic Story Generation (ASG) evaluation, has been designed for the benchmarking of automatic metrics for ASG. HANNA contains 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each prompt is linked to a human story and to 10 stories generated by different ASG systems. Each story was annotated on six human criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity) by three raters. HANNA also contains the scores produced by 72 automatic metrics on each story.
The goal of REFUGE2 challenge is to evaluate and compare automated algorithms for glaucoma detection and optic disc/cup segmentation on a standard dataset of retinal fundus images. We invite the medical image analysis community to participate by developing and testing existing and novel automated classification and segmentation methods.
The dataset is maintained by VISION AND IMAGE PROCESSING LAB, University of Waterloo. The images of the dataset were extracted from the public databases DermIS and DermQuest, along with manual segmentations of the lesions.
The NCT-CRC-HE-100K dataset is a set of 100,000 non-overlapping image patches extracted from 86 H$\&$E stained human cancer tissue slides and normal tissue from the NCT biobank (National Center for Tumor Diseases) and the UMM pathology archive (University Medical Center Mannheim). While the dataset Colorectal Cacner-Validation-Histology-7K (CRC-VAL-HE-7K) consist of 7180 images extracted from 50 patients with colorectal adenocarcinoma and were used to create a dataset that does not overlap with patients in the NCT-CRC-HE-100K dataset. It was created by pathologists by manually delineating tissue regions in whole slide images into the following nine tissue classes: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), colorectal adenocarcinoma epithelium (TUM).
CLEVR-Math is a multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. These word problems requires a combination of language, visual and mathematical reasoning.
Europarl-ASR (EN) is a 1300-hour English-language speech and text corpus of parliamentary debates for (streaming) Automatic Speech Recognition training and benchmarking, speech data filtering and speech data verbatimization, based on European Parliament speeches and their official transcripts (1996-2020). Includes dev-test sets for streaming ASR benchmarking, made up of 18 hours of manually revised speeches. The availability of manual non-verbatim and verbatim transcripts for dev-test speeches makes this corpus also useful for the assessment of automatic filtering and verbatimization techniques. The corpus is released under an open licence at https://www.mllp.upv.es/europarl-asr/
Meta Album is a meta-dataset created for few-shot learning, meta-learning, continual learning and so on. Meta Album consists of 40 datasets from 10 unique domains. Datasets are arranged in sets (10 datasets, one dataset from each domain). It is a continuously growing meta-dataset.
EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as crosslingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
Phee is a dataset for pharmacovigilance comprising over 5000 annotated events from medical case reports and biomedical literature. It is designed for biomedical event extraction tasks.
Accurate lesion segmentation is critical in stroke rehabilitation research for the quantification of lesion burden and accurate image processing. Current automated lesion segmentation methods for T1-weighted (T1w) MRIs, commonly used in rehabilitation research, lack accuracy and reliability. Manual segmentation remains the gold standard, but it is time-consuming, subjective, and requires significant neuroanatomical expertise. However, many methods developed with ATLAS v1.2 report low accuracy, are not publicly accessible or are improperly validated, limiting their utility to the field. Here we present ATLAS v2.0 (N=1271), a larger dataset of T1w stroke MRIs and manually segmented lesion masks that includes training (public. n=655), test (masks hidden, n=300), and generalizability (completely hidden, n=316) data. Algorithm development using this larger sample should lead to more robust solutions, and the hidden test and generalizability datasets allow for unbiased performance evaluation
Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by a new multi-view pipeline. It is designed for the three new tasks : i) 3D whole-body pose lifting from 2D complete whole-body pose, ii) 3D whole-body pose lifting from 2D incomplete whole-body pose, iii) 3D whole-body pose estimation from a single RGB image.
Cityscapes-DVPS is derived from Cityscapes-VPS by adding re-computed depth maps from Cityscapes dataset. Cityscapes-DVPS is distributed under Creative Commons Attribution-NonCommercial-ShareAlike license.