19,997 machine learning datasets
Simulation results of time-respecting and time-ignoring horizon of code review network at Microsoft as JSON. For further details, please look at https://github.com/michaeldorner/only-time-will-tell
325 word images intended for font recognition, whose fonts are included in VFR-447 (and VFR-2420).
Automated measurement of fetal head circumference using 2D ultrasound images
This is a subset of Kinetics-400, introduced in "Look, Listen and Learn" by Relja Arandjelovic and Andrew Zisserman.
A version of the CMU Movie Summary Corpus (http://www.cs.cmu.edu/~ark/personas/), which was originally scraped from plot summaries from Wikipedia, with some cleaning and sentences turned into events & sorted into "genres" (via LDA).
mini-ImageNet was proposed in "Matching networks for one-shot learning" for few-shot learning evaluation, in an attempt to have a dataset like ImageNet while requiring fewer resources. Following the statistics of CIFAR-100-LT with an imbalance factor of 100, we construct a long-tailed variant of mini-ImageNet that features all 100 classes and an imbalanced training set with $N_1 = 500$ and $N_K = 5$ images. For evaluation, the validation and test sets are balanced and each contain 10K images, 100 samples for each of the 100 categories.
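The class-size profile described above can be sketched as follows. This is a minimal illustration that assumes the exponential decay commonly used for CIFAR-LT style benchmarks; the description only fixes the endpoints ($N_1 = 500$, $N_K = 5$), so the shape of the curve, the rounding, and the function name `long_tailed_class_sizes` are assumptions.

```python
def long_tailed_class_sizes(num_classes=100, n_max=500, n_min=5):
    """Per-class training counts that decay from n_max to n_min.

    Assumption: exponential decay between the two endpoints; the
    dataset description itself only states N_1 and N_K.
    """
    ratio = n_min / n_max  # inverse of the imbalance factor (here 1/100)
    sizes = []
    for k in range(num_classes):
        exponent = k / (num_classes - 1)  # 0 for the head class, 1 for the tail class
        sizes.append(int(round(n_max * ratio ** exponent)))
    return sizes


sizes = long_tailed_class_sizes()
print(sizes[0], sizes[-1])  # head-class and tail-class counts
```

The first and last entries reproduce the stated head and tail sizes; the intermediate counts depend on the assumed decay and rounding and may differ from the released split.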
ARCENE was obtained by merging three mass-spectrometry datasets to obtain enough training and test data for a benchmark. The original features indicate the abundance of proteins in human sera having a given mass value. Based on those features one must separate cancer patients from healthy patients. We added a number of distractor features called 'probes' having no predictive power. The order of the features and patterns was randomized.
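The idea of mixing uninformative probes into a feature matrix can be sketched as below. This is an illustration only: it assumes each probe is built by permuting the values of a randomly chosen real feature across samples, which is not necessarily how the ARCENE probes were generated, and the helper name `add_probes_and_shuffle` is hypothetical. Only the feature order is shuffled here; the order of the patterns (rows) is left unchanged.

```python
import random


def add_probes_and_shuffle(rows, num_probes, seed=0):
    """Append probe columns to a list-of-rows matrix and shuffle column order.

    Assumption: a probe is a real column with its values permuted across
    samples, which is intended to remove any relation to the labels.
    Returns the new rows and a mask marking which columns are probes.
    """
    rng = random.Random(seed)
    num_real = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(num_real)]
    is_probe = [False] * num_real

    for _ in range(num_probes):
        source = list(columns[rng.randrange(num_real)])  # copy of a real feature
        rng.shuffle(source)                              # permute across samples
        columns.append(source)
        is_probe.append(True)

    order = list(range(len(columns)))
    rng.shuffle(order)                                   # randomize feature order
    columns = [columns[j] for j in order]
    is_probe = [is_probe[j] for j in order]

    new_rows = [[col[i] for col in columns] for i in range(len(rows))]
    return new_rows, is_probe


toy = [[1, 2], [3, 4], [5, 6]]  # 3 samples, 2 real features
mixed, probe_mask = add_probes_and_shuffle(toy, num_probes=3)
```

Keeping the probe mask makes it possible to check afterwards how many probes a feature-selection method picked up.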
MultiSV is a corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can be readily used also for experiments with dereverberation, denoising, and speech enhancement.
CAR contains visual attributes for objects in the Cityscapes dataset. For each object in an image, we have a list of attributes that depend on the category of the object. For instance, a vehicle category has a visibility attribute while a pedestrian has an activity attribute (walking, standing, etc.).
The dataset consists of biomedical articles describing randomized controlled trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effects of aspirin as compared to placebo on the duration of headaches.
CytoImageNet is a large-scale pretraining dataset of microscopy images (890K, 894 classes). In the paper, CytoImageNet pretraining yielded features competitive with and different from ImageNet pretrained features on downstream microscopy tasks.
An object-centric dataset consisting of 52 RGB sequences of cars.
The MusicBrainz20K dataset for entity resolution and entity clustering is based on real records about songs from the MusicBrainz database. Each record is described with the following attributes: artist, title, album, year and length. The records have been modified with the DAPO [1] data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. It contains duplicates for 50% of the original records in two to five sources which are generated with a high degree of corruption to stress-test the entity resolution and clustering approaches.
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
This repository contains processed data and result files for the paper "Revealing drivers and risks for power grid frequency stability with explainable AI".
The MultiviewC dataset mainly contributes to multiview cattle action recognition, 3D object detection and tracking. We build a novel synthetic dataset, MultiviewC, through UE4 based on a real cattle video dataset offered by CISRO. The format of our dataset has been adjusted on the basis of MultiviewX for set-up, annotation and file structure.
Vision-based Fallen Person (VFP290K) is a novel, large-scale dataset for the detection of fallen persons composed of fallen person images collected in various real-world scenarios. VFP290K consists of 294,714 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations.
A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.
The dataset contains patches of facial reflectance as described in the paper, namely the diffuse albedo, diffuse normals, specular albedo, specular normals, as well as the shape in UV space. For the shape, reconstructed meshes have been registered to a common topology and the XYZ values of the points have been mapped to the RGB in UV coordinates and interpolated to complete the UV map. From the complete UV maps of 6144x4096 pixels, patches of 512x512 pixels have been sampled. The dataset contains 7500 such patches (1500 of each datatype) that are anonymized, randomized and sampled so that they do not contain identifiable features.
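The patch sampling described above can be sketched as follows. This is a minimal illustration assuming uniformly random, possibly overlapping patch positions and assuming that 6144 is the map width and 4096 its height; the description does not state the sampling scheme, and the helper names `sample_patch_origins` and `crop` are hypothetical.

```python
import random


def sample_patch_origins(map_width, map_height, patch_size, count, seed=0):
    """Draw top-left corners so every patch lies fully inside the map.

    Assumption: positions are uniform and patches may overlap.
    """
    rng = random.Random(seed)
    origins = []
    for _ in range(count):
        x = rng.randint(0, map_width - patch_size)   # randint is inclusive on both ends
        y = rng.randint(0, map_height - patch_size)
        origins.append((x, y))
    return origins


def crop(uv_map, x, y, patch_size):
    """Crop a square patch from an image stored as a list of rows."""
    return [row[x:x + patch_size] for row in uv_map[y:y + patch_size]]


# 1500 patch positions for one data type (e.g. diffuse albedo).
origins = sample_patch_origins(6144, 4096, 512, count=1500)

# Tiny stand-in map (8 wide, 6 high) to show the cropping step.
toy_map = [[r * 8 + c for c in range(8)] for r in range(6)]
toy_patch = crop(toy_map, x=2, y=1, patch_size=3)
```

Reusing the same origins across the five data types would keep the reflectance and shape patches aligned; whether the dataset does this is not stated in the description.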