19,997 machine learning datasets
A benchmark for matching and registration of partial point clouds with time-varying geometry. It is constructed from 1,761 randomly selected sequences from DeformingThings4D.
These are the official datasets for the LHC Olympics 2020 Anomaly Detection Challenge. Each "black box" contains 1M events meant to be representative of actual LHC data. These events may include signal(s) and the challenge consists of finding these signals using the method of your choice. We have uploaded a total of THREE black boxes to be used for the challenge.
This dataset is composed of two collections of heartbeat signals derived from two famous PhysioNet datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The number of samples in both collections is large enough for training a deep neural network.
EEGEyeNet is a dataset and benchmark with the goal of advancing research at the intersection of brain activity and eye movements. It consists of simultaneous electroencephalography (EEG) and eye-tracking (ET) recordings from 356 different subjects, collected under three different experimental paradigms.
A dataset for time-series anomaly detection.
BigDatasetGAN is a dataset for pixel-wise ImageNet segmentation. It consists of large synthetic datasets from BigGAN & VQGAN.
A novel benchmark that features aspects of natural scenes, e.g. a complex 3D object and different lighting conditions, while still providing access to the continuous ground-truth factors.
AKB-48 is a large-scale Articulated object Knowledge Base which consists of 2,037 real-world 3D articulated object models of 48 categories.
The gene-disease associations corpus contains 30,192 titles and abstracts from PubMed articles that have been automatically labelled for genes, diseases and gene-disease associations via distant supervision. The test set comprises 1,000 of these examples. It is common to hold out a random 20% of the examples in the train set as a validation set.
MUGEN is a large-scale video-audio-text dataset collected using the open-sourced platform game CoinRun. MUGEN can help progress research in many tasks in multimodal understanding and generation.
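The random 20% validation hold-out described above can be sketched as follows. This is a minimal illustration, not the corpus authors' own procedure: the function name `split_train_validation`, the fixed seed, and the integer placeholder IDs are all hypothetical, and the train-set size assumes that every example outside the 1,000-example test set belongs to the train set.

```python
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and hold out a fraction for validation.

    Hypothetical helper for illustration; returns (train, validation).
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(examples)      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder IDs standing in for real examples (assumption: the train set
# is the 30,192-example corpus minus the 1,000 test examples).
train_ids = list(range(30192 - 1000))
train, val = split_train_validation(train_ids)
```

Fixing the seed keeps the validation set stable across runs, which matters when comparing models trained at different times.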
We propose a new, scalable video-mining pipeline which transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired video and captions, using the Conceptual Captions3M image dataset as a seed dataset. Our resulting dataset VideoCC3M consists of millions of weakly paired clips with text captions and will be released publicly.
Fig-QA consists of 10,256 examples of human-written creative metaphors that are paired as a Winograd schema. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.
MCoNaLa is a multilingual dataset to benchmark code generation from natural language commands extending beyond English. Modeled on the methodology of the English Code/Natural Language Challenge (CoNaLa) dataset, the authors annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
SkillSpan is a dataset for Skill Extraction (SE). SE is an important and widely studied task, useful for gaining insights into labor market dynamics. However, there is a lack of datasets and annotation guidelines; available datasets are few and contain crowd-sourced labels on the span level or labels from a predefined skill inventory. To address this gap, the authors introduce SkillSpan, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans.
A large-scale human image dataset with over 230K samples capturing diverse poses and textures.
The OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) is meant to aid research efforts in the general area of developing, testing and evaluating algorithms for cross-view gait recognition. The Institute of Scientific and Industrial Research (ISIR), Osaka University (OU) has copyright in the collection of gait video and associated data and serves as a distributor of the OU-ISIR Gait Database.
MUStARD++ is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with 9 emotions. It can be used for the task of detecting the emotion in a sarcastic statement.
AnnoMI is a dataset of expert-annotated counselling dialogues. Research on natural language processing approaches to analysing counselling dialogues has seen substantial development in recent years, but access to this area remains extremely limited, due to the lack of publicly available expert-annotated therapy conversations. In this paper, we introduce AnnoMI, the first publicly and freely accessible dataset of professionally transcribed dialogues demonstrating high- and low-quality motivational interviewing (MI), an effective counselling technique, with annotations on key MI aspects by domain experts.
BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. As of May 27, 2022, BindingDB contains 41,296 entries, each with a DOI, containing 2,519,702 binding data for 8,810 protein targets and 1,080,101 small molecules. There are 5,988 protein-ligand crystal structures with BindingDB affinity measurements for proteins with 100% sequence identity, and 11,442 crystal structures allowing proteins to 85% sequence identity. You can also use BindingDB data through the Registry of Open Data on AWS: https://registry.opendata.aws/binding-db. This dataset uses the split from TransformerCPI (doi.org/10.1093/bioinformatics/btaa524).
Nocturne is a 2D, partially observed, driving simulator, built in C++ for speed and exported as a Python library.