Datasets

19,997 machine learning datasets

4DMatch

A benchmark for matching and registration of partial point clouds with time-varying geometry. It is constructed from 1,761 randomly selected sequences from DeformingThings4D.

10 papers | 2 benchmarks

LHC Olympics 2020 (LHC Olympics 2020 Anomaly Detection Challenge)

These are the official datasets for the LHC Olympics 2020 Anomaly Detection Challenge. Each "black box" contains 1M events meant to be representative of actual LHC data. These events may include signal(s) and the challenge consists of finding these signals using the method of your choice. We have uploaded a total of THREE black boxes to be used for the challenge.

10 papers | 0 benchmarks

ECG Heartbeat Categorization Dataset

This dataset is composed of two collections of heartbeat signals derived from two famous PhysioNet datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The number of samples in both collections is large enough for training a deep neural network.

10 papers | 0 benchmarks | Time series

EEGEyeNet

EEGEyeNet is a dataset and benchmark with the goal of advancing research at the intersection of brain activities and eye movements. It consists of simultaneous Electroencephalography (EEG) and Eye-tracking (ET) recordings from 356 different subjects collected from three different experimental paradigms.

10 papers | 0 benchmarks | EEG

SMD (Server Machine Dataset)

A dataset for time-series anomaly detection.

10 papers | 14 benchmarks

BigDatasetGAN

BigDatasetGAN is a dataset for pixel-wise ImageNet segmentation. It consists of large synthetic datasets from BigGAN & VQGAN.

10 papers | 0 benchmarks

3DIdent

A novel benchmark that features aspects of natural scenes, e.g. a complex 3D object and different lighting conditions, while still providing access to the continuous ground-truth factors.

10 papers | 1 benchmark | Images

AKB-48

AKB-48 is a large-scale Articulated object Knowledge Base which consists of 2,037 real-world 3D articulated object models of 48 categories.

10 papers | 0 benchmarks | 3D

GDA (Gene-Disease Associations Corpus)

The gene-disease associations corpus contains 30,192 titles and abstracts from PubMed articles that have been automatically labelled for genes, diseases and gene-disease associations via distant supervision. The test set comprises 1,000 of these examples. It is common to hold out a random 20% of the examples in the train set as a validation set.

10 papers | 3 benchmarks
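
The 20% validation hold-out mentioned in the GDA description could be sketched as follows. This is only an illustration under stated assumptions: the function name, the seed, and the placeholder example IDs are hypothetical, and the corpus itself is not loaded. The training-set size is derived from the figures above (30,192 examples minus the 1,000 test examples).

```python
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` with a fixed seed and split off a validation set.

    Returns a (train, validation) pair of lists. The helper name and the
    default seed are illustrative, not part of any GDA tooling.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder integer IDs stand in for the training examples (assumption:
# 30,192 total examples minus the 1,000 held out as the test set).
train_pool = list(range(30192 - 1000))
train_ids, validation_ids = split_train_validation(train_pool)
print(len(train_ids), len(validation_ids))
```

Fixing the seed keeps the split reproducible across runs, which matters when results are compared against other work that uses its own random hold-out.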

MUGEN

MUGEN is a large-scale video-audio-text dataset collected using the open-sourced platform game CoinRun. MUGEN can help progress research in many tasks in multimodal understanding and generation.

10 papers | 0 benchmarks | Audio, Texts, Videos

VideoCC3M (Video-Conceptual-Captions)

We propose a new, scalable video-mining pipeline which transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired video and captions, using the Conceptual Captions 3M image dataset as a seed dataset. Our resulting dataset, VideoCC3M, consists of millions of weakly paired clips with text captions and will be released publicly.

10 papers | 0 benchmarks | Texts, Videos

Fig-QA

Fig-QA consists of 10,256 examples of human-written creative metaphors that are paired as a Winograd schema. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.

10 papers | 0 benchmarks | Texts

MCoNaLa (Multilingual CoNaLa)

MCoNaLa is a multilingual dataset to benchmark code generation from natural language commands extending beyond English. Modeled on the methodology of the English Code/Natural Language Challenge (CoNaLa) dataset, the authors annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.

10 papers | 0 benchmarks | Texts

SkillSpan (Hard and Soft Skill Extraction from English Job Postings)

SkillSpan is a dataset for Skill Extraction (SE). SE is an important and widely studied task, useful for gaining insights into labor market dynamics. However, datasets and annotation guidelines are lacking: the available datasets are few and contain crowd-sourced labels on the span level or labels from a predefined skill inventory. To address this gap, the authors introduce SkillSpan, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans.

10 papers | 0 benchmarks | Texts

StyleGAN-Human

A large-scale human image dataset with over 230K samples capturing diverse poses and textures.

10 papers | 0 benchmarks | Images

OUMVLP

The OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) is meant to aid research efforts in the general area of developing, testing and evaluating algorithms for cross-view gait recognition. The Institute of Scientific and Industrial Research (ISIR), Osaka University (OU) has copyright in the collection of gait video and associated data and serves as a distributor of the OU-ISIR Gait Database.

10 papers | 1 benchmark

MUStARD++

MUStARD++ is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with 9 emotions. It can be used for the task of detecting the emotion in a sarcastic statement.

10 papers | 3 benchmarks | Texts

AnnoMI

AnnoMI is a dataset of expert-annotated counselling dialogues. Research on natural language processing approaches to analysing counselling dialogues has seen substantial development in recent years, but access to this area remains extremely limited due to the lack of publicly available expert-annotated therapy conversations. In this paper, we introduce AnnoMI, the first publicly and freely accessible dataset of professionally transcribed dialogues demonstrating high- and low-quality motivational interviewing (MI), an effective counselling technique, with annotations on key MI aspects by domain experts.

10 papers | 0 benchmarks | Texts

BindingDB (The Binding Database)

BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. As of May 27, 2022, BindingDB contains 41,296 entries, each with a DOI, containing 2,519,702 binding data for 8,810 protein targets and 1,080,101 small molecules. There are 5,988 protein-ligand crystal structures with BindingDB affinity measurements for proteins with 100% sequence identity, and 11,442 crystal structures allowing proteins to 85% sequence identity. BindingDB data can also be used through the Registry of Open Data on AWS: https://registry.opendata.aws/binding-db. This dataset uses the split from TransformerCPI (doi.org/10.1093/bioinformatics/btaa524).

10 papers | 1 benchmark

Nocturne

Nocturne is a 2D, partially observed, driving simulator, built in C++ for speed and exported as a Python library.

10 papers | 0 benchmarks
