Datasets

19,997 machine learning datasets

CREMI

MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images.

4 papers | 2 benchmarks | 3D, Biomedical

Nottingham

The Nottingham Dataset is a collection of 1200 American and British folk songs.

4 papers | 2 benchmarks | Audio

MESA (Multi-Ethnic Study of Atherosclerosis)

Multi-Ethnic Study of Atherosclerosis (MESA) is an NHLBI-sponsored 6-center collaborative longitudinal investigation of factors associated with the development of subclinical cardiovascular disease and the progression of subclinical to clinical cardiovascular disease.

4 papers | 2 benchmarks

VOT2019

VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB.

4 papers | 4 benchmarks | Images, Tracking, Videos

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model to fill in multiple blanks in a passage from a shared candidate set with distractors designed by English teachers.

4 papers | 3 benchmarks | Texts

THUCNews (THU Chinese Text Classification)

The THUCNews Chinese text dataset is a large-scale Chinese text classification dataset. It contains approximately 840,000 news documents categorized into 14 classes. The dataset was generated by filtering historical data from the Sina News RSS feeds between 2005 and 2011. This dataset can be used for various tasks such as text classification and training word vectors.

4 papers | 0 benchmarks

ACM (Association for Computing Machinery)

The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB, which are divided into three classes (Database, Wireless Communication, Data Mining). A heterogeneous graph is constructed, which comprises 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to elements of a bag-of-words representation of keywords.

4 papers | 2 benchmarks | Graphs

Sydney Urban Objects

This dataset contains a variety of common urban road objects scanned with a Velodyne HDL-64E LIDAR, collected in the CBD of Sydney, Australia. There are 631 individual scans of objects across classes of vehicles, pedestrians, signs and trees.

4 papers | 3 benchmarks | 3D, LiDAR, Point cloud

PhyAAt (Physiology of Auditory Attention)

The dataset contains a collection of physiological signals (EEG, GSR, PPG) obtained from an experiment on auditory attention to natural speech. Ethical approval was acquired for the experiment. Details of the experiment can be found at https://phyaat.github.io/experiment

4 papers | 2 benchmarks | EEG, Time series

SVG-Icons8

A new large-scale dataset along with an open-source library for SVG manipulation.

4 papers | 0 benchmarks

MLFP (Multispectral Latex Mask based Video Face Presentation Attack)

The MLFP dataset consists of face presentation attacks captured with seven 3D latex masks and three 2D print attacks. The dataset contains videos captured from color, thermal and infrared channels.

4 papers | 8 benchmarks | Images

DDRel

DDRel is a dataset for interpersonal relation classification in dyadic dialogues. It consists of 6,300 dyadic dialogue sessions between 694 pairs of speakers with 53,126 utterances in total. It is constructed by crawling movie scripts from IMSDb and annotating the relation labels for each session according to 13 pre-defined relationships.

4 papers | 6 benchmarks | Texts

OC (Drowsiness-Detection)

These images were generated using the UnityEyes simulator, after including essential eyeball physiology elements and modeling binocular vision dynamics. The images are annotated with head pose and gaze direction information, as well as 2D and 3D landmarks of the eye's most important features. Additionally, the images are divided into two classes denoting the status of the eye (Open for open eyes, Closed for closed eyes). This dataset was used to train a DNN model for detecting the drowsiness status of a driver. The dataset contains 1,704 training images, 4,232 testing images and an additional 4,103 images for improvements.

4 papers | 0 benchmarks

PhysioNet Challenge 2020

The data for this Challenge are from multiple sources: the CPSC Database and CPSC-Extra Database; the INCART Database; the PTB and PTB-XL Database; the Georgia 12-lead ECG Challenge (G12EC) Database; and an Undisclosed Database. The first source is the public data (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from CPSC2018 is NOT the test data from CPSC2018; that test data is included in the final private database, which has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-lead ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.

4 papers | 35 benchmarks | Biomedical, Time series

Amazon Fine Foods

Amazon Fine Foods is a dataset that consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review.

4 papers | 0 benchmarks | Graphs

GoodSounds

The GoodSounds dataset contains around 28 hours of recordings of single notes and scales played by 15 different professional musicians, all of them holding a music degree and having some expertise in teaching. 12 different instruments (flute, cello, clarinet, trumpet, violin, alto sax, tenor sax, baritone sax, soprano sax, oboe, piccolo and bass) were recorded using one to four different microphones. For each instrument, the whole set of playable semitones is recorded several times with different tonal characteristics. Each note is recorded into a separate monophonic audio file at 48 kHz and 32 bits. Rich annotations of the recordings are available, including details on the recording environment and ratings of the tonal qualities of the sound (“good-sound”, “bad”, “scale-good”, “scale-bad”).

4 papers | 0 benchmarks | Audio

VIVA (Vision for Intelligent Vehicles and Applications)

The VIVA challenge’s dataset is a multimodal dynamic hand gesture dataset specifically designed with difficult settings of cluttered background, volatile illumination, and frequent occlusion for studying natural human activities in real-world driving settings. This dataset was captured using a Microsoft Kinect device, and contains 885 intensity and depth video sequences of 19 different dynamic hand gestures performed by 8 subjects inside a vehicle.

4 papers | 0 benchmarks | Images

Fraunhofer IPA Bin-Picking

The Fraunhofer IPA Bin-Picking dataset is a large-scale dataset comprising both simulated and real-world scenes for various objects (potentially having symmetries) and is fully annotated with 6D poses. A physics simulation is used to create scenes of many parts in bulk by dropping objects in a random position and orientation above a bin. Additionally, this dataset extends the Siléane dataset by providing more samples. This makes it possible, for example, to train deep neural networks and benchmark their performance on the public Siléane dataset.

4 papers | 0 benchmarks | 6D, Images

TAC 2010

TAC 2010 is a dataset for summarization that consists of 44 topics, each of which is associated with a set of 10 documents. The test dataset is composed of approximately 44 topics, divided into five categories: Accidents and Natural Disasters, Attacks, Health and Safety, Endangered Resources, Investigations and Trials.

4 papers | 0 benchmarks | Texts

Musk v1

The Musk dataset describes a set of molecules, and the objective is to distinguish musks from non-musks. It covers 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 are judged to be non-musks. There are 166 features available that describe the molecules based on their shape.

4 papers | 3 benchmarks | Tabular
