Datasets

19,997 machine learning datasets


FKD (Football Keywords Dataset)

The Football Keywords Dataset (FKD) is a new Persian keyword spotting dataset collected through crowdsourcing. It contains nearly 31,000 samples in 18 classes.

2 papers | 1 benchmark

Reddit C-SSRS

The C-SSRS dataset contains 500 Reddit posts from the subreddit r/depression. Psychologists labeled these posts on a five-point scale following the guidelines of the Columbia Suicide Severity Rating Scale, whose levels progress according to severity of depression. Because the dataset is clinically verified and labeled, it is suitable for validating the label correction method, especially since it comes from the same mental health domain.

2 papers | 0 benchmarks

AraCOVID19-MFH (Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset)

AraCOVID19-MFH is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. The dataset contains 10,828 Arabic tweets annotated with 10 different labels.

2 papers | 0 benchmarks | Texts

AvaSym

Global symmetry ground truth for the AVA dataset.

2 papers | 0 benchmarks | Images

R2VQ (Recipe-to-Video Questions)

R2VQ is a dataset designed for testing competence-based comprehension of machines over a multimodal recipe collection, which contains text-video aligned recipes.

2 papers | 0 benchmarks | Texts, Videos

DBATES (DataBase of Audio features, Text and visual Expressions in competitive debate Speeches)

DBATES is a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC).

2 papers | 0 benchmarks

Fetoscopy Placenta Data

The fetoscopy placenta dataset is associated with our MICCAI 2020 publication titled “Deep Placental Vessel Segmentation for Fetoscopic Mosaicking”. The dataset contains 483 frames with ground-truth vessel segmentation annotations taken from six different in vivo fetoscopic procedure videos. It also includes six unannotated in vivo continuous fetoscopic video clips (950 frames) with predicted vessel segmentation maps obtained from the leave-one-out cross-validation of our method.

2 papers | 0 benchmarks | Images, Videos

Fusion-DHL

Fusion-DHL is a multimodal sensor dataset with ground-truth positions.

2 papers | 0 benchmarks | Time series

Essay-BR

This repository contains essays written by Brazilian high school students. These essays were graded by human professionals following the criteria of the ENEM exam.

2 papers | 0 benchmarks | Texts

Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes)

Data annotation: The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv.

2 papers | 1 benchmark | Texts

ReactionGIF

ReactionGIF is an affective dataset of 30K tweets which can be used for tasks like induced sentiment prediction and multilabel classification of induced emotions.

2 papers | 0 benchmarks | Images, Texts

EPISURG (a dataset of postoperative MRI for quantitative analysis of resection neurosurgery for refractory epilepsy)

EPISURG is a clinical dataset of $T_1$-weighted magnetic resonance images (MRI) from 430 epileptic patients who underwent resective brain surgery at the National Hospital for Neurology and Neurosurgery (Queen Square, London, United Kingdom) between 1990 and 2018.

2 papers | 0 benchmarks | 3D, Images, MRI, Medical

TEP (Tennessee Eastman Process)

The original paper presented a model of an industrial chemical process, the Tennessee Eastman Process (TEP), and a model-based TEP simulator for data generation. The most widely used benchmark consists of 22 datasets: 21 (Fault 1–21) contain faults and 1 (Fault 0) is fault-free. It is available in a repository. Each dataset has a training part (500 samples) and a testing part (960 samples): the training part contains healthy-state observations, while the testing part begins right after training and contains faults that appear 8 h after the end of the training part. Each dataset has 52 features (observation variables), most of them sampled every 3 min.

2 papers | 3 benchmarks | Time series
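The sample counts and timings quoted for TEP can be related with a little arithmetic. The sketch below is only an illustration: it assumes that every variable is sampled uniformly every 3 minutes (the description says this holds for most variables, not all) and that the 8 h fault delay is measured from the start of the testing part. The function and constant names are hypothetical and do not come from the benchmark.

```python
# Illustrative arithmetic for the TEP benchmark description.
# Assumption: uniform 3-minute sampling for every variable.
SAMPLING_MINUTES = 3
TRAIN_SAMPLES = 500
TEST_SAMPLES = 960
FAULT_ONSET_HOURS = 8  # assumed to be counted from the start of the testing part


def duration_hours(n_samples, step_minutes=SAMPLING_MINUTES):
    """Wall-clock duration covered by n_samples at a fixed sampling step."""
    return n_samples * step_minutes / 60


def fault_onset_index(onset_hours=FAULT_ONSET_HOURS, step_minutes=SAMPLING_MINUTES):
    """Number of testing samples recorded before the fault appears."""
    return onset_hours * 60 // step_minutes


train_hours = duration_hours(TRAIN_SAMPLES)
test_hours = duration_hours(TEST_SAMPLES)
onset = fault_onset_index()

print(f"training part: {train_hours} h")
print(f"testing part:  {test_hours} h")
print(f"fault-free testing samples before onset: {onset}")
```

Under these assumptions the training part spans 25 h, the testing part spans 48 h, and the first 160 testing samples precede the fault.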

POINTREC

POINTREC is a test collection for point of interest (POI) recommendation, comprising (i) a set of information needs, (ii) a dataset of POIs, and (iii) graded relevance assessments for information need and POI pairs.

2 papers | 0 benchmarks | Texts

SkyCam

The SkyCam dataset is a collection of sky images from a variety of locations with diverse topological characteristics (Swiss Jura, Plateau and Pre-Alps regions), captured with both single and stereo camera settings coupled with high-accuracy pyranometers. The dataset was collected at high frequency, with one data sample every 10 seconds. Thirteen images with different exposure times are generated, along with a post-processed HDR image and solar radiance values, for each of the cameras and locations. We hope that the SkyCam dataset will enable researchers to tackle the problem of short-term local camera-based solar radiance prediction.

2 papers | 0 benchmarks

RISEdb (Robust Indoor Localization in Complex Scenarios (RISE) database)

The RISE (Robust Indoor Localization in Complex Scenarios) dataset is meant to train and evaluate visual indoor place recognizers. It contains more than 1 million geo-referenced images spread over 30 sequences, covering 5 heterogeneous buildings. For each building we provide:
- A high-resolution 3D point cloud (1 cm) that defines the localization reference frame and that was generated with a mobile laser scanner and an inertial system.
- Several image sequences spread over time with accurate ground-truth poses retrieved by the laser scanner. Each sequence contains both stereo pairs and spherical images.
- Geo-referenced smartphone data, retrieved from the standard sensors of such devices.

2 papers | 0 benchmarks | 3D, Images, LiDAR, Videos

SynthDerm

SynthDerm is a synthetically generated dataset inspired by the real-world characteristics of melanoma skin lesions in dermatology settings. These characteristics include whether the lesion is asymmetrical, has an irregular or jagged border, is unevenly colored, has a diameter greater than 0.25 inches, or is evolving in size, shape, or color over time. These qualities are usually referred to as the ABCDE of melanoma. We generate SynthDerm algorithmically by varying several factors: skin tone, lesion shape, lesion size, lesion location (vertical and horizontal), and whether surgical markings are present. We randomly assign one of the following to the lesion shape: round, asymmetrical, with jagged borders, or multi-colored (two different shades of colors overlaid with salt-and-pepper noise). For skin tone values, we simulate Fitzpatrick ratings. The Fitzpatrick scale is a commonly used approach to classify the skin by its reaction to sunlight exposure modulated by the density of melanin pigment...

2 papers | 0 benchmarks

Instantiation Dataset

Instantiation is a dataset for the task of instantiation detection.

2 papers | 0 benchmarks | Texts

UW-IS (UW Indoor Scenes)

UW-IS (UW Indoor Scenes) is a dataset for object recognition in indoor environments comprising scene images from two different environments, namely, a living room and a mock warehouse.

2 papers | 0 benchmarks | Images

Data Collected with Package Delivery Quadcopter Drone

This experiment was performed to empirically measure the energy use of small, electric Unmanned Aerial Vehicles (UAVs). We autonomously directed a DJI® Matrice 100 (M100) drone to take off, carry a range of payload weights on a triangular flight pattern, and land. Between flights, we varied specified parameters over a set of discrete options: payload of 0 g, 250 g, and 500 g; altitude during cruise of 25 m, 50 m, 75 m, and 100 m; and speed during cruise of 4 m/s, 6 m/s, 8 m/s, 10 m/s, and 12 m/s.

2 papers | 2 benchmarks | Tabular, Time series
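The discrete flight parameters listed for the quadcopter experiment define a small parameter grid. The sketch below enumerates that grid to show its size. The description does not say that every combination was actually flown, so the full Cartesian product is an assumption, and the variable names are hypothetical.

```python
from itertools import product

# Discrete options quoted in the dataset description.
payloads_g = [0, 250, 500]
altitudes_m = [25, 50, 75, 100]
speeds_m_s = [4, 6, 8, 10, 12]

# Assumption: every payload/altitude/speed combination is a candidate flight.
flight_grid = list(product(payloads_g, altitudes_m, speeds_m_s))

print(f"{len(flight_grid)} candidate configurations")
for payload, altitude, speed in flight_grid[:3]:
    print(f"payload={payload} g, altitude={altitude} m, speed={speed} m/s")
```

With 3 payloads, 4 altitudes, and 5 speeds, the full grid contains 3 × 4 × 5 = 60 configurations.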
