Datasets

19,997 machine learning datasets

MediaSpeech

MediaSpeech is a media speech dataset (you might have guessed this) built to test the performance of Automated Speech Recognition (ASR) systems. The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing. The dataset contains 10 hours of speech for each language provided. This release contains audio datasets in French, Arabic, Turkish and Spanish, and is part of a larger private dataset.

5 papers | 4 benchmarks | Speech

NFCorpus

NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed.

5 papers | 1 benchmark | Texts

CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

5 papers | 1 benchmark | Texts

Intel Image Classification

This is image data of natural scenes around the world.

5 papers | 3 benchmarks

Vent

The Vent dataset is a large annotated dataset of text, emotions, and social connections. It comprises more than 33 million posts by nearly a million users, together with their social connections. Each post has an associated emotion. There are 705 different emotions, organized into 63 "emotion categories", forming a two-level taxonomy of affects.

5 papers | 0 benchmarks | Graphs, Texts

Public Git Archive

The Public Git Archive is a dataset of 182,014 top-bookmarked Git repositories from GitHub totalling 6 TB. The dataset provides the source code of the projects, the related metadata, and development history.

5 papers | 0 benchmarks

CSRC (Children Speech Recognition Challenge)

CSRC is a collection of data for Children Speech Recognition. The data for this challenge is divided into 3 datasets, referred to as A (Adult speech training set), C1 (Children speech training set) and C2 (Children conversation training set). All datasets combined amount to 400 hours of Mandarin speech data.

5 papers | 0 benchmarks | Speech

NELA-GT-2019

NELA-GT-2019 is an updated version of the NELA-GT-2018 dataset. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity.

5 papers | 0 benchmarks | Texts

20-MAD (20-MAD: Mozilla Apache Dataset)

20-MAD is a dataset linking the commit and issue data of Mozilla and Apache projects. It includes over 20 years of information about 765 projects, 3.4M commits, 2.3M issues, and 17.3M issue comments, and its compressed size is over 6 GB. The data contains all the typical information about source code commits (e.g., lines added and removed, message and commit time) and issues (status, severity, votes, and summary). The issue comments have been pre-processed for natural language processing and sentiment analysis. This includes emoticons and valence and arousal scores.

5 papers | 0 benchmarks

SPHERE

The dataset for the SPHERE challenge is a multimodal activity recognition dataset consisting of accelerometer, RGB-D and environmental data. Accelerometer data is sampled at 20 Hz and given in its raw format. Raw video is not given in order to preserve the anonymity of the participants. Instead, extracted features that relate to the centre of mass and bounding box of the identified persons are provided. Environmental data consists of Passive Infra-Red (PIR) sensor readings, and these are given in raw format.

5 papers | 0 benchmarks

australian (Statlog (Australian Credit Approval) Data Set)

Data Set Information:

5 papers | 1 benchmark

LoLi-Phone

LoLi-Phone is a large-scale low-light image and video dataset for Low-light image enhancement (LLIE). The images and videos are taken by different mobile phones' cameras under diverse illumination conditions.

5 papers | 0 benchmarks | Images, Videos

IoT Inspector

IoT Inspector is a large dataset of labeled network traffic from smart home devices from within real-world home networks. It is used to conduct data-driven smart home research. An open source tool with the same name has been used to collect data from 44,956 smart home devices across 13 categories and 53 vendors.

5 papers | 0 benchmarks

Kvasir-Sessile dataset (Sessile polyps from Kvasir-SEG)

The Kvasir-SEG dataset includes 196 polyps smaller than 10 mm classified as Paris class 1 sessile or Paris class IIa. We have selected it with the help of expert gastroenterologists. We have released this dataset separately as a subset of Kvasir-SEG. We call this subset Kvasir-Sessile.

5 papers | 0 benchmarks | Biomedical, Images, Medical

TSP/HCP Benchmark set

This is a benchmark set for the traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. In particular, it focuses on small instances which prove to be challenging for one or more state-of-the-art TSP algorithms. These instances are based on difficult instances of the Hamiltonian cycle problem (HCP). This includes instances from the literature, specially modified randomly generated instances, and instances arising from the conversion of other difficult problems to HCP.

5 papers | 1 benchmark | Graphs

MIMII DUE

This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions (MIMII DUE). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, pumps, slide rails, and valves. The data for each machine type includes six subsets called "sections", and each section roughly corresponds to a single product. Each section consists of data from two domains, called the source domain and the target domain, with different conditions such as operating speed and environmental noise. This dataset is a subset of the dataset for DCASE 2021 Challenge Task 2, so the dataset is entirely the same as data included in the development dataset and additional training dataset.

5 papers | 0 benchmarks | Audio

Twitter Abusive Behavior

On the Origins of Memes by Means of Fringe Web Communities

LabPics (LabPics Dataset for computer vision for autonomous chemistry labs and medical labs)

Netzschleuder (network catalogue, repository and centrifuge)
