Datasets

19,997 machine learning datasets

IWSLT 2019

The IWSLT 2019 dataset contains source, machine-translated, reference, and post-edited text, which can be used to quantify and evaluate post-editing effort after automatic machine translation (MT).

3 papers · 0 benchmarks · Texts

KenyanFood13

The Kenyan Food Type Dataset (KenyanFood13) is an image classification dataset for Kenyan food. The images are categorized into 13 different labels.

3 papers · 0 benchmarks · Images

Korean HateSpeech Dataset

Presents 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea.

3 papers · 0 benchmarks

LibriVoxDeEn

LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.

3 papers · 0 benchmarks · Speech

LKS (Liver Kidney Stomach)

LKS is a dataset of 684 Liver-Kidney-Stomach immunofluorescence whole slide images (WSIs) used in the investigation of autoimmune liver disease.

3 papers · 0 benchmarks · Medical

LOGO-Net

A large-scale logo image database for logo detection and brand recognition from real-world product images.

3 papers · 0 benchmarks

Long-term visual localization

Long-term visual localization provides benchmark datasets for evaluating 6 DoF pose estimation accuracy under large appearance variations caused by changes in seasonal (summer, winter, spring, etc.) and illumination (dawn, day, sunset, night) conditions. Each dataset consists of a set of reference images, together with their corresponding ground truth poses, and a set of query images.

3 papers · 0 benchmarks

Memeify

A large-scale dataset of memes with captions and class labels. The dataset consists of 1.1 million meme captions from 128 classes.

3 papers · 0 benchmarks

MEx

A multi-sensor, multi-modal dataset built to benchmark Human Activity Recognition (HAR) and multi-modal fusion algorithms. Collection of this dataset was motivated by the need to recognise and evaluate the quality of exercise performance to support patients with Musculoskeletal Disorders (MSD).

3 papers · 0 benchmarks

MitoEM

Contains mitochondria instances.

3 papers · 8 benchmarks

MLM

A new resource to train and evaluate multitask systems on samples in multiple modalities and three languages.

3 papers · 0 benchmarks

MMED

Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News) and 76,516 image posts shared on Flickr, annotated according to 412 real-world events. The dataset was collected to explore two problems: organizing heterogeneous data contributed by professionals and amateurs in different data domains, and transferring event knowledge obtained from one data domain to a heterogeneous data domain, thus summarizing data from different contributors.

3 papers · 0 benchmarks

MMID (Massively Multilingual Image Dataset)

A large-scale multilingual corpus of images, each labeled with the word it represents. The dataset includes approximately 10,000 words in each of 100 languages.

3 papers · 0 benchmarks

MobiBits (Multimodal Mobile Biometric Database)

A novel database comprising representations of five different biometric characteristics, collected in a mobile, unconstrained or semi-constrained setting with three different mobile devices, including characteristics previously unavailable in existing datasets, namely hand images, thermal hand images, and thermal face images, all acquired with a mobile, off-the-shelf device.

3 papers · 0 benchmarks

MOD++

Includes challenging sequences and extensive data stratification in terms of camera and object motion, velocity magnitudes, direction, and rotational speeds.

3 papers · 0 benchmarks

Newspaper Navigator

The largest dataset of extracted visual content from historic newspapers ever produced. The release includes the Newspaper Navigator dataset and a fine-tuned visual content recognition model.

3 papers · 0 benchmarks

ODSQA (Open-Domain Spoken Question Answering)

The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.

3 papers · 0 benchmarks · Audio, Texts

OGTD (Offensive Greek Tweet Dataset)

A manually annotated dataset containing 4,779 posts from Twitter, each labeled as offensive or not offensive.

3 papers · 0 benchmarks · Texts

PARADE

PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge.

3 papers · 0 benchmarks

Pars-ABSA

Pars-ABSA is a manually annotated Persian dataset verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews.

3 papers · 0 benchmarks
