19,997 machine learning datasets
The IWSLT 2019 dataset contains source, machine-translated, reference, and post-edited text, which can be used to quantify and evaluate post-editing effort after automatic MT.
The Kenyan Food Type Dataset (KenyanFood13) is an image classification dataset for Kenyan food. The images are categorized into 13 different labels.
Presents 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea.
LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.
LKS is a dataset of 684 Liver-Kidney-Stomach immunofluorescence whole slide images (WSIs) used in the investigation of autoimmune liver disease.
A large-scale logo image database for logo detection and brand recognition from real-world product images.
Long-term visual localization provides benchmark datasets aimed at evaluating 6 DoF pose estimation accuracy over large appearance variations caused by changes in seasonal (summer, winter, spring, etc.) and illumination (dawn, day, sunset, night) conditions. Each dataset consists of a set of reference images, together with their corresponding ground truth poses, and a set of query images.
A large-scale dataset of memes with captions and class labels. The dataset consists of 1.1 million meme captions from 128 classes.
A multi-sensor, multi-modal dataset, implemented to benchmark Human Activity Recognition (HAR) and multi-modal fusion algorithms. Collection of this dataset was inspired by the need to recognise and evaluate the quality of exercise performance to support patients with Musculoskeletal Disorders (MSD).
Contains mitochondria instances.
A new resource to train and evaluate multitask systems on samples in multiple modalities and three languages.
Contains 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News) and 76,516 image posts shared on the Flickr social media platform, all annotated according to 412 real-world events. The dataset was collected to explore two problems: organizing heterogeneous data contributed by professionals and amateurs in different data domains, and transferring event knowledge obtained from one data domain to a heterogeneous data domain, thus summarizing data from different contributors.
A large-scale multilingual corpus of images, each labeled with the word it represents. The dataset includes approximately 10,000 words in each of 100 languages.
A novel database comprising representations of five different biometric characteristics, collected in a mobile, unconstrained or semi-constrained setting with three different mobile devices, including characteristics previously unavailable in existing datasets, namely hand images, thermal hand images, and thermal face images, all acquired with a mobile, off-the-shelf device.
Includes challenging sequences and extensive data stratification in terms of camera and object motion, velocity magnitudes, direction, and rotational speeds.
The largest dataset of extracted visual content from historic newspapers ever produced. Comprises the Newspaper Navigator dataset and a finetuned visual content recognition model.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
A manually annotated dataset containing 4,779 posts from Twitter, each labeled as offensive or not offensive.
PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge.
Pars-ABSA is a manually annotated Persian dataset verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews.