BirdCLEF 2019 is a bird soundscape dataset. It contains around 350 hours of soundscapes recorded with 30 field recorders between January and June 2017 in Ithaca, NY, USA, and manually annotated. In total there are around 50,000 recordings covering 659 classes, with species tags.
BirdCLEF 2018 is a bird soundscape dataset based on the contributions of the Xeno-canto network. The training set contains 36,496 recordings covering 1,500 species of Central and South America (the largest bioacoustic dataset in the literature at the time), totalling about 68 hours of audio, with species tags.
The ISIC 2017 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The Task 2 challenge dataset for lesion dermoscopic feature extraction contains the original lesion image, a corresponding superpixel mask, and superpixel-mapped expert annotations of the presence and absence of the following features: (a) network, (b) negative network, (c) streaks and (d) milia-like cysts.
The Rendered SST2 dataset, released by OpenAI, measures the optical character recognition capability of visual representations. It takes sentences from the Stanford Sentiment Treebank dataset and renders them into images, with black text on a white background, at 448×448 resolution.
Will-They-Won't-They (WT-WT) is a large dataset of English tweets targeted at stance detection for the rumor verification task. The dataset is constructed based on tweets that discuss five recent merger and acquisition (M&A) operations of US companies, mainly from the healthcare sector.
MUStARD is a multimodal video corpus for research in automated sarcasm discovery. The dataset is compiled from popular TV shows, including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. It consists of audiovisual utterances annotated with sarcasm labels; each utterance is accompanied by its context, which provides additional information on the scenario in which it occurs.
VIST-Edit includes 14,905 human-edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, and each machine-generated story is aligned to five human-edited versions.
X-WikiRE is a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations.
The Dataset of Legal Documents consists of court decisions from 2017 and 2018, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven German federal courts: the Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG), and Federal Administrative Court (BVerwG).
Extended Labeled Faces in-the-Wild (ELFW) is a dataset that supplements the semantic labels originally released for the widely used Labeled Faces in-the-Wild (LFW) dataset with additional face-related categories and additional faces. Two object-based data augmentation techniques are also used to synthetically enrich under-represented categories; benchmarking experiments show that segmentation improves not only for the augmented categories but for the remaining ones as well.
KANFace consists of 40K still images and 44K sequences (14.5M video frames in total) captured in unconstrained, real-world conditions from 1,045 subjects. The dataset is manually annotated in terms of identity, exact age, gender and kinship.
The AV Digits Database is an audiovisual database containing normal, whispered, and silent speech. 53 participants were recorded from three different views (frontal, 45°, and profile) pronouncing digits and phrases in three speech modes.
The Unhealthy Comments Corpus (UCC) is a corpus of 44,355 comments intended to assist in research on identifying the subtle attributes that contribute to unhealthy conversations online.
This dataset contains panoramic video captured from a helmet-mounted camera while riding a bike through suburban Northern Virginia.
NQuAD is a Nuclear Question Answering Dataset containing more than 700 nuclear question-answer pairs developed and verified by expert nuclear researchers.
NYT-H is a dataset for distantly-supervised relation extraction, in which distantly-supervised (DS) labelled data is used for training and hired annotators label the test data. NYT-H can serve as a benchmark for distantly-supervised relation extraction.
VideoForensicsHQ is a benchmark dataset for face video forgery detection, providing high-quality visual manipulations. It is one of the first face video manipulation benchmarks that also contains audio, complementing existing datasets along a new, challenging dimension. VideoForensicsHQ offers manipulations at much higher video quality and resolution than other datasets, and its manipulations are provably much harder for humans to detect.
SmartCity consists of 50 images in total, collected from ten city scenes including office entrances, sidewalks, atriums, and shopping malls. Unlike existing crowd counting datasets, whose images contain hundreds or thousands of pedestrians and are nearly all taken outdoors, SmartCity has few pedestrians per image and covers both outdoor and indoor scenes: the average number of pedestrians is only 7.4, with a minimum of 1 and a maximum of 14.
The UFPR-Eyeglasses dataset has 1,135 images of both eyes (2,270 cropped single-eye images) from 83 subjects (166 classes). The dataset is used to evaluate the effect of the occlusion caused by eyeglasses on periocular recognition.
The Circa (meaning ‘approximately’) dataset aims to help machine learning systems interpret indirect answers to polar (yes/no) questions.