19,997 machine learning datasets
HS-SOD is a hyperspectral salient object detection dataset comprising 60 hyperspectral images, each with its respective ground-truth binary mask and a representative rendered colour image (sRGB).
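A minimal sketch of scoring a predicted saliency map against one of the ground-truth binary masks using intersection-over-union. The NumPy arrays and the IoU metric are assumptions for illustration; the dataset's own evaluation protocol may use different measures.

```python
import numpy as np

def saliency_iou(pred, gt, threshold=0.5):
    """pred: float saliency map in [0, 1]; gt: binary ground-truth mask."""
    pred_bin = pred >= threshold
    gt_bin = gt.astype(bool)
    intersection = np.logical_and(pred_bin, gt_bin).sum()
    union = np.logical_or(pred_bin, gt_bin).sum()
    return intersection / union if union > 0 else 1.0

# Toy example with random data in place of a real image/mask pair.
pred = np.random.rand(64, 64)
gt = (np.random.rand(64, 64) > 0.7).astype(np.uint8)
print(saliency_iou(pred, gt))
```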
Contains 446,684 human-annotated images covering 43 incidents across a variety of scenes.
JSUT Corpus is a free large-scale Japanese speech corpus. A speech corpus that can be shared between academic institutions and commercial companies plays an important role, but no such corpus previously existed for Japanese speech synthesis; JSUT was built to fill this gap.
Logic2Text is a large-scale dataset with 10,753 descriptions involving common logic types, paired with the underlying logical forms. The logical forms exhibit diverse graph structures over a free schema, which poses great challenges to a model's ability to understand the semantics.
Labeled Pedestrian in the Wild (LPW) is a pedestrian detection dataset that contains 2,731 pedestrians in three different scenes, where each annotated identity is captured by two to four cameras. LPW offers a notable scale of 7,694 tracklets with over 590,000 images, as well as clean tracklets. It is distinguished from existing datasets in three aspects: large scale with cleanliness, automatically detected bounding boxes, and far more crowded scenes with a greater age span. The dataset provides a more realistic and challenging benchmark, which facilitates the exploration of more powerful algorithms.
MeGlass is an eyeglass dataset originally designed for evaluating eyeglass face recognition. All the face images are selected and cleaned from MegaFace. Each identity has at least two face images with eyeglasses and two without. It contains 47,817 images of 1,710 different identities.
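A hedged sketch of grouping MeGlass images by identity and eyeglass label before building verification pairs. The metadata file name and its "image_name identity_id glasses_flag" layout are assumptions for illustration, not the dataset's documented format.

```python
from collections import defaultdict

def load_identity_groups(meta_path="meglass_meta.txt"):
    """Return {identity: {"glasses": [...], "no_glasses": [...]}}."""
    groups = defaultdict(lambda: {"glasses": [], "no_glasses": []})
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            image_name, identity, has_glasses = line.split()
            key = "glasses" if has_glasses == "1" else "no_glasses"
            groups[identity][key].append(image_name)
    return groups

if __name__ == "__main__":
    groups = load_identity_groups()
    # Per the description, each identity should have at least two images
    # in each condition (with and without eyeglasses).
    usable = {i: g for i, g in groups.items()
              if len(g["glasses"]) >= 2 and len(g["no_glasses"]) >= 2}
    print(f"{len(usable)} identities with >=2 images per condition")
```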
A large new dataset for multilingual entity linking.
MultiBooked is a dataset for supervised aspect-level sentiment analysis in Basque and Catalan, both of which are under-resourced languages.
An open dataset of real photographs with real noise, captured from identical scenes at varying ISO values. Most images were taken with a Fujifilm X-T1 and an XF18-55mm lens; other photographers are encouraged to contribute images for a more diverse crowdsourced collection.
OpenViDial is a large-scale open-domain dialogue dataset with visual contexts. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.
OPIEC is an Open Information Extraction (OIE) corpus constructed from the entire English Wikipedia, containing more than 341M triples. Each triple in the corpus comes with rich metadata: every token of the subj / rel / obj carries NLP annotations (POS tag, NER tag, ...); the provenance sentence is included (along with its dependency parse and its order within the article), as are the original (golden) links contained in the Wikipedia articles and space / time information.
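A rough sketch of the per-triple record described above, written as Python dataclasses. The field names are illustrative only; the actual OPIEC release ships with its own schema, which may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    text: str
    pos_tag: str                     # part-of-speech annotation
    ner_tag: str                     # named-entity annotation
    wiki_link: Optional[str] = None  # golden Wikipedia link, if any

@dataclass
class Triple:
    subj: List[Token]                # subject tokens with annotations
    rel: List[Token]                 # relation tokens with annotations
    obj: List[Token]                 # object tokens with annotations
    provenance_sentence: str         # sentence the triple was extracted from
    dependency_parse: str            # parse of the provenance sentence
    sentence_index: int              # order of the sentence within the article
    space: Optional[str] = None      # spatial modifier, if present
    time: Optional[str] = None       # temporal modifier, if present
```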
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
PHINC is a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding translations in English. The sentences were translated manually by annotators.
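A hypothetical sketch of reading PHINC as a two-column TSV of (code-mixed sentence, English translation) pairs. The file name and column layout are assumptions; check the official release for the real format.

```python
import csv

def load_phinc(path="phinc.tsv"):
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            code_mixed, english = row[0], row[1]
            pairs.append((code_mixed, english))
    return pairs

pairs = load_phinc()
print(f"Loaded {len(pairs)} sentence pairs")  # expected: 13,738
```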
PolEmo2.0 is a dataset of online consumer reviews from four domains: medicine, hotels, products, and university. It is human-annotated at the level of both full reviews and individual sentences. It comprises over 8,000 reviews, about 85% of which come from the medicine and hotel domains.
Consists of questions made up of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and which uniquely identify a well-known entity, such as those found on Wikipedia.
The RIT-18 dataset was built for the semantic segmentation of remote sensing imagery. It was collected with the Tetracam Micro-MCA6 multispectral imaging sensor flown on-board a DJI-1000 octocopter.
The SAMM Long Videos dataset consists of 147 long videos with 343 macro-expressions and 159 micro-expressions. The dataset is FACS-coded with detailed Action Units.
SEND (Stanford Emotional Narratives Dataset) is a set of rich, multimodal videos of self-paced, unscripted emotional narratives, annotated for emotional valence over time. The complex narratives and naturalistic expressions in this dataset provide a challenging test for contemporary time-series emotion recognition models.
SKU110K-R is a dataset relabeled with oriented bounding boxes based on SKU110K. It is focused on evaluating oriented and densely packed object detection.
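A minimal sketch of the kind of oriented bounding box used to relabel SKU110K-R: a center point, a size, and a rotation angle, converted here to corner coordinates. The exact annotation format of the release may differ; this only illustrates the representation.

```python
import math

def obb_to_corners(cx, cy, w, h, angle_deg):
    """Return the four corner points of a rotated rectangle."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)]:
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

# Example: a 40x20 box centered at (100, 50), rotated 30 degrees.
print(obb_to_corners(100, 50, 40, 20, 30))
```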