19,997 machine learning datasets
19,997 dataset results
The beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
A novel benchmark dataset that includes a manually annotated point cloud for over 260 million laser scanning points into 100'000 (approx.) assets from Dublin LiDAR point cloud [12] in 2015. Objects are labelled into 13 classes using hierarchical levels of detail from large (i.e., building, vegetation and ground) to refined (i.e., window, door and tree) elements.
Fakeddit is a novel multimodal dataset for fake news detection consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision.
First-Person Hand Action Benchmark is a collection of RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations.
Humicroedit is a humorous headline dataset. The data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. The authors carefully curated crowdsourced editors to create funny headlines and judges to score a to a total of 15,095 edited headlines, with five judges per headline.
MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) is an extension of the CMU Wilderness Multilingual Speech Dataset, a speech dataset based on recorded readings of the New Testament.
PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world scenes with both wide field-of-view (~1 square kilometer area) and high-resolution details (~gigapixel-level/frame). The scenes may contain 4k head counts with over 100x scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions.
The Paris-Lille-3D is a Benchmark on Point Cloud Classification. The Point Cloud has been labeled entirely by hand with 50 different classes. The dataset consists of around 2km of Mobile Laser System point cloud acquired in two cities in France (Paris and Lille).
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.
A dataset of 4K out-of-domain (OOD) examples for the publicly available dataset from (Schuster et al. 2019). In contrast to existing settings which synthesize OOD examples by holding out a subset of classes, the examples were authored by annotators with apriori instructions to be out-of-domain with respect to the sentences in an existing dataset.
A dataset containing 404,683 shop photos collected from 25 different online retailers and 20,357 street photos, providing a total of 39,479 clothing item matches between street and shop photos.
Contains bounding boxes for object detection, instance segmentation masks, keypoints for pose estimation and depth images for each synthetic scenery as well as images for each individual seat for classification.
SYSU-30k contains 30k categories of persons, which is about 20 times larger than CUHK03 (1.3k categories) and Market1501 (1.5k categories), and 30 times larger than ImageNet (1k categories). SYSU-30k contains 29,606,918 images. Moreover, SYSU-30k provides not only a large platform for the weakly supervised ReID problem but also a more challenging test set that is consistent with the realistic setting for standard evaluation.
HAKE is built upon existing activity datasets and provides human body part level atomic action labels (Part States).
TITAN consists of 700 labeled video-clips (with odometry) captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo. The dataset includes 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes that are organized hierarchically corresponding to atomic, simple/complex-contextual, transportive, and communicative actions.
ToyADMOS dataset is a machine operating sounds dataset of approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds collected with four microphones at a 48kHz sampling rate, prepared by Yuma Koizumi and members in NTT Media Intelligence Laboratories. The ToyADMOS dataset is designed for anomaly detection in machine operating sounds (ADMOS) research. It is designed for three tasks of ADMOS: product inspection (toy car), fault diagnosis for fixed machine (toy conveyor), and fault diagnosis for moving machine (toy train).
A new dataset for the low-resource language as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
A large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer.
Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.
Abstract Causal REasoning (ACRE) is a dataset for the systematic evaluation of current vision systems in causal induction, i.e., identifying unobservable mechanisms that lead to the observable relations among variables.