19,997 machine learning datasets
19,997 dataset results
An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question and ii) avoid questions which can be answered without the video.
KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics, while the corpus can also used for exploration of transfer learning and domain adaptation.
BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.
To study the task of email subject line generation: automatically generating an email subject line from the email body.
Contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio.
Biased Action Recognition (BAR) dataset is a real-world image dataset categorized as six action classes which are biased to distinct places. The authors settle these six action classes by inspecting imSitu, which provides still action images from Google Image Search with action and place labels. In detail, the authors choose action classes where images for each of these candidate actions share common place characteristics. At the same time, the place characteristics of action class candidates should be distinct in order to classify the action only from place attributes. The select pairs are six typical action-place pairs: (Climbing, RockWall), (Diving, Underwater), (Fishing, WaterSurface), (Racing, APavedTrack), (Throwing, PlayingField),and (Vaulting, Sky).
CARRADA is a dataset of synchronized camera and radar recordings with range-angle-Doppler annotations.
The Cars Overhead With Context (COWC) data set is a large set of annotated cars from overhead. It is useful for training a device such as a deep neural network to learn to detect and/or count cars.
A collection of single speaker speech datasets for ten languages. It is composed of short audio clips from LibriVox audiobooks and their aligned texts.
Includes accurate pixel-wise motion masks, egomotion and ground truth depth.
A new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. Collects more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models.
The Fusion 360 Gallery Dataset contains rich 2D and 3D geometry data derived from parametric CAD models. The dataset is produced from designs submitted by users of the CAD package Autodesk Fusion 360 to the Autodesk Online Gallery. The dataset provides valuable data for learning how people design, including sequential CAD design data, designs segmented by modelling operation, and design hierarchy and connectivity data.
Gibson is an opensource perceptual and physics simulator to explore active and real-world perception. The Gibson Environment is used for Real-World Perception Learning.
HoME (Household Multimodal Environment) is a multimodal environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more.
IMDb-Face is large-scale noise-controlled dataset for face recognition research. The dataset contains about 1.7 million faces, 59k identities, which is manually cleaned from 2.0 million raw images. All images are obtained from the IMDb website.
IP102 contains more than 75,000 images belonging to 102 categories, which exhibit a natural long-tailed distribution.
(JHU-CROWD) a crowd counting dataset that contains 4,250 images with 1.11 million annotations. This dataset is collected under a variety of diverse scenarios and environmental conditions. Specifically, the dataset includes several images with weather-based degradations and illumination variations in addition to many distractor images, making it a very challenging dataset. Additionally, the dataset consists of rich annotations at both image-level and head-level.
The MOldavian and ROmanian Dialectal COrpus (MOROCO) is a corpus that contains 33,564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21,719 samples for training, 5,921 samples for validation and another 5,924 samples for testing.
MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A small subset of studies has been annotated with binary pixel masks depicting regions of interests (ground-glass opacifications and consolidations). CT scans were obtained between 1st of March, 2020 and 25th of April, 2020, and provided by municipal hospitals in Moscow, Russia.
Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.