We propose the first standardized benchmark in multimodal continual learning for video data, defining protocols for training and metrics for evaluation. This standardized framework allows researchers to effectively compare models, driving advancements in AI systems that can continuously learn from diverse data sources.
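As an illustration of the kind of evaluation metrics such a benchmark typically standardizes, the sketch below computes average accuracy and forgetting from a task-by-task accuracy matrix. The matrix layout and metric definitions follow common continual-learning conventions and are assumptions here, not necessarily the exact protocol of this benchmark.

```python
import numpy as np

def continual_learning_metrics(acc: np.ndarray) -> dict:
    """Compute common continual-learning metrics from an accuracy matrix.

    acc[i, j] = accuracy on task j after training on tasks 0..i.
    These are widely used conventions (average accuracy, forgetting);
    the benchmark's own protocol may differ.
    """
    T = acc.shape[0]
    avg_acc = acc[-1].mean()  # accuracy on all tasks after the final training stage
    # Forgetting: best past accuracy on a task minus its accuracy after the last task.
    forgetting = np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)])
    return {"average_accuracy": avg_acc, "forgetting": forgetting}

# Hypothetical 3-task example: rows = training stage, columns = evaluated task.
acc = np.array([[0.80, 0.10, 0.05],
                [0.70, 0.85, 0.12],
                [0.65, 0.78, 0.88]])
print(continual_learning_metrics(acc))
```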
A Ball-Collision Dataset (ABCD) serves as a comprehensive benchmark for investigating the interaction dynamics of moving objects within 3D environments. It includes multimodal recordings of ball trajectories, captured under various conditions, including different elevation angles, flight lengths, and speeds. This dataset contains raw event, RGB, and IMU data collected from an FPGA-based drone and 3D motion capture data of the drone (static) and a moving ball.
Two versions of the dataset are offered: one is the full dataset used to train the models in our paper, and the other is a mini dataset for easier examination. Both versions include raw and postprocessed subsets of peeling, wiping and lifting. The raw videos of the tactile dataset used to generate the PCA embedding are also provided.
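A minimal sketch of how a PCA embedding can be produced from raw tactile video frames, assuming frames are flattened into vectors and projected with scikit-learn; the file path, frame reader, preprocessing, and component count are illustrative and may differ from the paper's pipeline.

```python
import numpy as np
import imageio.v3 as iio
from sklearn.decomposition import PCA

def pca_embed_tactile_video(video_path: str, n_components: int = 2) -> np.ndarray:
    """Project each frame of a tactile video onto its top principal components."""
    # Read all frames and flatten each one into a single feature vector.
    frames = np.stack(list(iio.imiter(video_path)))          # (T, H, W) or (T, H, W, C)
    flat = frames.reshape(frames.shape[0], -1).astype(np.float32)
    flat -= flat.mean(axis=0)                                 # center before PCA
    return PCA(n_components=n_components).fit_transform(flat)  # (T, n_components)

# Hypothetical usage with a raw subset file name:
# embedding = pca_embed_tactile_video("raw/peeling/trial_01.mp4")
```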
The TVPReid dataset contains 6,559 pedestrian videos, each of which is annotated with two text descriptions, for a total of 13,118 descriptions. The sentence descriptions are in a natural language style and contain rich details about the pedestrian's appearance, actions, and environmental elements that the pedestrian interacts with. The average sentence length of the TVPReid dataset is 30 words, and the longest sentence contains 83 words.
100 videos with varying danger levels (on a scale of 0-10) across different scenarios, annotated by 18 human annotators using our annotation pipeline to capture human perception, together with Vision Language Model summaries of each video, serving as a benchmark for testing LLMs' danger perception.
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions.
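A minimal sketch of how the three-level hierarchy could be represented per annotated time span; the field names, example labels, and nesting are assumptions for illustration, not the released DARai annotation schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One annotated time span in a recording (schema is illustrative only)."""
    start_s: float
    end_s: float
    l1_activity: str   # high-level activity, e.g. "making coffee"
    l2_action: str     # action shared across activities, e.g. "pouring"
    l3_procedure: str  # fine-grained execution step, e.g. "tilt kettle over cup"

segments = [
    Segment(12.0, 15.5, "making coffee", "pouring", "tilt kettle over cup"),
    Segment(40.2, 43.0, "watering plants", "pouring", "tilt can over pot"),
]

# Actions such as "pouring" recur under different L1 activities; this kind of
# reuse is what the reported 22.7% cross-activity sharing of L2 labels refers to.
l1_per_l2 = {}
for s in segments:
    l1_per_l2.setdefault(s.l2_action, set()).add(s.l1_activity)
print(l1_per_l2)
```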
The Watch Your Mouth dataset is a custom silent speech dataset consisting of depth-only recordings of users silently mouthing full English sentences, captured using consumer-grade depth cameras such as the iPhone TrueDepth sensor. Sentences were carefully curated to cover diverse visemic and phonetic patterns, supporting the development of models capable of generalizing across varied speech content. Each sentence-level utterance provides a temporally aligned depth sequence and corresponding ground truth text. Please see more details in the paper Watch Your Mouth: Silent Speech Recognition with Depth Sensing.
This dataset includes 3D point-cloud data and 2D imagery from a flash LiDAR...
SOMPT22 is a multi-object tracking (MOT) benchmark focused on surveillance-style pedestrian tracking.
This dataset provides high-resolution videos recorded from three perspectives with more than 110 hours of total playtime showing mice solving complex tasks. We provide frame-level action labels that reflect a mouse's actions (in proximity to, touch, bite, lock, unlock, touch reward) with lockbox mechanisms (lever, stick, ball, sliding door) for 13% of the data.
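A minimal sketch of working with frame-level action labels of this kind, assuming a tabular format with one row per labeled frame; the column names, label strings, and file layout are hypothetical and may not match the released annotations.

```python
import pandas as pd

# Hypothetical frame-level label table: one row per annotated frame.
labels = pd.DataFrame({
    "frame":     [0, 1, 2, 3, 4, 5],
    "action":    ["proximity", "touch", "bite", "unlock", "touch reward", "proximity"],
    "mechanism": ["lever", "lever", "stick", "sliding door", "ball", "ball"],
})

# Count how often each action occurs per lockbox mechanism.
print(labels.groupby(["mechanism", "action"]).size().unstack(fill_value=0))
```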
Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly in real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H.
We create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. To ensure subject-information diversity in our dataset, we (1) segment subjects and build pairing information via cross-video associations and (2) prompt GPT-Image on raw frames to synthesize multi-view representations. The dataset supports both Subject-to-Video and Text-to-Video generation tasks.
The DAVIDE dataset consists of synchronized blurred, depth, and sharp videos. The dataset comprises 90 video sequences divided into 69 for training, 7 for validation, and 14 for testing. The test set includes annotations of seven content attributes categorized by: 1) environment (indoor/outdoor), 2) motion (camera motion/camera and object motion), and 3) scene proximity (close/mid/far). These annotations aim to facilitate further analysis into scenarios where depth information could be more beneficial.
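A minimal sketch of selecting DAVIDE test sequences by their content attributes, assuming the attributes are available as a simple table; the sequence names, column names, and attribute values below are illustrative, not the dataset's actual annotation file.

```python
import pandas as pd

# Illustrative test-set attribute table (names and values are assumptions).
attrs = pd.DataFrame({
    "sequence":    ["test_001", "test_002", "test_003"],
    "environment": ["indoor", "outdoor", "outdoor"],
    "motion":      ["camera", "camera+object", "camera"],
    "proximity":   ["close", "mid", "far"],
})

# Select a subset where depth is expected to matter most, e.g. close-range indoor scenes.
subset = attrs[(attrs.environment == "indoor") & (attrs.proximity == "close")]
print(subset.sequence.tolist())
```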
The first and only open dataset for Russian fingerspelling, containing 1,593 annotated phrases and over 37 thousand HD+ videos.
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk exhibit stronger generalization to unseen real-world data.
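Active speaker detection is commonly scored with average precision over per-frame speaking predictions; the sketch below shows that computation on toy data. The official UniTalk protocol may aggregate differently (e.g. per face track or per video), so this is an assumed, simplified evaluation.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical frame-level ground truth (1 = speaking, 0 = not speaking)
# and the model's speaking score for each face crop.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8])

print(f"AP = {average_precision_score(y_true, y_score):.3f}")
```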
🏃‍♂️ Open-HypermotionX Dataset: Open-HypermotionX is a large-scale, high-quality dataset designed for training and evaluating pose-guided human image animation models, with a special focus on complex, dynamic human motions (hypermotion), such as flips, spins, and acrobatics.
Aria Scenes is a benchmark dataset for future research on photorealistic reconstruction. The dataset includes 12 .vrs files created in diverse indoor and outdoor environments.
The dataset is a monkey doo doo dataset.
Please refer to the Zenodo page for a detailed description: https://zenodo.org/records/15665101
The TED VCR Video Retrieval Dataset is a multimodal collection derived from publicly available TED Talks. It contains thousands of talks filtered to retain only those with meaningful topic labels, producing a long-tail, multi-label taxonomy. For each talk the dataset provides automatic speech-recognition transcripts, slide- and scene-level OCR text, and frame-level visual captions, the three textual channels used in VCR retrieval experiments. The data are split into 80% train, 10% validation, and 10% test while preserving the original topic distribution, leaving 542 talks as a held-out test set. Two ready-to-download archives accompany the release: 4.2 GB of trimmed MP4 videos with metadata and 1.8 GB of pre-computed CLIP and Whisper embeddings, both shared under the non-commercial CC BY-NC-ND 4.0 license.
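A minimal sketch of how the pre-computed embeddings could be used for retrieval, assuming dense query and video vectors compared by cosine similarity; the file names, array shapes, and embedding layout in the commented usage are hypothetical, not the archive's documented format.

```python
import numpy as np

def cosine_retrieve(query_emb: np.ndarray, video_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k videos whose embeddings are most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity to every video
    return np.argsort(-sims)[:k]      # highest-similarity indices first

# Hypothetical usage with the release's pre-computed embeddings:
# clip_embs = np.load("embeddings/clip_video.npy")      # (num_videos, d), assumed layout
# query     = np.load("embeddings/clip_query_0001.npy") # (d,), assumed layout
# top5      = cosine_retrieve(query, clip_embs)
```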