Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)
Clear filter

1,019 dataset results

ViLCo (ViLCo-Bench)

We propose the first standardized benchmark in multimodal continual learning for video data, defining protocols for training and metrics for evaluation. This standardized framework allows researchers to effectively compare models, driving advancements in AI systems that can continuously learn from diverse data sources.

1 paper · 0 benchmarks · Images, Texts, Videos

A Ball Collision Dataset (ABCD)

A Ball Collision Dataset (ABCD) serves as a comprehensive benchmark for investigating the interaction dynamics of moving objects within 3D environments. It includes multimodal recordings of ball trajectories, captured under various conditions, including different elevation angles, flight lengths, and speeds. This dataset contains raw event, RGB, and IMU data collected from an FPGA-based drone, and 3D motion capture data of the drone (static) and a moving ball.

1 paper · 1 benchmark · Images, Videos

Reactive Diffusion Policy-Dataset (Dataset of Reactive Diffusion Policy)

Two versions of the dataset are offered: one is the full dataset used to train the models in our paper, and the other is a mini dataset for easier examination. Both versions include raw and postprocessed subsets of peeling, wiping, and lifting. The raw videos of the tactile dataset used to generate the PCA embedding are also provided.

1 paper · 0 benchmarks · Actions, Images, Videos

TVPReid (Text-to-Video Person Re-identification)

The TVPReid dataset contains 6,559 pedestrian videos, each of which is annotated with two text descriptions, for a total of 13,118 descriptions. The sentence descriptions are in a natural language style and contain rich details about the pedestrian's appearance, actions, and environmental elements that the pedestrian interacts with. The average sentence length of the TVPReid dataset is 30 words, and the longest sentence contains 83 words.

1 paper · 0 benchmarks · Texts, Videos

ViDAS

ViDAS comprises 100 videos with varying danger levels (rated on a scale of 0-10) across different scenarios, annotated by 18 human annotators using our annotation pipeline to capture human danger perception, together with a Vision-Language-Model summary for each video, serving as a benchmark for testing LLMs' danger perception.

1 paper · 0 benchmarks · Texts, Videos

DARai (Daily Activity Recordings for AI and ML applications)

Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3

1 paper · 0 benchmarks · Biomedical, Environment, Images, LiDAR, RGB-D, Time series, Videos
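DARai's three-level annotation scheme maps naturally onto a nested dictionary, with shared-action statistics like the 22.7% figure above derived by counting how many distinct L2 actions appear under more than one L1 activity. A toy sketch (the activity and action names below are invented for illustration, not taken from DARai):

```python
from collections import Counter

# Hypothetical L1 -> L2 -> [L3] hierarchy in the spirit of DARai's scheme.
hierarchy = {
    "cooking":  {"reach": ["extend_arm", "open_hand"], "stir": ["grip", "rotate"]},
    "cleaning": {"reach": ["extend_arm", "open_hand"], "wipe": ["grip", "sweep"]},
}

def shared_l2_fraction(h):
    """Fraction of distinct L2 actions that occur under more than one L1 activity."""
    counts = Counter(a for actions in h.values() for a in actions)
    shared = sum(1 for c in counts.values() if c > 1)
    return shared / len(counts)

# "reach" appears under both L1 activities; "stir" and "wipe" do not.
print(shared_l2_fraction(hierarchy))  # 1 of 3 distinct L2 actions is shared
```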

Watch Your Mouth: Point Clouds based Speech Recognition Dataset

The Watch Your Mouth dataset is a custom silent speech dataset consisting of depth-only recordings of users silently mouthing full English sentences, captured using consumer-grade depth cameras such as the iPhone TrueDepth sensor. Sentences were carefully curated to cover diverse visemic and phonetic patterns, supporting the development of models capable of generalizing across varied speech content. Each sentence-level utterance provides a temporally aligned depth sequence and corresponding ground truth text. Please see more details in the paper Watch Your Mouth: Silent Speech Recognition with Depth Sensing.

1 paper · 0 benchmarks · Point cloud, Speech, Videos

Remote Flash LiDAR Vehicles Dataset

This dataset includes 3D point-cloud and 2D imagery from a flash LiDAR...

1 paper · 6 benchmarks · 3D, Images, LiDAR, Point cloud, Videos

SOMPT22 (Surveillance Oriented Multi-Pedestrian Tracking Dataset)

SOMPT22 is a multi-object tracking (MOT) benchmark focused on surveillance-style pedestrian tracking.

1 paper · 0 benchmarks · Images, Tracking, Videos

MLB (Mouse Lockbox Dataset)

This dataset provides high-resolution videos recorded from three perspectives with more than 110 hours of total playtime showing mice solving complex tasks. We provide frame-level action labels that reflect a mouse's actions (in proximity to, touch, bite, lock, unlock, touch reward) with lockbox mechanisms (lever, stick, ball, sliding door) for 13% of the data.

1 paper · 0 benchmarks · Videos

BAH (Behavioural Ambivalence/Hesitancy)

Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly within real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H.

1 paper · 0 benchmarks · Audio, Texts, Videos

OpenS2V-5M

We create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. To ensure subject-information diversity in our dataset, we (1) segment subjects and build pairing information via cross-video associations and (2) prompt GPT-Image on raw frames to synthesize multi-view representations. The dataset supports both Subject-to-Video and Text-to-Video generation tasks.

1 paper · 0 benchmarks · Images, Texts, Videos

DAVIDE (Depth-Aware VIdeo DEblurring)

The DAVIDE dataset consists of synchronized blurred, depth, and sharp videos. The dataset comprises 90 video sequences divided into 69 for training, 7 for validation, and 14 for testing. The test set includes annotations of seven content attributes categorized by: 1) environment (indoor/outdoor), 2) motion (camera motion/camera and object motion), and 3) scene proximity (close/mid/far). These annotations aim to facilitate further analysis into scenarios where depth information could be more beneficial.

1 paper · 0 benchmarks · RGB-D, Videos

Znaki

The first and only open dataset for Russian fingerspelling, containing 1,593 annotated phrases and over 37,000 HD+ videos.

1 paper · 1 benchmark · Images, Texts, Videos

UniTalk

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniT

1 paper · 0 benchmarks · Audio, Videos

Open-HypermotionX

Open-HypermotionX is a large-scale, high-quality dataset designed for training and evaluating pose-guided human image animation models, with a special focus on complex, dynamic human motions ("hypermotion"), such as flips, spins, and acrobatics.

1 paper · 0 benchmarks · Videos

Aria Scene Datasets

Aria Scenes is a benchmark dataset for future research on photorealistic reconstruction. The dataset includes 12 .vrs files created in diverse indoor and outdoor environments.

1 paper · 0 benchmarks · Videos

monkey doo

The dataset is a monkey doo doo dataset.

1 paper · 0 benchmarks · Images, Videos

Selective Visual Attention Decoding Dataset KU Leuven

Please refer to the Zenodo page for a detailed description: https://zenodo.org/records/15665101

1 paper · 0 benchmarks · EEG, Videos

TED VCR

The TED VCR Video Retrieval Dataset is a multimodal collection derived from publicly available TED Talks. It contains thousands of talks filtered to retain only those with meaningful topic labels, producing a long-tail, multi-label taxonomy. For each talk the dataset provides automatic speech-recognition transcripts, slide- and scene-level OCR text, and frame-level visual captions: three textual channels used in VCR retrieval experiments. The data are split into 80% train, 10% validation, and 10% test while preserving the original topic distribution, leaving 542 talks as a held-out test set. Two ready-to-download archives accompany the release: 4.2 GB of trimmed MP4 videos with metadata and 1.8 GB of pre-computed CLIP and Whisper embeddings, both shared under the non-commercial CC BY-NC-ND 4.0 license.

1 paper · 0 benchmarks · Images, Speech, Texts, Videos
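An 80/10/10 split that preserves the topic distribution, as described above, can be approximated by shuffling and slicing each topic's items separately. A minimal single-label sketch (the `stratified_split` helper and the toy `talks` records are illustrative, not part of the TED VCR release, which is multi-label):

```python
import random

def stratified_split(items, topic_of, seed=0, frac=(0.8, 0.1, 0.1)):
    """Split items into train/val/test while roughly preserving the per-topic
    distribution, by slicing each topic's shuffled items proportionally."""
    rng = random.Random(seed)
    by_topic = {}
    for it in items:
        by_topic.setdefault(topic_of(it), []).append(it)
    train, val, test = [], [], []
    for group in by_topic.values():
        rng.shuffle(group)
        n = len(group)
        a, b = int(n * frac[0]), int(n * (frac[0] + frac[1]))
        train += group[:a]
        val += group[a:b]
        test += group[b:]
    return train, val, test

# 100 toy talks, two topics, 50 each.
talks = [{"id": i, "topic": "tech" if i % 2 else "science"} for i in range(100)]
tr, va, te = stratified_split(talks, lambda t: t["topic"])
print(len(tr), len(va), len(te))  # 80 10 10
```

For genuinely multi-label taxonomies, iterative stratification is the usual refinement; this per-topic slicing only handles a single primary label per item.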
Page 48 of 51