Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

MMPD (Multi-Domain Mobile Video Physiology Dataset)

The Multi-Domain Mobile Video Physiology Dataset (MMPD) comprises 11 hours (1,152K frames) of mobile-phone recordings of 33 subjects. The dataset was designed to capture videos with greater representation across skin tone, body motion, and lighting conditions. MMPD is comprehensively annotated with eight descriptive labels and can be used in conjunction with the rPPG-toolbox and PhysBench. It is widely used for rPPG tasks and remote heart rate estimation; a minimal sketch of the rPPG idea follows this entry. To access the dataset, download the data release agreement and request access by email.

14 papers · 0 benchmarks · Images, Medical, Time series, Videos
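As background for the rPPG tasks MMPD serves, here is a minimal sketch of remote photoplethysmography: recovering a pulse signal from subtle skin-color changes in video. This is a plain-NumPy illustration, not the rPPG-toolbox API; the frame-array layout, frame rate, and heart-rate band are assumptions.

import numpy as np

def estimate_heart_rate(frames: np.ndarray, fps: float) -> float:
    """Estimate beats per minute from a (T, H, W, 3) face-crop video."""
    # Spatially average the green channel, which carries most of the
    # blood-volume pulse signal, giving a 1-D trace over time.
    trace = frames[..., 1].mean(axis=(1, 2)).astype(np.float64)
    trace -= trace.mean()  # remove the DC component

    # Take the dominant frequency inside a plausible heart-rate band.
    spectrum = np.abs(np.fft.rfft(trace))
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 3.0)  # roughly 42 to 180 bpm
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return peak_hz * 60.0

Real pipelines add face tracking, detrending, and band-pass filtering; the green-channel mean is only the simplest usable signal.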

OVBench

OVBench is a benchmark tailored for real-time video understanding.

14 papers · 1 benchmark · Texts, Videos

EgoDexter

The EgoDexter dataset provides both 2D and 3D pose annotations for 4 test video sequences with 3,190 frames. The videos are recorded with a body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1,485 of the 3,190 frames.

13 papers · 0 benchmarks · Images, RGB-D, Videos

EVE (End-to-end Video-based Eye-tracking)

EVE (End-to-end Video-based Eye-tracking) is a dataset for eye-tracking. It was collected from 54 participants and consists of 4 camera views, over 12 million frames, and 1,327 unique visual stimuli (images, video, text), adding up to approximately 105 hours of video data in total.

13 papers · 0 benchmarks · Videos

WSVD (Web Stereo Video Dataset)

The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. The dataset covers a wide variety of scene types and features many nonrigid objects.

13 papers · 0 benchmarks · Stereo, Videos

TED-talks

To create the TED-talks dataset, 3,035 YouTube videos were downloaded using the "TED talks" search query. From these initial candidates, videos were selected in which the upper part of a person is visible for at least 64 frames and the person's bounding box is at least 384 pixels tall. Static videos, as well as videos in which the person is doing something other than presenting, were manually filtered out.

13 papers · 8 benchmarks · Videos

BrnoCompSpeed

The dataset contains 21 full-HD videos, each around one hour long, captured at six different locations. Vehicles in the videos (20,865 instances in total) are annotated with precise speed measurements from optical gates using LiDAR and verified against several reference GPS tracks. The dataset is available for download and contains the videos and metadata (calibration, lengths of features in the image, annotations, and so on) for comparison and evaluation; the geometry behind calibrated speed measurement is sketched after this entry.

13 papers · 8 benchmarks · Videos
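To make the role of the calibration metadata concrete, here is a minimal sketch of the geometry behind video-based speed measurement: project image detections onto the road plane and divide distance by elapsed time. The image-to-road-plane homography H (in metres) and the 50 fps default are illustrative assumptions, not the dataset's actual calibration format.

import numpy as np

def to_road_plane(H: np.ndarray, point_xy) -> np.ndarray:
    """Map an image point to metric road-plane coordinates via homography H."""
    p = H @ np.array([point_xy[0], point_xy[1], 1.0])
    return p[:2] / p[2]

def speed_kmh(H, pt_first, pt_last, frame_first, frame_last, fps=50.0):
    """Average speed of a vehicle between two detections in the same video."""
    dist_m = np.linalg.norm(to_road_plane(H, pt_last) - to_road_plane(H, pt_first))
    elapsed_s = (frame_last - frame_first) / fps
    return dist_m / elapsed_s * 3.6  # convert m/s to km/h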

Inter4K

A video dataset for benchmarking upsampling methods. Inter4K contains 1,000 ultra-high-resolution videos at 60 frames per second (fps) collected from online resources. The dataset provides standardized video resolutions at ultra-high definition (UHD/4K), quad-high definition (QHD/2K), full-high definition (FHD/1080p), (standard) high definition (HD/720p), one quarter of full HD (qHD/540p), and one ninth of full HD (nHD/360p), with frame rates of 60, 50, 30, 24, and 15 fps at each resolution. Based on this standardization, both super-resolution and frame-interpolation tests can be performed at different scaling factors ($\times 2$, $\times 3$ and $\times 4$); a sketch of this paired-data construction follows this entry. Inter4K provides standardized UHD resolution and 60 fps for all of its 1,000 diverse 5-second videos, with differences between scenes originating from the equipment (e.g., professional 4K cameras or phones), lighting conditions, and other sources of variation.

13 papers · 0 benchmarks · Videos
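Below is a sketch of the paired-data construction that Inter4K's standardization enables: spatially downscaling a UHD frame to build a super-resolution test pair, and holding out every other frame to build interpolation triplets. Frame handling uses OpenCV-style arrays; the exact preprocessing in published evaluations may differ.

import cv2
import numpy as np

def make_sr_pair(frame_uhd: np.ndarray, scale: int = 2):
    """Return a (low-res input, high-res target) pair for a x`scale` SR test."""
    h, w = frame_uhd.shape[:2]
    low = cv2.resize(frame_uhd, (w // scale, h // scale),
                     interpolation=cv2.INTER_AREA)
    return low, frame_uhd

def interpolation_triplets(frames_60fps):
    """Yield (previous, middle, next) triplets from a 60 fps clip: the outer
    pair forms a 30 fps input and the held-out middle frame is the target."""
    for i in range(0, len(frames_60fps) - 2, 2):
        yield frames_60fps[i], frames_60fps[i + 1], frames_60fps[i + 2]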

GEN1 Detection (Prophesee GEN1 Automotive Detection Dataset)

Prophesee's GEN1 Automotive Detection Dataset is the largest event-based dataset released to date.

13 papers · 10 benchmarks · Videos

UnAV-100

Existing audio-visual event localization (AVE) work handles manually trimmed videos, each containing only a single event instance. This setting is unrealistic, as natural videos often contain numerous audio-visual events of different categories. UnAV-100 instead targets dense localization of audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. It is the first Untrimmed Audio-Visual dataset, containing 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video has 2.8 audio-visual events on average, and the events are often related to each other and may co-occur, as in real-life scenes. With its realistic complexity, UnAV-100 is intended to promote comprehensive audio-visual video understanding; a sketch of the temporal-IoU matching typically used to score this task follows this entry.

13 papers · 2 benchmarks · Audio, Videos
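Dense audio-visual event localization is typically scored by matching predicted (category, start, end) segments to ground truth by temporal IoU. The sketch below shows that matching criterion; the 0.5 threshold and the exact mAP aggregation used for UnAV-100 are assumptions here, not the benchmark's published protocol.

def temporal_iou(a, b):
    """IoU of two (start_sec, end_sec) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, gt, iou_thresh=0.5):
    """pred and gt are (category, start_sec, end_sec) tuples."""
    return pred[0] == gt[0] and temporal_iou(pred[1:], gt[1:]) >= iou_thresh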

DNA-Rendering

DNA-Rendering is a large-scale, high-fidelity repository of human performance data for neural actor rendering. It contains over 1,500 human subjects, 5,000 motion sequences, and 67.5M frames of data. Across this collection, subjects span broad categories of pose actions, body shapes, clothing, accessories, hairstyles, and object interaction, covering geometry and appearance variation from everyday life to professional occasions. The dataset also provides rich assets for each subject: 2D/3D human body keypoints, foreground masks, SMPL-X models, cloth/accessory materials, multi-view images, and videos. These assets boost current methods' accuracy on downstream rendering tasks. Finally, the data were captured with a professional multi-view system of 60 synchronized cameras with up to 4096×3000 resolution at 15 fps, using strict camera calibration, ensuring high-quality resources for task training and evaluation.

13 papers · 0 benchmarks · Images, Videos

Touch and Go

This dataset encompasses a diverse range of tactile features that are instrumental in distinguishing various material properties. Three downstream tasks are considered: 1) material categorization, 2) distinguishing hard from soft surfaces, and 3) distinguishing smooth from textured surfaces.

13 papers · 0 benchmarks · Images, Videos

HIC (Hands in Action)

The Hands in Action (HIC) dataset contains RGB-D sequences of hands interacting with objects.

12 papers · 0 benchmarks · Images, RGB-D, Videos

VOT2014 (Visual Object Tracking Challenge 2014)

The dataset comprises 25 short sequences showing various objects against challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool using a methodology based on clustering visual features of object and background, so that the 25 sequences evenly sample the pool.

12 papers · 2 benchmarks · Images, Tracking, Videos

DramaQA

DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model the local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of varying length, with each QA pair belonging to one of four difficulty levels.

12 papers · 1 benchmark · Videos

RareAct

RareAct is a video dataset of unusual actions, including actions like "blend phone", "cut keyboard" and "microwave shoes". It aims to evaluate the zero-shot and few-shot compositionality of action recognition models on unlikely compositions of common action verbs and object nouns. It contains 122 different actions, obtained by combining verbs and nouns that rarely co-occur in the large-scale textual corpus of HowTo100M but frequently appear separately.

12 papers · 2 benchmarks · Videos

UCFRep

The UCFRep dataset contains 526 annotated repetitive action videos. This dataset is built from the action recognition dataset UCF101.

12 papers · 4 benchmarks · Videos

VideoSet

VideoSet is a large-scale compressed video quality dataset based on just-noticeable-difference (JND) measurement; a sketch of the JND search idea follows this entry.

12 papers · 0 benchmarks · Videos
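To make the JND idea concrete: for each source clip, the goal is to find the lowest quality level at which viewers first notice a difference from the reference. The bisection sketch below assumes distinguishability is monotone in the degradation level; looks_different stands in for an actual subjective comparison and is not part of VideoSet's tooling.

def find_jnd(levels, looks_different):
    """Return the first level judged distinguishable from the reference.
    `levels` is ordered from least to most degraded."""
    lo, hi = 0, len(levels) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_different(levels[mid]):
            hi = mid  # noticeable here, so the JND point is at or before mid
        else:
            lo = mid + 1  # not noticeable yet; search more degraded levels
    return levels[lo]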

YT-UGC (YouTube UGC)

YT-UGC is a large-scale UGC (User-Generated Content) dataset of 1,500 20-second video clips sampled from millions of YouTube videos. The dataset covers popular categories like Gaming and Sports, as well as newer features like High Dynamic Range (HDR). It can be used to study video compression and quality assessment.

12 papers · 0 benchmarks · Videos

Hyper-Kvasir Dataset

The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, totaling around 1 million images and video frames.

12 papers · 3 benchmarks · Biomedical, Images, Medical, Videos
Page 17 of 51