Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

VILT (Video Instructions Linking for Complex Tasks)

VILT is a new benchmark collection of tasks and multimodal video content. The video linking collection includes annotations for 10 recipe tasks, which the annotators chose from a random subset of the collection of 2,275 high-quality 'Wholefoods' recipes. Linking annotations cover 61 query steps containing cooking techniques, chosen from the 189 total recipe steps. As each query step results in approximately 10 videos to annotate, the collection consists of 831 linking judgments.

1 paper · 0 benchmarks · Videos

VISOR - Semi supervised video object segmentation (val)

VISOR is a dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, and it contains 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours of 179 untrimmed videos.

1 paper · 0 benchmarks · Images, Videos

Trailers12k

Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.

1 paper · 0 benchmarks · Images, Texts, Videos

Virtual-Pedcross-4667

Virtual-PedCross-4667 is a dataset for pedestrian crossing prediction. It consists of 4,667 video sequences, 2,862 pedestrian crossing sequences and 1,804 not-crossing sequences. In total, 745k video frames at a resolution of 1280×720 are included.

1 paper · 0 benchmarks · Videos

FISHTRAC (Nguyen Minh Khiem)

A dataset of real-world underwater videos annotated with multi-object tracking labels. The data was collected off the coast of the Big Island of Hawaii, and the primary goal is to help scientists study fish behavior, with the aim of conserving rare and beautiful fish species.

1 paper · 0 benchmarks · Videos

DIO (Discovering Interacted Objects)

Discovering Interacted Objects (DIO) is a benchmark containing 51 interactions and 1,000+ objects designed for Spatio-temporal Human-Object Interaction (ST-HOI) detection.

1 paper · 0 benchmarks · Videos

INDRA (INdian Dataset for RoAd crossing)

INDRA is a dataset capturing videos of Indian roads from the pedestrian point of view. INDRA contains 104 videos comprising 26k 1080p frames, each annotated with a binary road-crossing safety label and vehicle bounding boxes.

1 paper · 0 benchmarks · Videos

Vident-lab

Vident-lab is a dataset of dental videos with multi-task labels to facilitate further research in relevant video processing applications. Each sample consists of a low-quality frame, its high-quality counterpart, a teeth segmentation mask, and an inter-frame homography matrix. The homography warps the current frame to the previous frame with respect to the teeth. The dataset has training, validation, and test sets of 300, 29, and 80 videos, respectively.

1 paper · 0 benchmarks · Videos
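To make the inter-frame homography label concrete, here is a minimal sketch of how a 3×3 homography maps pixel coordinates from one frame into another. It uses only NumPy; the function name and shapes are illustrative assumptions, not part of the Vident-lab release.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an array of (x, y) pixel coordinates.

    Illustrative sketch: this is the kind of mapping an inter-frame
    homography label encodes (current-frame pixels -> previous-frame
    positions), not code from the dataset itself.
    """
    pts = np.asarray(pts, dtype=float)
    ones = np.ones((len(pts), 1))
    homog = np.hstack([pts, ones])          # lift to homogeneous coordinates
    mapped = homog @ H.T                    # apply the projective transform
    return mapped[:, :2] / mapped[:, 2:3]   # divide out the scale factor

# A pure-translation homography shifts points by (tx, ty).
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0, 1.0]])
print(warp_points(H, [(10.0, 20.0)]))  # [[15. 17.]]
```

In practice a full-frame warp would be done with an image-warping routine (e.g. OpenCV's `cv2.warpPerspective`), but the point-wise form above shows exactly what the matrix encodes.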

MOET

MOET is a dataset of gaze data from participants tracking specific objects in crowded real-world videos, annotated with labels and bounding boxes, for training and evaluating attention decoding algorithms.

1 paper · 0 benchmarks · Videos

YouwikiHow

YouwikiHow is a dataset for Weakly-Supervised temporal Article Grounding (WSAG). It contains 47K videos and an average of 20.8 query sentences for each video.

1 paper · 0 benchmarks · Texts, Videos

MH-FED (Meta Human Facial Expression Dataset)

This dataset provides a collection of 162K images and 70 videos of Meta-Humans. Ten highly realistic Meta-Humans express 7 facial expressions.

1 paper · 0 benchmarks · Images, Videos

42Street

The 42Street dataset is based on a theater play, created using a public recording of the 42Street theatre play [42street]. The play is 1.5 hours long and was split into 5 equally long parts of 20 minutes each, with various clothing changes between the parts.

1 paper · 0 benchmarks · Videos

VISEM-Tracking

VISEM-Tracking is a dataset consisting of 20 thirty-second video recordings of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by domain experts. It is an extension of the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy access and analysis.

1 paper · 0 benchmarks · Biology, Medical, Videos

STVD-PVCD (Partial Video Copy Detection Dataset)

STVD is the largest public dataset for the PVCD task. It contains about 83 thousand videos with a total duration of more than 10 thousand hours, including more than 420 thousand video copy pairs. It offers different test sets for fine-grained performance characterization (frame degradation, global transformation, video speed changes, etc.), with frame-level annotation for real-time detection and video alignment. Baseline comparisons are reported and show room for improvement. More information about the STVD dataset can be found in the publications [1, 2].

1 paper · 1 benchmark · Videos

Accidental Turntables

Accidental Turntables contains a challenging set of 41,212 images of cars with cluttered backgrounds, motion blur, and illumination changes, which serves as a benchmark for 3D pose estimation.

1 paper · 0 benchmarks · Images, Videos

Werewolf Among Us

Werewolf Among Us is a multimodal dataset for modeling persuasion behaviors. It contains 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance-level annotations of persuasion strategy, and game-level annotations of deduction game outcomes.

1 paper · 0 benchmarks · Dialog, Videos

CAP-DATA

CAP-DATA is a large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames, together with labeled fact-effect-reason-introspection descriptions and temporal accident frame labels. It can support many useful tasks for accident inference, such as accident detection and prediction (AccidentDet/Pre), causal inference of accidents (Accident-Causal), accident classification (Accident-Cla), text-video-based accident retrieval (Accident-Retri), and question answering about an accident (Accident-QA) in the driving scene.

1 paper · 0 benchmarks · Videos

QoEVAVE (Quality of Experience Evaluation of Interactive Virtual Environments with Audiovisual Scenes)

Quality of Experience Evaluation of Interactive Virtual Environments with Audiovisual Scenes (QoEVAVE) provides an initial audiovisual database consisting of 12 sequences capturing real-life nature and urban scenes. The maximum video resolution is 7680×3840 (8K) at 60 frames per second, with 4th-order Ambisonics spatial audio (4OA). All video sequences are recorded with a minimum target duration of 60 seconds and designed to represent real-life settings for systematically evaluating various dimensions of uni-/multimodal perception, cognition, behavior, and quality of experience (QoE) in a controlled virtual environment. This database serves as novel high-quality reference material with an equal focus on auditory and visual sensory information within the QoE community.

1 paper · 0 benchmarks · Videos

RTB (Robot Tracking Benchmark)

The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body objects. It was created using the procedural rendering pipeline BlenderProc. The dataset contains photo-realistic sequences with HDRi lighting and physically-based materials. Perfect ground truth annotations for camera and robot trajectories are provided in the BOP format. Many physical effects, such as motion blur, rolling shutter, and camera shaking, are accurately modeled to reflect real-world conditions. For each frame, four depth qualities exist to simulate sensors with different characteristics. While the first quality provides perfect ground truth, the second considers measurements with the distance-dependent noise characteristics of the Azure Kinect time-of-flight sensor. Finally, for the third and fourth quality, two stereo RGB images with and without a pattern from a simulated dot projector were rendered. Depth images were then reconstructed from these stereo images.

1 paper · 2 benchmarks · 3D, 3D meshes, 6D, Images, RGB-D, Tracking, Videos

Dubbing Test Set

Dubbing Test Set consists of two subsets extracted from the En→De test set of COVOST-2, a large-scale multilingual speech translation corpus based on Common Voice. Specifically, the first subset is created by randomly sampling 91 sentences (test91), while the second comprises 101 sentences randomly sampled from the longest 10% of the De side of the test set (test101).

1 paper · 0 benchmarks · Texts, Videos
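The two-subset construction described above can be sketched in a few lines. This is an illustrative assumption of the procedure (uniform sampling, a simple length criterion for "longest 10%"), not the authors' exact code; only the subset sizes (91 and 101) come from the dataset description.

```python
import random

def make_dubbing_subsets(de_sentences, seed=0):
    """Sketch of the two Dubbing Test Set splits.

    test91:  91 sentences sampled uniformly from the full test set.
    test101: 101 sentences sampled from the longest 10% of the German side.
    The sampling details here are an illustrative assumption.
    """
    rng = random.Random(seed)
    test91 = rng.sample(de_sentences, 91)
    # Rank by length and keep the top decile (at least 101 so sampling works).
    longest = sorted(de_sentences, key=len, reverse=True)
    top_decile = longest[: max(101, len(de_sentences) // 10)]
    test101 = rng.sample(top_decile, 101)
    return test91, test101
```

A real implementation would likely measure length in tokens rather than characters and deduplicate overlap between the subsets; the sketch only shows the overall shape of the selection.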
Page 42 of 51