1,019 machine learning datasets
We provide a dataset called MMAC Captions for sensor-augmented egocentric-video captioning. The dataset contains 5,002 activity descriptions, obtained by extending the CMU-MMAC dataset. A number of activity description examples can be found on the project homepage.
To provide ground-truth supervision for video consistency modeling, we build a high-quality dynamic OLAT dataset. Our capture system consists of a light stage setup with 114 LED light sources and a Phantom Flex4K-GS camera (a stationary 4K ultra-high-speed global-shutter camera running at 1,000 fps), which records dynamic OLAT image sets at 25 fps using the overlapping method. Our dynamic OLAT dataset provides sufficient semantic, temporal, and lighting consistency supervision to train our neural video portrait relighting scheme, which generalizes to in-the-wild scenarios.
The IACC.3 dataset comprises approximately 4,600 Internet Archive videos (144 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format, with durations ranging from 6.5 min to 9.5 min and a mean duration of almost 7.8 min. Most videos have donor-provided metadata available, e.g., title, keywords, and description.
A multilingual, multimodal, and multi-aspect, expertly annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K labeled short videos (~20 seconds average duration) and 100K unlabeled videos in 11 different languages, and captures popular short-video trends such as pranks, fails, romance, and comedy, expressed via unique audio-visual formats such as self-shot videos, reaction videos, lip-synching, and self-sung songs.
We introduce AVCAffe, the first audio-visual dataset with both cognitive load and affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (i.e., not scraped from the Internet) affective dataset in the English language. We recruited 106 participants from 18 different countries of origin, spanning an age range of 18 to 57 years, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips, along with task-based self-reported ground-truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, effort, and others. We believe AVCAffe will be a challenging benchmark for the deep learning research community, given the inherent difficulty of classifying affect and, in particular, cognitive load.
A simulated benchmark for evaluating multi-modal SLAM systems in large-scale dynamic environments.
The Consented Activities of People (CAP) dataset is a fine-grained activity dataset for visual AI research, curated using the Visym Collector platform. The CAP dataset contains annotated videos of fine-grained activity classes performed by consented people. Videos are recorded on mobile devices around the world from a third-person viewpoint looking down on the scene from above, and show subjects performing everyday activities. Videos are annotated with bounding-box tracks around the primary actor, along with temporal start/end frames for each activity instance, and are distributed in the vipy JSON format. An interactive visualization and video summary are available for review on the dataset distribution site.
The breast lesion detection in ultrasound videos dataset, introduced alongside a clip-level and video-level feature aggregation network (CVA-Net), consists of 188 ultrasound videos, of which 113 are labeled malignant and 75 benign. These comprise 25,272 ultrasound images in total, with the number of images per video varying from 28 to 413; 150 videos are used for training and 38 for testing. The primary intended use case is computer-aided breast cancer diagnosis, supporting systems that assist radiologists.
We introduce a new dataset of annotated surveillance videos of freely moving people taken from a distance in both indoor and outdoor scenes. The videos are captured with multiple cameras placed in eight different daily environments. People in the videos undergo large pose variations and are frequently occluded by various environmental factors. Most importantly, their eyes are mostly not clearly visible, as is often the case in surveillance videos. This is the first rigorously annotated dataset of 3D gaze directions of freely moving people captured from afar.
The Rhythmic Gymnastics dataset contains videos of four different types of gymnastics routines: ball, clubs, hoop and ribbon. Each type of routine has 250 associated videos, and the length of each video is approximately 1 min 35 s. We chose high-standard international competition videos, including videos from the 36th and 37th International Artistic Gymnastics Competitions, to construct the dataset. We have edited out the irrelevant parts of the original videos (such as replay shots and athlete warmups). We have annotated each video with three scores (a difficulty score, an execution score and a total score), which were given by the referee in accordance with the official scoring system.
A set of 221 stereo videos captured by the SOCRATES stereo camera trap in a wildlife park in Bonn, Germany between February and July of 2022. A subset of frames is labeled with instance annotations in the COCO format.
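For reference, COCO-format instance annotations such as those above are a single JSON object with "images", "annotations", and "categories" lists. Below is a minimal, hedged sketch of indexing such a file; the function name and the commented file path are illustrative and not part of any dataset distribution:

```python
import json

def index_coco(coco):
    """Map COCO category ids to names and group annotations by image id."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    by_image = {}
    for ann in coco["annotations"]:
        # COCO boxes are stored as [x, y, width, height]
        by_image.setdefault(ann["image_id"], []).append(
            (names[ann["category_id"]], ann["bbox"])
        )
    return by_image

# Usage (path is hypothetical):
# with open("instances.json") as f:
#     annotations_by_image = index_coco(json.load(f))
```

Grouping by image id is convenient for frame-level evaluation, since each video frame can then be paired directly with its instance annotations.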
NSVA is a large-scale NBA dataset for Sports Video Analysis with a focus on sports video captioning. The dataset consists of more than 32K video clips and is also designed to address two additional tasks, namely fine-grained sports action recognition and salient player identification.
JRDB-Pose is a large-scale dataset and benchmark for multi-person pose estimation and tracking using videos captured from a social navigation robot. The dataset contains challenging scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. It provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene. These annotations include 600,000 human body pose annotations and 600,000 head bounding box annotations.
CRIPP-VQA is a video question answering dataset for reasoning about the implicit physical properties of objects in a scene. It contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about actions, questions about planning to reach a goal, and descriptive questions about visible properties of the objects.
Our CCTV-Pipe dataset covers 16 defect categories, including structural and functional defects in the pipe. It contains 575 videos totaling 87 hours, collected from real-world urban pipe systems. Unlike traditional temporal action localization, the goal in this realistic scenario is to find preferable temporal locations of defects in an untrimmed CCTV video, rather than exact temporal boundaries.
We provide a database containing shot scale annotations (i.e., the apparent distance of the camera from the subject of a filmed scene) for more than 792,000 image frames. Frames belong to 124 full movies from the entire filmographies of 6 notable directors: Martin Scorsese, Jean-Luc Godard, Béla Tarr, Federico Fellini, Michelangelo Antonioni, and Ingmar Bergman. Each frame, extracted from the videos at 1 frame per second, is annotated with one of the following scale categories: Extreme Close Up (ECU), Close Up (CU), Medium Close Up (MCU), Medium Shot (MS), Medium Long Shot (MLS), Long Shot (LS), Extreme Long Shot (ELS), Foreground Shot (FS), and Insert Shot (IS). Two independent coders annotated all frames from the 124 movies, whilst a third checked their coding and made decisions in cases of disagreement. The CineScale database enables AI-driven interpretation of shot scale data and opens up a large set of research activities related to the automatic visual analysis of cinematic material.
Lyra is a dataset of 1570 traditional and folk Greek music pieces that includes audio and video (timestamps and links to YouTube videos), along with annotations that describe aspects of particular interest for this dataset, including instrumentation, geographic information and labels of genre and subgenre, among others.
RGBD1K is a benchmark for RGB-D Object Tracking which contains 1050 sequences with about 2.5M frames in total.
This dataset contains 775 video sequences, captured in the Lindenthal wildlife park (Cologne, Germany) as part of the AMMOD project, using an Intel RealSense D435 stereo camera. In addition to color and infrared images, the D435 is able to infer the distance (or "depth") to objects in the scene using stereo vision. Observed animals include various birds (at daytime) and mammals such as deer, goats, sheep, donkeys, and foxes (primarily at nighttime). A subset of 412 images is annotated with a total of 1,038 individual animal annotations, including instance masks, bounding boxes, class labels, and corresponding track IDs to identify the same individual over the entire video.
The Argoverse 2 Map Change Dataset is a collection of 1,000 scenarios with ring camera imagery, lidar, and HD maps. Two hundred of the scenarios include changes in the real-world environment that are not yet reflected in the HD map, such as new crosswalks or repainted lanes. By sharing a map dataset that labels the instances in which there are discrepancies with sensor data, we encourage the development of novel methods for detecting out-of-date map regions.