Datasets

87 machine learning datasets

87 dataset results

Cambridge Landmarks

Cambridge Landmarks, a large scale outdoor visual relocalisation dataset taken around Cambridge University. Contains original video, with extracted image frames labelled with their 6-DOF camera pose and a visual reconstruction of the scene. If you use this data, please cite our paper: Alex Kendall, Matthew Grimes and Roberto Cipolla "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization." Proceedings of the International Conference on Computer Vision (ICCV), 2015.

109 papers0 benchmarksImages, RGB Video

CSL-Daily

CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous SLT dataset. It provides both spoken language translations and gloss-level annotations. The topic revolves around people's daily lives (e.g., travel, shopping, medical care), the most likely SLT application scenario.

63 papers2 benchmarksRGB Video, Texts, Videos

CUHK Avenue

Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 (15328 training, 15324 testing) frames in total.

48 papers10 benchmarksRGB Video

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. The dataset was collected by a flying UAV in multiple urban and rural districts in both daytime and nighttime over three months, hence covering extensive diversities w.r.t subjects, backgrounds, illuminations, weathers, occlusions, camera motions, and UAV flying attitudes. This dataset can be used for UAV-based human behavior understanding, including action recognition, pose estimation, re-identification, and attribute recognition.

47 papers38 benchmarksRGB Video, RGB-D

How2Sign (A Large-scale Multimodal Dataset for Continuous American Sign Language)

The How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio enabling detailed 3D pose estimation.

44 papers5 benchmarks3D, RGB Video, RGB-D, Texts

V2V4Real

V2V4Real is a large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi modal sensors driving together through diverse scenarios. It covers a driving area of 410 km comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps that cover all the driving routes.

35 papers0 benchmarksLiDAR, RGB Video, Videos

V2X-SIM

V2X-Sim, short for vehicle-to-everything simulation, is the a synthetic collaborative perception dataset in autonomous driving developed by AI4CE Lab at NYU and MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both roadside and vehicles when they are presented near the same intersection. With information from both the roadside infrastructure and vehicles, the dataset aims to encourage research on collaborative perception tasks.

32 papers24 benchmarksPoint cloud, RGB Video, Videos

EMDB

EMDB contains in-the-wild videos of human activity recorded with a hand-held iPhone. It features reference SMPL body pose and shape parameters, as well as global body root and camera trajectories. The reference 3D poses were obtained by jointly fitting SMPL to 12 body-worn electromagnetic sensors and image data. For the latter we fit a neural implicit avatar model to allow for a dense pixel-wise fitting objective.

32 papers36 benchmarks3D, Images, RGB Video, Videos

UVO (Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation)

UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world setup, UVO is significantly larger, providing approximately 8 times more videos compared with DAVIS, and 7 times more mask (instance) annotations per video compared with YouTube-VOS and YouTube-VIS. UVO is also more challenging as it includes many videos with crowded scenes and complex background motions. Some highlights of the dataset include:

27 papers4 benchmarksImages, RGB Video, Videos

VOID (Visual Odometry with Inertial and Depth)

The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements at 400 Hz, along with synchronized VGA-size (640 x 480) RGB and depth streams at 30 Hz. The depth frames are acquired using active stereo and is aligned to the RGB frame using the sensor factory calibration. All the measurements are timestamped.

26 papers4 benchmarksImages, Point cloud, RGB Video, RGB-D

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR) -motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.

22 papers15 benchmarksAudio, Dialog, Images, RGB Video, Speech, Time series, Videos

STARSS23 (STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events)

The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset is collected in two different countries, in Tampere, Finland by the Audio Researh Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array one (MIC), and first-order Ambisonics one (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.

21 papers0 benchmarksAudio, RGB Video

BURST

BURST is a benchmark suite built upon TAO that requires tracking and segmenting multiple objects from camera video. The benchmark contains 6 different sub-tasks divided into 2 groups that all share the same data for training/validation/testing.

18 papers12 benchmarksRGB Video

SCAND (Socially CompliAnt Navigation Dataset)

Have you wondered how autonomous mobile robots should share space with humans in public spaces? Are you interested in developing autonomous mobile robots that can navigate within human crowds in a socially compliant manner? Do you want to analyze human reactions and behaviors in the presence of mobile robots of different morphologies?

18 papers0 benchmarksActions, LiDAR, Point cloud, RGB Video, RGB-D, Videos

SUN-SEG-Easy (Unseen)

The SUN-SEG dataset is a high-quality per-frame annotated VPS dataset, which includes 158,690 frames from the famous SUN dataset. It extends the labels with diverse types, i.e., object mask, boundary, scribble, polygon, and visual attribute. It also introduces the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.

17 papers7 benchmarksMedical, RGB Video, Videos

SUN-SEG-Hard (Unseen)

17 papers7 benchmarksMedical, RGB Video, Videos

ViViD++ (Vision for Visibility Dataset)

A dataset capturing diverse visual data formats that target varying luminance conditions, and was recorded from alternative vision sensors, by handheld or mounted on a car, repeatedly in the same space but in different conditions.

17 papers0 benchmarks3D, Images, LiDAR, RGB Video, RGB-D

Memento10k

Memorability dataset with 10000 3-second videos. Each video has upwards of 90 human annotations, and the split-half consistency of this dataset is 0.73 (best in class for video memorabilty datasets).

17 papers0 benchmarksImages, RGB Video, Videos

RWF-2000

A database with 2,000 videos captured by surveillance cameras in real-world scenes.

16 papers1 benchmarksRGB Video

CCVID (Clothes-Changing Video person re-ID)

Clothes-Changing Video person re-ID (CCVID) is a dataset constructed from the raw data of a gait recognition dataset, i.e. FVG. The reconstructed CCVID dataset contains 347,833 bounding boxes. The length of each sequence changes from 27 to 410 frames, with an average length of 122. Besides, it also provides fine-grained clothes labels including tops, bottoms, shoes, carrying status, and accessories. For the convenience of evaluation, CCVID re-divides the training and test sets to adapt to clothes-changing re-id. Specifically, 75 identities are reserved for training, and the remaining 151 identities are used for test. In the test set, 834 sequences are used as query set, and the other 1074 sequences form gallery set.

15 papers3 benchmarksRGB Video

Page 1 of 5Next