Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

DDAD (Dense Depth for Autonomous Driving)

DDAD is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting. DDAD contains scenes from urban settings in the United States (San Francisco, Bay Area, Cambridge, Detroit, Ann Arbor) and Japan (Tokyo, Odaiba).

73 papers · 10 benchmarks · Images, Videos

PKU-MMD

The PKU-MMD dataset is a large skeleton-based action detection dataset. It contains 1,076 long untrimmed video sequences performed by 66 subjects in three camera views. 51 action categories are annotated, resulting in almost 20,000 action instances and 5.4 million frames in total. Similar to NTU RGB+D, there are two recommended evaluation protocols, i.e., cross-subject and cross-view.
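The two protocols differ only in how sequences are partitioned into train and test sets: cross-subject holds out a set of performers, cross-view holds out a camera view. A minimal sketch of that partitioning logic (the subject and view IDs below are made up for illustration, not the official split lists):

```python
# Illustration of cross-subject vs. cross-view evaluation splits.
# Subject/view IDs and sequence names are hypothetical, not the
# official PKU-MMD split lists.

def split(records, test_subjects=None, test_views=None):
    """Partition (subject, view, name) records into train/test lists."""
    train, test = [], []
    for subject, view, name in records:
        held_out = (test_subjects is not None and subject in test_subjects) \
            or (test_views is not None and view in test_views)
        (test if held_out else train).append(name)
    return train, test

records = [
    (1, "L", "seq_001"), (1, "M", "seq_002"),
    (2, "M", "seq_003"), (3, "R", "seq_004"),
]

# Cross-subject: sequences from held-out performers go to test.
cs_train, cs_test = split(records, test_subjects={3})
# Cross-view: sequences from the held-out (middle) camera go to test.
cv_train, cv_test = split(records, test_views={"M"})
```

The point of both protocols is that no subject (or view) appears in both partitions, so results measure generalization to unseen performers or viewpoints.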

72 papers · 44 benchmarks · Images, Videos

YouTube-UGC (YouTube UGC dataset)

This YouTube dataset is a sample of thousands of User Generated Content (UGC) videos uploaded to YouTube and distributed under the Creative Commons license. It was created to assist in the advancement of video compression and quality assessment research for UGC videos.

70 papers · 3 benchmarks · Videos

OmniObject3D

OmniObject3D is a large-vocabulary 3D object dataset with a massive number of high-quality, real-scanned 3D objects.

70 papers · 0 benchmarks · 3D, 3d meshes, Images, Point cloud, Videos

MSP-IMPROV (MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception)

We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and emotion while also promoting naturalness in the recordings. Studies on emotion perception often require stimuli with fixed lexical content that nonetheless convey different emotions. These stimuli can also serve as an instrument to understand how emotion modulates speech at the phoneme level, in a manner that controls for coarticulation. Such audiovisual data are not easily available from natural recordings. A common solution is to record actors reading sentences that portray different emotions, which may not produce natural behaviors. We propose an alternative approach in which we define hypothetical scenarios for each sentence that are carefully designed to elicit a particular emotion. Two actors improvise these emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions.

70 papers · 2 benchmarks · Audio, Images, Videos

Multimodal Opinion-level Sentiment Intensity (MOSI)

Multimodal Opinion-level Sentiment Intensity (MOSI) contains: (1) multimodal observations including transcribed speech and visual gestures as well as automatic audio and visual features, (2) opinion-level subjectivity segmentation, (3) sentiment intensity annotations with high coder agreement, and (4) alignment between words, visual and acoustic features.

69 papers · 0 benchmarks · Speech, Videos

MOT15 (Multiple Object Tracking 15)

MOT15 is a dataset for multiple object tracking. It contains 11 different indoor and outdoor scenes of public places with pedestrians as the objects of interest, where camera motion, camera angle, and imaging conditions vary greatly. The dataset provides detections generated by an ACF-based detector.

67 papers · 4 benchmarks · Tracking, Videos

WLASL (Word-Level American Sign Language)

WLASL is a large video dataset for Word-Level American Sign Language (ASL) recognition, featuring 2,000 common ASL words.

66 papers · 1 benchmark · Videos

MMI (MMI Facial Expression Database)

The MMI Facial Expression Database consists of over 2,900 videos and high-resolution still images of 75 subjects. It is fully annotated for the presence of action units (AUs) in videos (event coding), and partially coded at the frame level, indicating for each frame whether an AU is in the neutral, onset, apex, or offset phase. A small part was annotated for audio-visual laughter.

65 papers · 6 benchmarks · Images, Videos

CAD-120

The CAD-60 and CAD-120 datasets comprise RGB-D video sequences of humans performing activities, recorded using the Microsoft Kinect sensor. Being able to detect human activities is important for making personal assistant robots useful in performing assistive tasks. The CAD-120 dataset comprises twelve different activities (composed of several sub-activities) performed by four people in different environments, such as a kitchen, a living room, and an office.

65 papers · 8 benchmarks · Images, RGB-D, Videos

FakeAVCeleb

FakeAVCeleb is a novel audio-video deepfake dataset that contains not only deepfake videos but also the corresponding synthesized cloned audio.

65 papers · 12 benchmarks · Videos

Mall (Mall Dataset)

The Mall is a dataset for crowd counting and profiling research. Its images are collected from a publicly accessible webcam. It includes 2,000 video frames, and the head position of every pedestrian in all frames is annotated; in total, more than 60,000 pedestrians are annotated in this dataset.

64 papers · 1 benchmark · Images, Videos

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

63 papers · 13 benchmarks · Videos

CSL-Daily

CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous sign language translation (SLT) dataset. It provides both spoken-language translations and gloss-level annotations. The topics revolve around people's daily lives (e.g., travel, shopping, medical care), the most likely SLT application scenario.

63 papers · 2 benchmarks · RGB Video, Texts, Videos

UTD-MHAD

The UTD-MHAD dataset consists of 27 different actions performed by 8 subjects. Each subject performed each action 4 times, resulting in 861 action sequences in total (after removal of three corrupted sequences). RGB, depth, skeleton, and inertial sensor signals were recorded.

62 papers · 3 benchmarks · Images, Videos

COCO-QA

COCO-QA is a dataset for visual question answering in which question-answer pairs are automatically generated from the MS COCO image captions.

62 papers · 0 benchmarks · Images, Videos

SFEW (Static Facial Expression in the Wild)

The Static Facial Expressions in the Wild (SFEW) dataset is a dataset for facial expression recognition. It was created by selecting static frames from the AFEW database by computing key frames based on facial point clustering. The most commonly used version, SFEW 2.0, was the benchmarking data for the SReco sub-challenge in EmotiW 2015. SFEW 2.0 has been divided into three sets: Train (958 samples), Val (436 samples) and Test (372 samples). Each of the images is assigned to one of seven expression categories, i.e., anger, disgust, fear, neutral, happiness, sadness, and surprise. The expression labels of the training and validation sets are publicly available, whereas those of the testing set are held back by the challenge organizer.

61 papers · 6 benchmarks · Images, Videos

TVQA+

TVQA+ contains 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers.

60 papers · 0 benchmarks · Texts, Videos

CASIA-MFSD

CASIA-MFSD is a dataset for face anti-spoofing. It contains 50 subjects, and 12 videos for each subject under different resolutions and light conditions. Three different spoof attacks are designed: replay, warp print and cut print attacks. The database contains 600 video recordings, in which 240 videos of 20 subjects are used for training and 360 videos of 30 subjects for testing.
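The figures in the description are internally consistent, and the train/test partition is subject-disjoint; a quick arithmetic check of the stated counts:

```python
# Sanity-check the CASIA-MFSD counts stated above: 50 subjects with
# 12 videos each, split subject-disjointly into 20 training and
# 30 testing subjects.
subjects, videos_per_subject = 50, 12
train_subjects, test_subjects = 20, 30

total_videos = subjects * videos_per_subject         # 600 recordings
train_videos = train_subjects * videos_per_subject   # 240 videos
test_videos = test_subjects * videos_per_subject     # 360 videos

assert train_subjects + test_subjects == subjects
assert train_videos + test_videos == total_videos
```

As with the PKU-MMD protocols, keeping subjects disjoint across partitions ensures the anti-spoofing model is evaluated on faces it has never seen.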

58 papers · 16 benchmarks · Images, Videos

ETH (ETH Pedestrian)

ETH is a dataset for pedestrian detection. The testing set contains 1,804 images in three video clips. The dataset is captured from a stereo rig mounted on a car, with a resolution of 640 x 480 (Bayered) and a frame rate of 13-14 FPS.

58 papers · 1 benchmark · Images, Videos
Page 7 of 51