Datasets

1,019 machine learning datasets

1,019 dataset results

SUN3D

SUN3D contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed by 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.

126 papers0 benchmarks3D, Images, Point cloud, RGB-D, Videos

DFDC (Deepfake Detection Challenge)

The DFDC (Deepfake Detection Challenge) is a dataset for deepface detection consisting of more than 100,000 videos.

126 papers8 benchmarksVideos

GTEA (Georgia Tech Egocentric Activity)

The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities such as making sandwich, tea, or coffee. Each activity is performed by four different people, thus totally 28 videos. For each video, there are about 20 fine-grained action instances such as take bread, pour ketchup, in approximately one minute.

120 papers20 benchmarksImages, Videos

VATEX (Video And TEXt)

VATEX is multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has two tasks for video-and-language research: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.

118 papers26 benchmarksTexts, Videos

Something-Something V1

The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.

117 papers11 benchmarksImages, Videos

Kvasir (The Kvasir Dataset)

The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class.

117 papers2 benchmarksImages, Videos

LRS2 (Lip Reading Sentences 2)

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences in-the-wild. The database consists of mainly news and talk shows from BBC programs. Each sentence is up to 100 characters in length. The training, validation and test sets are divided according to broadcast date. It is a challenging set since it contains thousands of speakers without speaker labels and large variation in head pose. The pre-training set contains 96,318 utterances, the training set contains 45,839 utterances, the validation set contains 1,082 utterances and the test set contains 1,242 utterances.

115 papers62 benchmarksAudio, Texts, Videos

RWTH-PHOENIX-Weather 2014 T

Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public tv-station PHOENIX featuring sign language interpretation have been recorded and the weather forecasts of a subset of 386 editions have been transcribed using gloss notation. Furthermore, we used automatic speech recognition with manual cleaning to transcribe the original German speech. As such, this corpus allows to train end-to-end sign language translation systems from sign language video input to spoken language.

114 papers3 benchmarksVideos

AVA (Atomic Visual Actions)

AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:

113 papers2 benchmarksVideos

VOT2016

VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps with pixel-wise annotation of salient objects.

113 papers2 benchmarksTracking, Videos

EgoSchema

EgoSchema is very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior.

112 papers1 benchmarksVideos

OTB-2013

OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.

110 papers3 benchmarksImages, Videos

Penn Action

The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.

110 papers2 benchmarksImages, Videos

SegTrack-v2

SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video.

107 papers4 benchmarksImages, Videos

JIGSAWS (JHU-ISI Gesture and Skill Assessment Working Set)

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data was collected through a collaboration between The Johns Hopkins University (JHU) and Intuitive Surgical, Inc. (Sunnyvale, CA. ISI) within an IRB-approved study. The release of this dataset has been approved by the Johns Hopkins University IRB. The dataset was captured using the da Vinci Surgical System from eight surgeons with different levels of skill performing five repetitions of three elementary surgical tasks on a bench-top model: suturing, knot-tying and needle-passing, which are standard components of most surgical skills training curricula. The JIGSAWS dataset consists of three components:

105 papers13 benchmarksVideos

COIN

The COIN dataset (a large-scale dataset for COmprehensive INstructional video analysis) consists of 11,827 videos related to 180 different tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. The videos are all collected from YouTube. The average length of a video is 2.36 minutes. Each video is labelled with 3.91 step segments, where each segment lasts 14.91 seconds on average. In total, the dataset contains videos of 476 hours, with 46,354 annotated segments.

105 papers4 benchmarksVideos

BP4D

The BP4D-Spontaneous dataset is a 3D video database of spontaneous facial expressions in a diverse group of young adults. Well-validated emotion inductions were used to elicit expressions of emotion and paralinguistic communication. Frame-level ground-truth for facial actions was obtained using the Facial Action Coding System. Facial features were tracked in both 2D and 3D domains using both person-specific and generic approaches. The database includes forty-one participants (23 women, 18 men). They were 18 – 29 years of age; 11 were Asian, 6 were African-American, 4 were Hispanic, and 20 were Euro-American. An emotion elicitation protocol was designed to elicit emotions of participants effectively. Eight tasks were covered with an interview process and a series of activities to elicit eight emotions. The database is structured by participants. Each participant is associated with 8 tasks. For each task, there are both 3D and 2D videos. As well, the Metadata include manually annotated

104 papers21 benchmarks3D, Images, Videos

PoseTrack

The PoseTrack dataset is a large-scale benchmark for multi-person pose estimation and tracking in videos. It requires not only pose estimation in single frames, but also temporal tracking across frames. It contains 514 videos including 66,374 frames in total, split into 300, 50 and 208 videos for training, validation and test set respectively. For training videos, 30 frames from the center are annotated. For validation and test videos, besides 30 frames from the center, every fourth frame is also annotated for evaluating long range articulated tracking. The annotations include 15 body keypoints location, a unique person id and a head bounding box for each person instance.

103 papers0 benchmarksImages, Tracking, Videos

CUHK-SYSU (CUHK-SYSU Person Search Dataset)

The CUKL-SYSY dataset is a large scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous re-id benchmarks, matching query persons with manually cropped pedestrians, this dataset is much closer to real application scenarios by searching person from whole images in the gallery.

100 papers4 benchmarksImages, Videos

UAVDT (Unmanned Aerial Vehicle Benchmark Object Detection and Tracking)

UAVDT is a large scale challenging UAV Detection and Tracking benchmark (i.e., about 80, 000 representative frames from 10 hours raw videos) for 3 important fundamental tasks, i.e., object DETection (DET), Single Object Tracking (SOT) and Multiple Object Tracking (MOT).

96 papers9 benchmarksVideos

PreviousPage 5 of 51Next