Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets (Videos modality filter applied)

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

Everybody Dance Now

Everybody Dance Now is a video dataset for training and evaluating motion-transfer models. It contains long single-dancer videos, and all subjects consented to the use of their footage for research purposes.

15 papers · 0 benchmarks · Videos

LIVE-FB LSVQ (LIVE-FB Large-Scale Social Video Quality (LSVQ) Database)

No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved problem that is important to social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC) videos. Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos, 117,000 space-time localized video patches ("v-patches"), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-its-kind space-time video quality mapping engine (called PVQ Mapper). A sketch of the correlation metrics conventionally used to score such models follows this entry.

15 papers · 3 benchmarks · Videos
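Models benchmarked on datasets like LSVQ are conventionally scored by how well their predictions correlate with the human mean opinion scores (MOS) the dataset provides. Below is a minimal sketch of those standard agreement metrics, assuming predictions and MOS are already available as plain arrays; the variable names are illustrative and not part of the LSVQ release.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def vqa_agreement(predicted, mos):
    """Standard VQA agreement metrics between model scores and human MOS."""
    srcc, _ = spearmanr(predicted, mos)  # rank-order (monotonic) agreement
    plcc, _ = pearsonr(predicted, mos)   # linear agreement
    rmse = float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(mos)) ** 2)))
    return {"SRCC": srcc, "PLCC": plcc, "RMSE": rmse}

# Toy example: model scores vs. human MOS for four videos on a 0-100 scale.
print(vqa_agreement([62.1, 48.3, 75.0, 55.4], [60.0, 45.0, 80.0, 52.0]))
```

In practice a nonlinear (e.g., logistic) fit is often applied to the predictions before computing PLCC and RMSE; the sketch omits that step for brevity.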

MedVidQA (Medical Video Question Answering)

The MedVidQA dataset contains 3,010 manually created health-related questions, with timestamps in videos serving as visual answers, drawn from trusted video sources such as accredited medical schools with established reputations, health institutes, health educators, and medical practitioners.

15 papers · 0 benchmarks · Medical, Texts, Videos

CPED (Chinese Personalized and Emotional Dialogue)

We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multi-source knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other annotations.

15 papers · 16 benchmarks · Audio, Texts, Videos

OpenASL

OpenASL is a large-scale American Sign Language (ASL)-to-English dataset collected from online video sites (e.g., YouTube). It contains 288 hours of ASL videos in multiple domains from over 200 signers.

15 papers · 0 benchmarks · Texts, Videos

CelebV-Text

CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts describe both static and dynamic attributes precisely.

15 papers · 0 benchmarks · Texts, Videos

PointOdyssey

PointOdyssey is a large-scale synthetic dataset, and data generation framework, for training and evaluating long-term fine-grained tracking algorithms. The dataset currently includes 104 videos, averaging 2,000 frames in length, with orders of magnitude more correspondence annotations than prior work. A sketch of a typical long-term tracking metric follows this entry.

15 papers · 6 benchmarks · Videos
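Long-term trackers evaluated on data like PointOdyssey are often scored by position accuracy at several pixel thresholds, averaged into a single number (the delta-average style of metric popularized by the TAP-Vid benchmark). A hedged sketch follows, assuming tracks arrive as (T, N, 2) pixel arrays with a boolean visibility mask; the threshold set is illustrative, not PointOdyssey's official protocol.

```python
import numpy as np

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points tracked to within each pixel threshold,
    averaged over thresholds (a delta-average-style metric).

    pred, gt: (T, N, 2) arrays of track positions in pixels.
    visible:  (T, N) boolean mask of ground-truth visibility.
    """
    err = np.linalg.norm(pred - gt, axis=-1)            # (T, N) per-point error
    accs = [(err[visible] < t).mean() for t in thresholds]
    return float(np.mean(accs))

# Toy example: 2 frames, 2 points; one prediction is off by 3 pixels.
gt = np.zeros((2, 2, 2))
pred = gt.copy()
pred[1, 1] = [3.0, 0.0]
vis = np.ones((2, 2), dtype=bool)
print(position_accuracy(pred, gt, vis))  # 0.9
```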

Condensed Movies

Condensed Movies is a large-scale video dataset featuring clips from movies paired with detailed captions.

15 papers · 6 benchmarks · Texts, Videos

AVSD (Audio-Visual Scene-Aware Dialog)

The Audio Visual Scene-Aware Dialog (AVSD) dataset, also known as DSTC7 Track 3, is an audio-visual dataset for dialogue understanding. The goal of the dataset and track was to design systems that generate responses in a dialogue about a video, given the dialogue history and the audio-visual content of the video.

14 papers · 1 benchmark · Audio, Texts, Videos

4DFAB

4DFAB is a large-scale database of dynamic high-resolution 3D faces, consisting of recordings of 180 subjects captured in four sessions spanning a five-year period (2012-2017) and totaling over 1,800,000 3D meshes. It contains 4D videos of subjects displaying both spontaneous and posed facial behaviours. The database can be used for face and facial expression recognition, as well as behavioural biometrics, and to learn very powerful blendshapes for parametrising facial behaviour.

14 papers · 0 benchmarks · Images, Videos

HAA500 (Human-Centric Atomic Action Dataset)

HAA500 is a manually annotated human-centric atomic action dataset for action recognition, covering 500 classes with over 591k labeled frames. Unlike existing atomic action datasets, where coarse-grained atomic actions are labeled with action verbs, e.g., "Throw", HAA500 contains fine-grained atomic actions where only consistent actions fall under the same label, e.g., "Baseball Pitching" vs. "Free Throw in Basketball", to minimize ambiguity in action classification. HAA500 has been carefully curated to capture the movement of human figures with little spatio-temporal label noise, which greatly enhances the training of deep neural networks.

14 papers · 2 benchmarks · Videos

LIVE-YT-HFR (LIVE YouTube High Frame Rate)

LIVE-YT-HFR comprises 480 videos at 6 different frame rates, derived from 16 diverse source contents.

14 papers · 3 benchmarks · Videos

ViTT (Video Timeline Tags)

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest have been annotated twice or more, for a total of 12,461 released sets of annotations. The videos are drawn from the YouTube-8M dataset.

14 papers · 6 benchmarks · Videos

RealEstate10K

RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. For each clip, the poses form a trajectory, with each pose specifying the camera position and orientation along that trajectory. The poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos. A sketch of parsing the per-frame camera files follows this entry.

14 papers · 5 benchmarks · Stereo, Videos
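Each RealEstate10K clip ships as a small text file: a YouTube URL on the first line, then one line of camera data per frame. The sketch below assumes the commonly described 19-column layout (timestamp in microseconds, four normalized intrinsics, two unused columns, then a row-major 3x4 world-to-camera matrix); treat the exact column layout as an assumption to verify against the official release.

```python
import numpy as np

def parse_camera_line(line):
    """Parse one RealEstate10K camera line into (timestamp, intrinsics, pose).

    Assumed layout (19 whitespace-separated values):
      timestamp fx fy cx cy _ _ p00 p01 ... p23
    where fx, fy, cx, cy are normalized by image width/height, and the
    12 trailing values form a 3x4 world-to-camera matrix, row-major.
    """
    vals = line.split()
    timestamp = int(vals[0])
    fx, fy, cx, cy = map(float, vals[1:5])
    pose = np.array([float(v) for v in vals[7:19]]).reshape(3, 4)
    return timestamp, (fx, fy, cx, cy), pose

# Hypothetical line: identity rotation, zero translation.
sample = "158991000 0.98 1.74 0.5 0.5 0 0 1 0 0 0 0 1 0 0 0 0 1 0"
ts, intrinsics, pose = parse_camera_line(sample)
print(ts, intrinsics)
print(pose)
```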

MGif

MGif is a dataset of videos containing movements of different cartoon animals. Each video is an animated GIF file, and the dataset consists of 1,000 videos. It is particularly challenging because of its high appearance variation and motion diversity.

14 papers · 2 benchmarks · Videos

DeepStab

DeepStab is a dataset for online video stabilization consisting of synchronized steady/unsteady video pairs, collected with purpose-built hand-held hardware.

14 papers · 0 benchmarks · Videos

HiFiMask (CASIA-SURF HiFiMask)

HiFiMask (CASIA-SURF HiFiMask) is a large-scale high-fidelity mask dataset. It contains a total of 54,600 videos recorded from 75 subjects wearing 225 realistic masks, captured by 7 new kinds of sensors.

14 papers · 0 benchmarks · Images, Videos

HR-ShanghaiTech

HR-ShanghaiTech, the Human-Related version of the ShanghaiTech Campus dataset, was first presented by Morais et al. in the paper "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos".

14 papers · 3 benchmarks · Videos

LAV-DF (Localized Audio Visual DeepFake Dataset)

Localized Audio Visual DeepFake Dataset (LAV-DF).

14 papers · 4 benchmarks · Audio, Videos

Fisheye

The Fisheye dataset comprises synthetically generated fisheye sequences and fisheye video sequences captured with an actual fisheye camera, and is designed for fisheye motion estimation.

14 papers · 0 benchmarks · Videos
Page 16 of 51