The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most research efforts in this domain focus on detecting high-quality deepfake images and videos, only a few works address the problem of localizing small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects, resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline, accompanied by a rigorous analysis of the quality of the generated data. A comprehensive benchmark of the proposed dataset using state-of-the-art deepfake detection and localization methods indicates a significant drop in their performance on the proposed dataset.
VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching. VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.
Most existing MOT datasets are captured using pinhole cameras, which are characterized by a narrow FoV and linear sensor motion. However, when panoramic-FoV capture devices experience even slight movements, the entire scene can change drastically, posing significant challenges for object tracking. QuadTrack addresses this challenge by providing a benchmark specifically designed to test MOT algorithms under dynamic, non-linear motion conditions. It enables evaluating algorithm robustness in tracking objects with panoramic, non-uniform motion.
The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and 20M validation videos. Videos average 4.6 minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more topic identifiers from a set of 30,871 labels. There are an average of around 5 labels per video. The labels are assigned automatically based on a combination of metadata (title, description, comments, etc.), context, and image content for each video. The labels apply to the entire video and range from very generic (e.g. “Song”) to very specific (e.g. “Cormorant”). Being machine-generated, the labels are not 100% accurate and of the 30K labels, some are clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are often labeled “Entertainment” as well, although no hierarchy is enforced.
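As a quick sanity check, the quoted 5.4M training hours follow directly from the split sizes and the average clip length; a minimal Python check:

```python
# Back-of-the-envelope check of the YouTube-100M totals quoted above:
# 70M training videos averaging 4.6 minutes each.
train_videos = 70_000_000
avg_minutes = 4.6

train_hours = train_videos * avg_minutes / 60
print(f"{train_hours / 1e6:.1f}M training hours")  # -> 5.4M training hours
```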
The Free-Form Video Inpainting dataset is used for training and evaluating video inpainting models. It consists of 1,940 videos from the YouTube-VOS dataset and 12,600 videos from the YouTube-BoundingBoxes dataset.
The SAMM Long Videos dataset consists of 147 long videos with 343 macro-expressions and 159 micro-expressions. The dataset is FACS-coded with detailed Action Units.
SEND (Stanford Emotional Narratives Dataset) is a set of rich, multimodal videos of self-paced, unscripted emotional narratives, annotated for emotional valence over time. The complex narratives and naturalistic expressions in this dataset provide a challenging test for contemporary time-series emotion recognition models.
A new dataset of textual stories describing events.
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs: 180,000 training pairs, 2,460 validation pairs, and 2,460 test pairs. The task for this dataset is to generate an appropriate textual summary of an article and to choose a proper cover frame from the video accompanying the article.
BosphorusSign22k is a benchmark dataset for vision-based, user-independent, isolated Sign Language Recognition (SLR). The dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus, which was collected with the purpose of helping both the linguistic and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three domains: health, finance, and commonly used everyday signs. Videos in this dataset were performed by six native signers, which makes the dataset valuable for user-independent sign language studies.
The MISAW dataset is composed of 27 sequences of micro-surgical anastomosis on artificial blood vessels performed by 3 surgeons and 3 engineering students. The dataset contains video, kinematic, and procedural descriptions synchronized at 30 Hz. The procedural descriptions cover the phases, steps, and activities performed by the participants.
We learn high-fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application, by far one of the most popular video-sharing applications across generations, featuring short videos (10-15 seconds) of diverse dance challenges. We manually selected more than 300 dance videos from TikTok dance challenge compilations, spanning different months and a variety of dance types; each captures a single person performing moderate movements that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segmented these images using the Removebg application and computed UV coordinates with DensePose.
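A minimal sketch of the frame-extraction step described above, assuming OpenCV is available; file paths are hypothetical, and the remove.bg segmentation and DensePose UV computation are external tools not shown here:

```python
# Extract RGB frames at (up to) 30 fps from a video file with OpenCV.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, target_fps: float = 30.0) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if fps is unknown
    step = max(src_fps / target_fps, 1.0)  # keep one frame every `step` source frames
    idx, saved, next_pick = 0, 0, 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= next_pick:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.png"), frame)
            saved += 1
            next_pick += step
        idx += 1
    cap.release()
    return saved

# Hypothetical usage: extract_frames("dance_clip.mp4", "frames/dance_clip")
```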
OPERAnet is a multimodal activity recognition dataset acquired from radio-frequency and vision-based sensors. Approximately 8 hours of annotated measurements are provided, collected across two different rooms from 6 participants performing 6 activities: sitting down on a chair, standing up from sitting, lying down on the ground, standing up from the floor, walking, and body rotating. The dataset was acquired from four synchronized modalities for the purpose of passive Human Activity Recognition (HAR) as well as localization and crowd counting.
The Oulu-NPU face presentation attack detection database consists of 4950 real access and attack videos. These videos were recorded using the front cameras of six mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual and OPPO N3) in three sessions with different illumination conditions and background scenes. The presentation attack types considered in the OULU-NPU database are print and video-replay. The 2D face artefacts were created using two printers and two display devices.
The ROBUST-MIS dataset was made available to support the Robust Medical Instrument Segmentation (ROBUST-MIS) Challenge 2019, part of the Endoscopic Vision Challenge associated with MICCAI.
Kinetics-100 is a dataset split created from the Kinetics dataset to evaluate the performance of few-shot action recognition models. 100 classes are randomly selected from the 400 Kinetics categories, each with 100 examples. The 100 classes are further split into 64, 12, and 24 non-overlapping classes used as the meta-training, meta-validation, and meta-testing sets, respectively. The list of selected samples can be found here: https://github.com/ffmpbgrnn/CMN/tree/master/kinetics-100
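A minimal sketch of how such a split can be constructed, assuming a list of the 400 Kinetics class names is available (the exact classes used are given in the linked repository):

```python
# Sample 100 of the 400 Kinetics classes and partition them 64/12/24
# into meta-training / meta-validation / meta-testing sets.
import random

def make_kinetics100_split(all_classes, seed=0):
    rng = random.Random(seed)
    chosen = rng.sample(all_classes, 100)  # 100 randomly ordered classes
    return {
        "meta_train": chosen[:64],
        "meta_val": chosen[64:76],
        "meta_test": chosen[76:],
    }
```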
DarkTrack2021 is a nighttime UAV tracking benchmark containing 110 challenging sequences with over 100K frames in total.
The OPRA Dataset was introduced in Demo2Vec: Reasoning Object Affordances From Online Videos (CVPR'18) for reasoning about object affordances from online demonstration videos. It contains 11,505 demonstration clips and 2,512 object images scraped from 6 popular YouTube product review channels, along with the corresponding affordance annotations. More details can be found on the project page: https://sites.google.com/view/demo2vec/.
YouTube-ASL is a large-scale, open-domain corpus of American Sign Language (ASL) videos and accompanying English captions drawn from YouTube. With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as large and has ~10x as many unique signers as the largest prior ASL dataset.