VTC is a large-scale multimodal dataset of roughly 300k video-caption pairs, each accompanied by comments, that can be used for multimodal representation learning.
DeePhy is a novel DeepFake Phylogeny dataset consisting of 5,040 DeepFake videos generated using three different generation techniques. It is one of the first datasets to incorporate the concept of DeepFake Phylogeny, which refers to generating DeepFakes by applying multiple generation techniques sequentially.
The HAMMER dataset contains 13 scenes. Each scene is recorded in two setups, with and without objects (the "with" setup includes several objects with varying surface materials; the "without" setup shows only the bare background), and under two camera trajectories. Each trajectory comprises roughly 300 frames, which adds up to about 16k frames in total (13 x 2 x 2 x 300). Each trajectory contains corresponding images from each camera: d435 – stereo, l515 – LiDAR (D-ToF), polarization – RGBP (RGB with polarization), and tof – I-ToF. Each camera folder contains the camera's intrinsics file and its recorded images, together with rendered depth ground truth, instance ground truth, and camera poses. All cameras are fully synchronized via the robotic arm's data acquisition setup.
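As a rough illustration of that layout, the sketch below walks a hypothetical HAMMER-style directory tree and tallies image files per camera; the folder structure and file extension are assumptions for illustration, not the official release format.

```python
import os
from collections import Counter

# Hypothetical layout (an assumption, not the official release format):
# <root>/<scene>/<setup>/<trajectory>/<camera>/rgb/*.png
CAMERAS = ["d435", "l515", "polarization", "tof"]

def count_frames_per_camera(root: str) -> Counter:
    """Tally image files found under each camera folder of a HAMMER-style tree."""
    counts = Counter()
    for dirpath, _, filenames in os.walk(root):
        parts = dirpath.split(os.sep)
        for cam in CAMERAS:
            if cam in parts:
                counts[cam] += sum(f.endswith(".png") for f in filenames)
    return counts

# Sanity check against the numbers quoted above:
# 13 scenes x 2 setups x 2 trajectories x ~300 frames ≈ 15,600 (~16k) frames per camera.
print(13 * 2 * 2 * 300)
```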
SuHiFiMask (Surveillance High-Fidelity Mask) extends face anti-spoofing (FAS) to real surveillance scenes rather than mimicking low-resolution images and surveillance environments. It contains 10,195 videos from 101 subjects of different age groups, collected by 7 mainstream surveillance cameras.
ActionBench contains two carefully designed probing tasks, Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Action knowledge involves understanding the textual, visual, and temporal aspects of actions. The benchmark is constructed by leveraging two existing open-domain video-language datasets, Ego4D and Something-Something v2 (SSv2), which provide fine-grained action annotations for each video clip.
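As an illustration only, the sketch below shows roughly how probe pairs for the two tasks could be built; the antonym list and function names are assumptions, not the official ActionBench construction pipeline.

```python
# Illustrative sketch of the two probing tasks; the antonym mapping and data
# fields are assumptions, not the official ActionBench pipeline.
ANTONYMS = {"open": "close", "push": "pull", "lift": "drop"}  # assumed examples

def action_antonym_probe(caption: str):
    """Return (original, antonym-swapped) captions; the model should prefer the original."""
    for verb, antonym in ANTONYMS.items():
        if verb in caption.split():
            return caption, caption.replace(verb, antonym, 1)
    return None  # no covered verb -> clip not used for this probe

def video_reversal_probe(frames):
    """Return (original, temporally reversed) frame sequences for the reversal probe."""
    return frames, list(reversed(frames))

print(action_antonym_probe("push the drawer of the desk"))
```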
Website: https://asankagp.github.io/droneaction/
Existing image/video datasets for cattle behavior recognition are mostly small, lack well-defined labels, or are collected in unrealistic controlled environments. This limits the utility of machine learning (ML) models learned from them. Therefore, we introduce a new dataset, called Cattle Visual Behaviors (CVB), that consists of 502 video clips, each fifteen seconds long, captured in natural lighting conditions and annotated with eleven visually perceptible behaviors of grazing cattle. By creating and sharing CVB, our aim is to develop improved models capable of recognizing all important cattle behaviors accurately and to assist other researchers and practitioners in developing and evaluating new ML models for cattle behavior classification using video data. The dataset is organized into the following three sub-directories. 1. raw_frames: contains 450 frames in each sub-folder, representing a 15-second video taken at a frame rate of 30 FPS. 2. annotations: contains the json file
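A minimal loading sketch under the layout described above; the folder and file naming are assumptions based on that description, not a published loader.

```python
import json
from pathlib import Path

FPS = 30
CLIP_SECONDS = 15  # 15 s x 30 FPS = 450 frames per clip, matching the description above

def load_clip(dataset_root: str, clip_id: str):
    """Return frame paths and the annotation record for one CVB clip (assumed layout)."""
    root = Path(dataset_root)
    frame_dir = root / "raw_frames" / clip_id                  # assumed per-clip sub-folder
    frames = sorted(frame_dir.glob("*.jpg"))                   # expect 450 frames per clip
    with open(root / "annotations" / f"{clip_id}.json") as f:  # assumed one JSON per clip
        annotation = json.load(f)
    return frames, annotation
```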
The YouTube8M-MusicTextClips dataset consists of over 4k high-quality human text descriptions of music found in video clips from the YouTube8M dataset.
A dataset of cartoon video clips. For each clip, annotators marked the presence or absence of each feature.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside the deforming ones in order to establish an anchoring reference. However, this assumption does not hold in certain relevant application cases such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground-truth labels, including camera poses, RGB and depth images, optical flow, and normal maps at high resolution and quality.
To advance methods for pain assessment, in particular automatic assessment methods, the BioVid Heat Pain Database was collected in a collaboration of the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. In our study, 90 participants were subjected to experimentally induced heat pain at four intensities. To compensate for varying heat pain sensitivities, the stimulation temperatures were adjusted based on the subject-specific pain threshold and pain tolerance. Each of the four pain levels was stimulated 20 times in randomized order. For each stimulus, the maximum temperature was held for 4 seconds. The pauses between stimuli were randomized between 8 and 12 seconds. The pain stimulation experiment was conducted twice: once with the face un-occluded and once with facial EMG sensors attached.
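For a quick sense of scale, the back-of-the-envelope count below is derived from the protocol just described; it is illustrative arithmetic, not an official dataset statistic.

```python
# Stimulus count implied by the protocol described above (illustrative only).
participants = 90
pain_levels = 4
repetitions_per_level = 20
experiment_runs = 2          # un-occluded face, and with facial EMG sensors attached

per_participant_per_run = pain_levels * repetitions_per_level        # 80 stimuli
total_stimuli = participants * per_participant_per_run * experiment_runs
print(per_participant_per_run, total_stimuli)                        # 80, 14400
```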
StoryBench is a multi-task benchmark to reliably evaluate the ability of text-to-video models to generate stories from a sequence of captions and their duration. It includes three datasets (DiDeMo, Oops, UVO) and three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts.
Video sequences captured at a field on Campus Kleinaltendorf (CKA), University of Bonn, by BonBot-I, an autonomous weeding robot. The data was captured with an Intel RealSense D435i sensor mounted with a nadir view of the ground.
Sign languages are the primary means of communication for a large number of people worldwide. Recently, the availability of sign language translation datasets has facilitated the incorporation of sign language research into the NLP community. Although a wide variety of research focuses on improving translation systems for sign language, the lack of ample annotated resources hinders research in the data-driven natural language processing community. In this resource paper, we introduce ISLTranslate, a translation dataset for continuous Indian Sign Language (ISL) consisting of 30k ISL-English sentence pairs. To the best of our knowledge, it is the first and largest translation dataset for continuous Indian Sign Language with corresponding English transcripts. We provide a detailed analysis of the dataset and examine the distribution of words and phrases covered in the proposed dataset. To validate the performance of existing end-to-end sign language to spoken language translation systems, w
3DYoga90 is organized within a three-level label hierarchy. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources.
CholecTrack20 is a surgical video dataset focusing on laparoscopic cholecystectomy and designed for surgical tool tracking, featuring 20 annotated videos. The dataset includes detailed labels for multi-class multi-tool tracking, offering trajectories for tool visibility within the camera scope, intracorporeal movement within the patient's body, and the life-long intraoperative trajectory of each tool. Annotations cover spatial coordinates, tool class, operator identity, phase, visual conditions (occlusion, bleeding, smoke), and more for tools like grasper, bipolar, hook, scissors, clipper, irrigator, and specimen bag, with annotations provided at 1 frame per second across 35K frames and 65K instance tool labels. The dataset uses official splits, allocating 10 videos for training, 2 for validation, and 8 for testing.
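As a concrete but hypothetical illustration of what one annotation record could look like in code, the dataclass below mirrors the fields listed above; the field names and types are assumptions, not the dataset's actual annotation schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical record mirroring the fields described above; the names and types
# are assumptions for illustration, not CholecTrack20's actual schema.
@dataclass
class ToolAnnotation:
    video_id: str                 # one of the 20 annotated videos
    frame_idx: int                # annotated at 1 frame per second
    bbox_xywh: Tuple[float, float, float, float]  # spatial coordinates
    tool_class: str               # e.g. "grasper", "hook", "clipper"
    operator: str                 # identity of whoever operates the tool
    phase: str                    # surgical phase
    occluded: bool                # visual conditions
    bleeding: bool
    smoke: bool
    track_id_visibility: int      # trajectory while visible in the camera scope
    track_id_intracorporeal: int  # trajectory inside the patient's body
    track_id_intraoperative: int  # life-long trajectory over the whole operation
```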
Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos come from various content formats, such as podcasts, lectures, news, corporate events & promotional content, and, more broadly, videos from individual content creators. We refer to the paper for further information.
We release the dataset for non-commercial research. Submit requests at https://forms.gle/6WPEGNWbYoEe6bte8.
Audiovisual Moments in Time (AVMIT) is a large-scale dataset of audiovisual action events. The dataset includes the annotation of 57,177 audiovisual videos from the Moments in Time dataset, each independently evaluated by 3 of 11 trained participants. Each annotation pertains to whether the labelled audiovisual action event is present and whether it is the most prominent feature of the video. The dataset also provides a curated test set of 960 videos across 16 classes, suitable for comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
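A minimal sketch of how one might filter such annotations by rater agreement; the record fields and the unanimous-agreement criterion are assumptions for illustration, not AVMIT's actual curation procedure.

```python
# Illustrative filtering by annotator agreement; field names and the unanimous
# criterion are assumptions, not AVMIT's actual curation rule.
annotations = [
    # each video was rated by 3 of the 11 trained participants
    {"video": "vid_00001", "ratings": [{"present": True, "prominent": True}] * 3},
    {"video": "vid_00002", "ratings": [{"present": True,  "prominent": False},
                                       {"present": False, "prominent": False},
                                       {"present": True,  "prominent": True}]},
]

def unanimous(record):
    """Keep videos where all three raters agree the event is present and prominent."""
    return all(r["present"] and r["prominent"] for r in record["ratings"])

curated = [rec["video"] for rec in annotations if unanimous(rec)]
print(curated)  # -> ['vid_00001']
```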