Short-Films 20K (SF20K) is the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering.
A large-scale isolated Indian sign language dataset. It contains 2,002 common words used in daily communication within the Indian deaf community. The dataset contains 40,033 videos across the 2,002 words, with a total duration of around 36.2 hours and 7.8 million frames.
Hawk Annotation Dataset provides language descriptions specifically for anomaly scenes in seven existing video anomaly datasets. These seven datasets cover a variety of anomalous scenarios, including crime (UCF-Crime), campus (ShanghaiTech and CUHK Avenue), pedestrian walkways (UCSD Ped1 and Ped2), traffic (DoTA), and human behavior (UBnormal). With the support of these visual scenarios, the dataset enables comprehensive fine-tuning across diverse abnormal scenarios, bringing models closer to open-world settings.
A large-scale, egocentric, multimodal, and context-aware dataset of human demonstrations of social navigation.
EE3P: Event-based Estimation of Periodic Phenomena Properties (Dataset). Kolář, J., Špetlík, R., Matas, J. (2024). Measuring Speed of Periodical Movements with Event Camera. In Proceedings of the 27th Computer Vision Winter Workshop, 2024.
PPED: Periodic Phenomena Event-based Dataset. The dataset features 12 one-second sequences of periodic phenomena (rotation: 01-06, flicker: 07-08, vibration: 09-10, movement: 11-12) with ground-truth frequencies ranging from 3.2 Hz up to 2000 Hz, provided in .raw and .hdf5 file formats.
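A minimal sketch of how a frequency in that range could be estimated from a 1-D periodic signal derived from such a sequence, using a plain FFT peak. The signal here is synthetic; the dataset's actual event-based estimation method is more involved.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the strongest nonzero frequency component of a 1-D signal."""
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Synthetic one-second sequence: a 50 Hz oscillation sampled at 10 kHz.
t = np.arange(0, 1.0, 1.0 / 10_000)
sig = np.sin(2 * np.pi * 50 * t)
print(round(dominant_frequency(sig, 10_000), 1))  # -> 50.0
```

With a one-second window, the FFT bin spacing is 1 Hz, so this naive estimator resolves the dataset's 3.2 Hz lower bound only coarsely; real pipelines interpolate the peak or use longer aggregation.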
This dataset comprises 1-minute fingertip video recordings collected from 150 anemic patients, ranging in age from 6 months to 32 years, with hemoglobin levels between 4.3 gm/dL and 12.4 gm/dL. The videos were recorded using a smartphone camera and flashlight to capture PPG (photoplethysmography) signals, which are essential for non-invasive hemoglobin level estimation.
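PPG extraction from such fingertip recordings is commonly done by averaging the red-channel intensity of each frame, since blood-volume changes modulate light absorption at the fingertip. A minimal NumPy sketch on a synthetic frame stack (a real pipeline would decode the video, e.g. with OpenCV, and filter the resulting waveform):

```python
import numpy as np

def ppg_from_frames(frames):
    """Extract a crude PPG waveform: mean red-channel value per frame.

    frames: array of shape (n_frames, height, width, 3), RGB order.
    """
    red = frames[:, :, :, 0].astype(np.float64)
    return red.mean(axis=(1, 2))

# Synthetic stand-in for a fingertip video: brightness pulsing at ~1.5 Hz
# (about 90 bpm) over 30 frames at 30 fps, plus sensor noise.
rng = np.random.default_rng(0)
t = np.arange(30) / 30.0
pulse = 128 + 20 * np.sin(2 * np.pi * 1.5 * t)
frames = pulse[:, None, None, None] + rng.normal(0, 1, (30, 8, 8, 3))
signal = ppg_from_frames(frames)
print(signal.shape)  # -> (30,)
```

Averaging over the whole frame suppresses per-pixel noise, which is why even a low-resolution crop recovers the pulse waveform.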
ConSLAM is a real-world dataset collected periodically on a construction site to measure the accuracy of mobile scanners' SLAM algorithms.
The ENF moving video dataset, a subset of the dataset used in Temporal Localization of Non-Static Digital Videos Using the Electrical Network Frequency, consists of video recordings without an audio channel, each coupled with the corresponding power ENF signal reference in WAV format at a rate of 1 kHz. The dataset comprises 8 video clips recorded in Europe at 29.97 frames per second, each approximately 11-12 minutes long, using a GoPro Hero 4 Black and an NK AC3061-4KN camera. In terms of content, videos 1-3 are entirely stationary, videos 4-5 are predominantly stationary with some movement, and videos 6-8 are non-stationary, meaning the camera is fixed but there are moving objects in most frames. All videos depict natural, everyday indoor scenes (i.e., not plain backgrounds).
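Matching a video against such a reference typically starts by tracking the instantaneous mains frequency in short windows of the reference signal. A rough NumPy sketch, assuming European mains at a nominal 50 Hz and the dataset's 1 kHz sampling rate (the windowing parameters are illustrative, not the paper's method):

```python
import numpy as np

def track_enf(signal, sample_rate=1000, nominal=50.0, window_s=2.0):
    """Estimate mains frequency per window via an FFT peak near the nominal value."""
    n = int(window_s * sample_rate)
    estimates = []
    for start in range(0, len(signal) - n + 1, n):
        chunk = signal[start:start + n] * np.hanning(n)
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
        band = (freqs > nominal - 1.0) & (freqs < nominal + 1.0)
        estimates.append(freqs[band][np.argmax(spectrum[band])])
    return np.array(estimates)

# Synthetic ENF reference: 10 s of a tone drifting slightly around 50 Hz.
t = np.arange(0, 10.0, 1e-3)
ref = np.sin(2 * np.pi * (50.0 + 0.05 * np.sin(0.2 * t)) * t)
print(track_enf(ref))
```

The resulting frequency-over-time track from the reference WAV can then be correlated against the track extracted from the video's luminance flicker to localize the recording in time.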
DailyMoth-70h is a fully self-contained ASL-to-English sign language dataset containing over 70h of video (48K clips) with aligned English captions of a single native ASL signer (white, male, and early middle-aged) from the ASL news channel TheDailyMoth. The primary purpose of the dataset is to be used as a benchmark and analysis dataset for (gloss-free) sign language translation.
A novel audio-visual mouse saliency (AViMoS) dataset with the following key features:
A Chinese sign language dataset that includes dialogue information.
A small-scale, real-world Project Aria dataset with high-quality static 3D oriented bounding box annotations.
We introduce Bukva, a video dataset for the Russian Dactyl Recognition task. The Bukva dataset is about 27 GB in size and contains 3,757 RGB videos, with more than 101 samples for each RSL alphabet sign, including dynamic ones. The dataset is divided into training and test sets by subject (user_id): the training set includes 3,097 videos, and the test set includes 660 videos. The total video recording time is ~4 hours. About 17% of the videos are recorded in HD format, and 70% are in FullHD resolution.
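A split by subject ID, as described above, is typically done so that no signer appears in both sets. A generic sketch; the sample schema and `user_id` field here are hypothetical, not Bukva's actual annotation format:

```python
def split_by_subject(samples, test_subjects):
    """Partition samples into train/test so that subjects never overlap.

    samples: iterable of dicts with a 'user_id' key (hypothetical schema).
    test_subjects: set of subject IDs held out for the test set.
    """
    train = [s for s in samples if s["user_id"] not in test_subjects]
    test = [s for s in samples if s["user_id"] in test_subjects]
    return train, test

# Toy example: three subjects with three clips each; hold out subject "c".
samples = [{"user_id": u, "video": f"{u}_{i}.mp4"}
           for u in ("a", "b", "c") for i in range(3)]
train, test = split_by_subject(samples, {"c"})
print(len(train), len(test))  # -> 6 3
```

Splitting by subject rather than by clip prevents a model from scoring well merely by memorizing individual signers' appearance.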
A new large-scale, in-the-wild Mandarin dataset, CAS-VSR-S101, with 101.1 hours of data. The videos are sourced from broadcast news and conversational programs in Chinese, covering a highly diverse set of topics, speakers, and filming conditions. The lengths of the utterances are naturally distributed between 0.01 s and 10.57 s, and image qualities and resolutions vary. News accounts for 82.4% of the programs; 70.4% of the utterances depict news anchors, hosts, and correspondents, while 29.6% are those of interviewees and guests. In addition, at a ratio of approximately 1.5:1, male and female appearances are relatively balanced. The dataset is divided into train, validation, and test sets by TV channel to minimize speaker overlap, at a ratio of roughly 8:1:1.5 in terms of duration. The validation and test sets are composed of programs broadcast on provincial TV channels. The dataset is available for academic use under a license.
Characterising multimedia content with relevant, reliable and discriminating tags is vital for multimedia information retrieval. With the rapid expansion of digital multimedia content, alternative methods to the existing explicit tagging are needed to enrich the pool of tagged content. Currently, social media websites encourage users to tag their content. However, the users' intent when tagging multimedia content does not always match information retrieval goals. A large portion of user-defined tags are either motivated by increasing the popularity and reputation of a user in an online community or based on individual and egoistic judgments. Moreover, users do not evaluate media content on the same criteria: some might tag multimedia content with words to express their emotions, while others might use tags to describe the content. For example, a picture may receive different tags based on the objects in the image, the camera with which it was taken, or the emotion a user felt when looking at it.
We study dynamic appearance models, both relightable (BRDF) and non-relightable (RGB). For both we introduce new pilot datasets, allowing such phenomena to be studied for the first time: for RGB, we provide 22 dynamic textures acquired from free online sources; for BRDFs, we further acquire a dataset of 21 flash-lit videos of time-varying materials, enabled by a simple-to-construct setup.
MVX combines realistic physical-world simulation with a differentiable, accurate ray-tracing wireless simulation, providing multi-agent and multimodal datasets for AI-driven digital twin applications in vehicular communication systems.
CausalChaos! is a dataset for causal video question answering based on Tom and Jerry cartoons. It features longer causal chains embedded in dynamic visual scenes, as well as challenging incorrect options, notably the Causal Confusion set, which contains causally confounding distractors. All these factors prove challenging for current VLMs and other traditional video question answering models.
The dataset SFU-HW-Objects-v1 contains bounding boxes and object class labels for High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) video sequences. The presented dataset contains only object labels; raw video sequences themselves can be obtained from the Joint Collaborative Team on Video Coding (JCT-VC). The dataset is used in the MPEG-VCM (Video Coding for Machines) and MPEG-FCM (Feature Coding for Machines) standardization efforts.