Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

SSv2-Spatio-Temporal (Something-Something v2-Spatio-Temporal)

We use the Something-Something v2 dataset to obtain generation prompts and ground-truth masks from real action videos. We filter these down to a set of 295 prompts; the details of this filtering are in the "Peekaboo: Interactive Video Generation via Masked-Diffusion" paper. We then use an off-the-shelf OWL-ViT-large open-vocabulary object detector to obtain bounding-box (bbox) annotations of the object in the videos. The resulting set of bbox-prompt pairs over real-world videos serves as a test bed for both the quality and the controllability of methods for generating realistic videos with spatio-temporal control.

1 paper · 0 benchmarks · Interactive, Texts, Tracking, Videos
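
The entry above obtains bboxes with OWL-ViT-large. A minimal sketch, assuming the Hugging Face transformers implementation; the frame file, text query, and score threshold are illustrative, and the paper's exact pipeline may differ:

    import torch
    from PIL import Image
    from transformers import OwlViTProcessor, OwlViTForObjectDetection

    # OWL-ViT-large open-vocabulary detector, as named in the entry above.
    processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

    image = Image.open("frame_000.jpg")   # illustrative video frame
    texts = [["a photo of a cup"]]        # illustrative object query
    inputs = processor(text=texts, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert model outputs to (x0, y0, x1, y1) boxes in image coordinates.
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes)
    boxes = results[0]["boxes"]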

Social-IQ 2.0


1 paper · 0 benchmarks · Audio, Texts, Videos

RMOT-223

In this dataset, various objects are arranged on a white table. A UR5e robot picks and places a target object specified in the title of the video/image sequence. Videos under the auto- folders were collected with automatic operation of the robot; videos under the human- folders were collected via tele-operation. Ground-truth tracking bounding boxes are generated with STARK, and when the target exits the camera frame the bounding box is set to [-1, -1, -1, -1], indicating that the target is not visible.

1 paper · 0 benchmarks · Images, Videos
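
Consumers of these annotations need to handle the [-1, -1, -1, -1] sentinel explicitly. A minimal sketch, assuming one whitespace-separated box per line in a text file (this file layout is an assumption, not the dataset's documented format):

    def load_boxes(path):
        """Parse per-frame boxes; [-1, -1, -1, -1] marks the target out of frame."""
        boxes = []
        with open(path) as f:
            for line in f:
                vals = [float(v) for v in line.split()]
                # The sentinel means the target is not visible in this frame.
                boxes.append(None if vals == [-1.0, -1.0, -1.0, -1.0] else vals)
        return boxes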

NES-VMDB

NES-VMDB is a dataset containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), encompassing 5,278 music pieces from 397 NES games.

1 paper · 0 benchmarks · MIDI, Videos
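
Since the soundtracks ship in symbolic MIDI format, they can be inspected with a standard MIDI library. A minimal sketch using pretty_midi, with an illustrative file name:

    import pretty_midi

    pm = pretty_midi.PrettyMIDI("nes_track.mid")  # illustrative file name
    print(f"duration: {pm.get_end_time():.1f}s, instruments: {len(pm.instruments)}")
    for inst in pm.instruments:
        # NES-MDB-style exports typically keep the NES channels on separate tracks.
        print(inst.name, len(inst.notes), "notes")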

IDD-X

Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for explainable decision-making and safe, efficient navigation. This is particularly critical in developing countries, where traffic is often dense and unstructured, with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 object categories and 19 explanation label categories. The dataset also incorporates rear-view information to provide a more complete representation of the driving environment. We also introduce custo…

1 paper · 0 benchmarks · Texts, Videos

MINDS-Libras

A Brazilian Sign Language (Libras) dataset with 20 signs, intended as a benchmark for sign language and gesture recognition.

1 paper · 4 benchmarks · Videos

LIBRAS-UFOP

A multimodal Brazilian sign language (LIBRAS-UFOP) dataset of minimal pairs, recorded with a Microsoft Kinect sensor.

1 paper · 4 benchmarks · Videos

SoccerNet-Echoes (SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset)

SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset.

1 paper · 0 benchmarks · Audio, Texts, Videos

CinePile: A Long Video Question Answering Dataset and Benchmark

CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.

1 paper · 2 benchmarks · Texts, Videos
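
If the dataset is mirrored on the Hugging Face Hub, it can be loaded with the datasets library. The repo id below is an assumption; check the project page for the official host:

    from datasets import load_dataset

    # Repo id is an assumption -- the official release may live elsewhere.
    ds = load_dataset("tomg-group-umd/cinepile")
    print(ds)                # expect ~300k train and ~5k test examples
    sample = ds["train"][0]  # one question-answering data point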

Vid2RealHRI online video and results dataset (Community embedded robotics: Vid2RealHRI online video and perceived social intelligence in human-robot encounters dataset)

This dataset was gathered during the Vid2RealHRI study of humans' perception of robot intelligence in the context of an incidental human-robot encounter. It contains participants' questionnaire responses to four video study conditions: Baseline, Verbal, Body language, and Body language + Verbal. The videos depict a scenario in which a pedestrian incidentally encounters a quadruped robot trying to enter a building; depending on the condition, the robot uses verbal commands and/or body language to ask the pedestrian for help. The differences between conditions were manipulated using the robot's verbal and expressive-movement functionalities.

1 paper · 0 benchmarks · Images, Tabular, Texts, Videos

DADE (Driving Agents in Dynamic Environments)

The DADE dataset, short for Driving Agents in Dynamic Environments, is a synthetic dataset for training and evaluating semantic segmentation methods for autonomous driving agents navigating dynamic environments and changing weather conditions.

1 paper · 0 benchmarks · Images, RGB Video, Videos

Mediapi-RGB

Mediapi-RGB is a bilingual corpus of French Sign Language (LSF) and written French in the form of subtitled videos, accompanied by complementary data (various representations, segmentation, vocabulary, etc.). It can be used in academic research for a wide range of tasks, such as training or evaluating sign language (SL) extraction, recognition or translation models.

1 paper · 1 benchmark · Texts, Videos

PedSynth


1 paper · 0 benchmarks · Videos

VCG+112K (Video Instruction Dataset 112K)

Video-ChatGPT introduced the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of that annotation process, we present the VCG+112K dataset, developed through an improved annotation pipeline. Our approach improves the accuracy and quality of instruction-tuning pairs by improving keyframe extraction, leveraging SoTA large multimodal models (LMMs) for detailed descriptions, and refining the instruction generation strategy.

1 paper · 0 benchmarks · Texts, Videos
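
The entry mentions improved keyframe extraction. As a generic illustration only (not the VCG+112K pipeline), a naive keyframe picker based on mean absolute frame difference with OpenCV might look like this; the threshold is an illustrative assumption:

    import cv2
    import numpy as np

    def extract_keyframes(video_path, diff_thresh=30.0):
        """Keep frames that differ enough from the last kept frame (grayscale)."""
        cap = cv2.VideoCapture(video_path)
        keyframes, prev = [], None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
            if prev is None or np.abs(gray - prev).mean() > diff_thresh:
                keyframes.append(frame)
                prev = gray
        cap.release()
        return keyframes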

MINT (a Multi-modal Image and Narrative Text Dubbing Dataset)

Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent-video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audi…

1 paper · 0 benchmarks · Audio, Images, Texts, Videos

VSTaR-1M

VSTaR-1M is a 1M instruction-tuning dataset, created using Video-STaR, with the following source datasets:

  • Kinetics700
  • STAR-benchmark
  • FineDiving

1 paper · 0 benchmarks · Texts, Videos

MuseChat Dataset (MuseChat: A Conversational Music Recommendation System for Videos (CVPR 2024 Highlight Paper))

Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring users' preferences. Their inability to interact with users for further refinement or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user's preferences, as input and retrieves music appropriate to the context. The reasoning module, equipped with the power of a Large Language Model (Vicuna-7B) and extended to multi-modal inputs, provides a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build…

1 paper · 0 benchmarks · Audio, Texts, Videos
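
The recommendation module is described as retrieving music that matches a video's context. A minimal sketch of such retrieval, assuming precomputed embeddings (MuseChat's actual encoders are described in the paper, not reproduced here):

    import numpy as np

    def recommend(video_emb, music_embs, k=5):
        """Rank candidate tracks by cosine similarity to a video embedding."""
        v = video_emb / np.linalg.norm(video_emb)
        m = music_embs / np.linalg.norm(music_embs, axis=1, keepdims=True)
        scores = m @ v                  # cosine similarity per track
        return np.argsort(-scores)[:k]  # indices of the top-k tracks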

SRI-APPROVE Fine-Grained Video Classification

APPROVE consists of curated YouTube videos annotated for educational content: 193 hours of expert-annotated video spanning 19 classes (7 literacy codes, 11 math codes, and background), with each video carrying approximately 3 labels on average.

1 paper · 2 benchmarks · Videos

Sieve & Swap - HowTo100M (Cooking)

Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than…

1 paper · 0 benchmarks · Texts, Videos
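
A rough sketch of the Sieve-&-Swap idea under simplifying assumptions: `embed` stands for any sentence encoder returning L2-normalized vectors, and the similarity threshold is illustrative; the paper's actual filtering and matching are more involved:

    import numpy as np

    def sieve_and_swap(transcripts, recipe_steps, embed, keep_thresh=0.5):
        """Sieve: drop transcript sentences dissimilar to every recipe step.
        Swap: replace each survivor with its nearest human-written step."""
        t = embed(transcripts)   # (n, d), assumed L2-normalized
        r = embed(recipe_steps)  # (m, d), assumed L2-normalized
        sims = t @ r.T           # cosine similarity matrix
        best = sims.argmax(axis=1)
        keep = sims.max(axis=1) >= keep_thresh
        return [recipe_steps[j] for j, k in zip(best, keep) if k]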

MentalHAD

Collected with a HIKVISION DS-E11 USB camera, MentalHAD covers four abnormal actions (climbing walls, hitting windows, climbing, and hitting) and six normal actions (crouching, standing, sitting, hand waving, walking, and running). It includes about 274 minutes of RGB video (493,504 frames) at 30 FPS, spanning three scenes, five subjects, and seven scene-subject pairs. The videos are organized into 69 sequences, each containing a single action and lasting about 2-5 minutes.

1 paper · 0 benchmarks · Videos
Page 45 of 51