Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets (Videos filter applied)

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

EgoPAT3D-DT

3 papers · 0 benchmarks · 3D, Videos

GroOT (Grounded Multiple Object Tracking)

A recent trend in vision problems is to use natural language captions to describe objects of interest, an approach that can overcome some limitations of traditional methods relying on bounding boxes or category annotations. The accompanying paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. GroOT is a dataset for this Grounded Multiple Object Tracking task: it contains videos with various types of objects and corresponding textual captions, totaling 256K words, that describe their appearance and actions in detail. To cover a diverse range of scenes, GroOT was created using the official videos and bounding box annotations from the MOT17, TAO, and MOT20 datasets.

3 papers · 0 benchmarks · Texts, Tracking, Videos
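
As a rough sketch of what a grounded-tracking annotation ties together, the snippet below pairs per-frame bounding boxes with a free-text caption for each track and runs a toy "Type-to-Track" lookup. All field names and the string-matching query are illustrative assumptions, not GroOT's actual schema or method.

```python
from dataclasses import dataclass, field

@dataclass
class GroundedTrack:
    """One tracked object: per-frame boxes plus a natural-language caption.

    Hypothetical layout for illustration; not GroOT's release schema.
    """
    track_id: int
    caption: str                                # free-text object description
    boxes: dict = field(default_factory=dict)   # frame_index -> (x, y, w, h)

def query_tracks(tracks, typed_query):
    """Toy 'Type-to-Track' lookup: return tracks whose caption mentions every
    word of the typed query. A real system would use a learned
    vision-language model rather than string matching."""
    words = typed_query.lower().split()
    return [t for t in tracks if all(w in t.caption.lower() for w in words)]

tracks = [
    GroundedTrack(0, "a man in a red jacket riding a bike", {0: (10, 20, 50, 120)}),
    GroundedTrack(1, "a woman walking a small dog", {0: (200, 40, 40, 110)}),
]
print([t.track_id for t in query_tracks(tracks, "red jacket")])  # [0]
```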

VATEX Adverbs

VATEX Adverbs is a subset of VATEX with extracted verb-adverb annotations. It contains 34 adverbs appearing across 135 actions, forming 1,550 unique action-adverb pairs in 14,617 video clips.

3 papers · 9 benchmarks · Actions, Videos

ActivityNet Adverbs

ActivityNet Adverbs is a subset of the ActivityNet dataset with extracted verb-adverb annotations. It contains 20 adverbs appearing across 114 actions, forming 643 unique action-adverb pairs in 3,099 video clips.

3 papers · 9 benchmarks · Actions, Videos

MSR-VTT Adverbs

MSR-VTT Adverbs is a subset of MSR-VTT with extracted verb-adverb annotations. It contains 18 adverbs appearing across 106 actions, forming 464 unique action-adverb pairs in 1,824 video clips.

3 papers · 9 benchmarks · Actions, Videos
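
The three Adverbs subsets above share one shape: each clip carries an (action, adverb) label, and the headline numbers count distinct actions, adverbs, and observed pairs. A minimal sketch of that bookkeeping over made-up annotations (the tuple layout is an assumption, not the datasets' release format):

```python
# Each annotation: (clip_id, action, adverb). Made-up examples; the real
# releases ship their own file formats.
annotations = [
    ("clip_0001", "cut", "quickly"),
    ("clip_0002", "cut", "slowly"),
    ("clip_0003", "stir", "quickly"),
    ("clip_0004", "stir", "quickly"),   # same pair, different clip
]

actions = {a for _, a, _ in annotations}
adverbs = {adv for _, _, adv in annotations}
pairs   = {(a, adv) for _, a, adv in annotations}

# Observed pairs grow much more slowly than len(actions) * len(adverbs):
# e.g. VATEX Adverbs has 34 * 135 = 4,590 possible combinations, but only
# 1,550 unique pairs actually occur.
print(len(actions), len(adverbs), len(pairs))  # 2 2 3
```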

Shot2Story20K

A short video clip may contain the progression of multiple events and an interesting storyline. To understand the story, a viewer needs to capture both the event in every shot and the associations between shots. Shot2Story20K is a multi-shot video benchmark built around this idea, pairing videos with shot-level captions and overall video summaries.

3 papers · 16 benchmarks · Audio, Texts, Videos
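
To make the shot/story distinction concrete, here is a toy per-shot annotation together with the trivial baseline of concatenating shot captions. The record layout is hypothetical, and the benchmark's point is precisely that a good summary must do more than this.

```python
# Hypothetical per-shot records; Shot2Story's actual files differ.
shots = [
    {"start": 0.0, "end": 3.2,  "caption": "A chef lays out ingredients."},
    {"start": 3.2, "end": 7.8,  "caption": "He chops onions on a board."},
    {"start": 7.8, "end": 12.1, "caption": "The onions sizzle in a pan."},
]

def naive_summary(shots):
    """Concatenate shot captions in temporal order -- a trivial baseline.
    A good story-level summary must *associate* the shots into one
    storyline rather than merely list them."""
    return " ".join(s["caption"] for s in sorted(shots, key=lambda s: s["start"]))

print(naive_summary(shots))
```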

SoccerNet-GSR (SoccerNet Game State Reconstruction)

The SoccerNet Game State Reconstruction task is a novel high-level computer vision task specific to sports analytics. It aims at recognizing the state of a sports game, i.e., identifying and localizing all individuals on the field (players, referees, etc.) from raw input video. SoccerNet-GSR is composed of 200 video sequences of 30 seconds each, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch with their respective role, team, and jersey number.

3 papers · 0 benchmarks · RGB Video, Tracking, Videos
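
A game state, as described above, is essentially a per-frame list of localized individuals with identity attributes. Below is a minimal sketch of such a record in pitch coordinates; the field names are assumptions, not SoccerNet-GSR's release schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AthleteState:
    """One individual's state in a single frame, in pitch coordinates.

    Illustrative only; field names are not SoccerNet-GSR's actual schema.
    """
    x_m: float                     # position along the pitch length, metres
    y_m: float                     # position along the pitch width, metres
    role: str                      # "player", "goalkeeper", "referee", ...
    team: Optional[str] = None     # "left"/"right"; None for referees
    jersey: Optional[int] = None   # jersey number, if visible

frame_state = [
    AthleteState(x_m=34.2, y_m=20.1, role="player", team="left", jersey=9),
    AthleteState(x_m=50.0, y_m=34.0, role="referee"),
]
# A full "game state" is one such list per frame, reconstructed from raw
# video via pitch localization and camera calibration.
print(len(frame_state), "individuals localized in this frame")
```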

Ego4D-HCap

Ego4D-HCap is a hierarchical video captioning dataset with a three-tier hierarchy of captions: short clip-level captions, medium-length video-segment descriptions, and long-range video-level summaries. To construct Ego4D-HCap, we leverage Ego4D, the largest publicly available egocentric video dataset. While Ego4D comes with time-stamped atomic captions and video-segment descriptions spanning up to 5 minutes, it lacks video-level summaries for longer durations. To address this, we annotate a subset of 8,267 Ego4D videos with long-range video summaries, each spanning up to two hours, completing the three-level hierarchy of captions.

3 papers · 0 benchmarks · Texts, Videos
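
A sketch of the three-tier caption hierarchy as nested records, with a helper that flattens it to one chosen granularity. The nesting and key names are illustrative, not the dataset's actual file layout.

```python
# Hypothetical nesting for illustration; Ego4D-HCap's files are organized
# differently.
video = {
    "summary": "The camera wearer spends the afternoon gardening.",  # video level
    "segments": [
        {
            "description": "They prepare a flower bed.",             # segment level
            "clips": [
                {"t": 12.0, "caption": "C digs a hole with a trowel."},      # clip level
                {"t": 25.5, "caption": "C places a seedling in the hole."},
            ],
        },
    ],
}

def captions_at_level(video, level):
    """Flatten the hierarchy to one chosen granularity."""
    if level == "video":
        return [video["summary"]]
    if level == "segment":
        return [s["description"] for s in video["segments"]]
    return [c["caption"] for s in video["segments"] for c in s["clips"]]

print(captions_at_level(video, "clip"))
```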

MOMA-LRG (Multi-Object Multi-Actor activity parsing with Language-Refined Graphs)

A dataset dedicated to multi-object, multi-actor activity parsing.

3 papers · 4 benchmarks · Texts, Videos

GraSP (Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies)

The Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset is a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks such as surgical phase and step recognition and short-term tasks including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks and highlight the varying granularity requirements of each task.

3 papers · 1 benchmark · Medical, Videos
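
To illustrate the multi-granularity labeling, here is a schematic per-frame label record spanning the long-term (phase, step) and short-term (instruments, atomic actions) levels. This is a stand-in: the real release stores phases and steps as temporal intervals and instruments as pixel-level segmentation masks, and the label names below are invented.

```python
from dataclasses import dataclass, field

@dataclass
class FrameLabels:
    """Labels for one frame across GraSP's levels of granularity.

    Schematic stand-in; not the release's actual storage format.
    """
    phase: str                                           # long-term: surgical phase
    step: str                                            # long-term: subdivision of a phase
    instruments: list = field(default_factory=list)      # short-term: visible tools
    atomic_actions: list = field(default_factory=list)   # short-term: ongoing actions

# Hypothetical example (label names invented for illustration):
frame = FrameLabels(phase="phase_3", step="step_7",
                    instruments=["needle_driver"], atomic_actions=["suture"])
print(frame.phase, frame.atomic_actions)
```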

MM-OR

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for multimodal OR understanding.

3 papers · 7 benchmarks · 3D, Audio, Graphs, Images, Medical, Point cloud, RGB-D, Speech, Texts, Time series, Videos
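
A semantic scene graph of the kind MM-OR is annotated with can be modeled as (subject, predicate, object) triplets over detected OR entities. The entity and predicate names below are made up for illustration; MM-OR defines its own label sets.

```python
# A semantic scene graph as (subject, predicate, object) triplets over
# entities detected in the OR. Names invented for illustration.
scene_graph = [
    ("head_surgeon", "holding", "drill"),
    ("head_surgeon", "operating_on", "patient"),
    ("circulator",   "touching", "instrument_table"),
]

def neighbors(graph, entity):
    """All (predicate, other-entity) pairs an entity participates in."""
    out = [(p, o) for s, p, o in graph if s == entity]
    out += [(p, s) for s, p, o in graph if o == entity]
    return out

print(neighbors(scene_graph, "head_surgeon"))
```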

MPEblink

A pioneering eyeblink detection dataset characterized by three key features: (1) samples with multiple human instances, (2) unconstrained in-the-wild scenarios, and (3) untrimmed videos. These attributes make the dataset more challenging and better aligned with real-world scenarios.

3 papers · 1 benchmark · Videos

Udacity

The Udacity dataset is mainly composed of video frames captured on urban roads. It provides a total of 404,916 video frames for training and 5,614 for testing. The dataset is challenging due to severe lighting changes, sharp road curves, and busy traffic.

2 papers · 3 benchmarks · Images, Videos

Skeleton-Mimetics

A dataset derived from the recently introduced Mimetics dataset.

2 papers · 10 benchmarks · Videos

CholecT40 (Cholecystectomy Action Triplet)

CholecT40 is the first endoscopic dataset introduced to enable research on fine-grained action recognition in laparoscopic surgery.

2 papers · 2 benchmarks · Images, RGB Video, Videos
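
CholecT40 describes each fine-grained action as an ⟨instrument, verb, target⟩ triplet. The sketch below shows a plausible multi-label frame annotation in that spirit; the example class names are not the dataset's exact label vocabulary.

```python
# CholecT40 annotates fine-grained actions as <instrument, verb, target>
# triplets. Class names below are illustrative, not the exact label set.
frame_annotation = {
    "frame": 1042,
    "triplets": [
        ("grasper", "retract", "gallbladder"),
        ("hook", "dissect", "cystic_duct"),
    ],
}

# Several triplets can be active in the same frame, which makes the task a
# multi-label recognition problem.
for instrument, verb, target in frame_annotation["triplets"]:
    print(f"{instrument} {verb} {target}")
```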

Stanford-ECM

Stanford-ECM is an egocentric multimodal dataset comprising about 27 hours of egocentric video augmented with heart rate and acceleration data. Individual videos range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30 Hz. The phone was equipped with a wide-angle lens, enlarging the horizontal field of view from 45 degrees to about 64 degrees. A wrist-worn sensor captured the heart rate every 5 seconds. The phone and heart rate monitor were time-synchronized through Bluetooth, and all data were stored in the phone's storage. Piecewise cubic polynomial interpolation was used to fill any gaps in the heart rate data, and all data were aligned to the millisecond level at 30 Hz.

2 papers · 0 benchmarks · Audio, Videos
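
A minimal sketch of the gap-filling and alignment step described above: piecewise cubic interpolation of the sparse heart-rate stream onto the 30 Hz video timeline. PCHIP is used here as one piecewise-cubic choice (the description does not specify the exact interpolant), and the data are made up.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Heart rate arrives roughly every 5 s, with occasional gaps (made-up data;
# the 10 s -> 25 s span simulates a dropout).
hr_t   = np.array([0.0, 5.0, 10.0, 25.0, 30.0])   # seconds
hr_bpm = np.array([72.0, 74.0, 75.0, 80.0, 79.0])

# Piecewise cubic interpolation fills the gap, then we resample onto the
# 30 Hz video/accelerometer timeline so all modalities are aligned.
video_t = np.arange(0.0, 30.0, 1.0 / 30.0)        # 30 fps timestamps
hr_30hz = PchipInterpolator(hr_t, hr_bpm)(video_t)

print(hr_30hz.shape)  # (900,) -- one heart-rate value per video frame
```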

Headcam

This dataset contains panoramic video captured from a helmet-mounted camera while riding a bike through suburban Northern Virginia.

2 papers · 0 benchmarks · Videos

VideoForensicsHQ

VideoForensicsHQ is a benchmark dataset for face video forgery detection that provides high-quality visual manipulations. It is one of the first face video manipulation benchmarks that also contains audio, complementing existing datasets along a new, challenging dimension. Its manipulations are at much higher video quality and resolution than those in other datasets, and are provably much harder for humans to detect.

2 papers · 0 benchmarks · Videos

EGOK360

Contains annotations of human activities with their sub-actions, e.g., the activity Ping-Pong with four sub-actions: pickup-ball, hit, bounce-ball, and serve.

2 papers · 0 benchmarks · Videos

MCAD (Multi-Camera Action Dataset)

Designed to evaluate the open-view classification problem in a surveillance environment. In total, MCAD contains 14,298 action samples from 18 action categories, performed by 20 subjects and independently recorded with 5 cameras.

2 papers · 0 benchmarks · Videos
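
One common way to realize an open-view protocol with multi-camera data is a leave-one-camera-out split, so test viewpoints are never seen during training. The sketch below builds a toy sample index and such a split; this is an illustrative protocol, not necessarily MCAD's official evaluation setup.

```python
import itertools

# Toy MCAD-style sample index: (subject, camera, action, clip_path) tuples.
# Made-up entries; the real dataset ships its own file lists.
samples = [
    (s, c, a, f"S{s:02d}_C{c}_A{a:02d}.avi")
    for s, c, a in itertools.product(range(1, 21), range(1, 6), range(1, 19))
]

def cross_view_split(samples, test_camera):
    """Leave-one-camera-out: train on four cameras, test on the held-out
    fifth, so test viewpoints never appear in training."""
    train = [x for x in samples if x[1] != test_camera]
    test  = [x for x in samples if x[1] == test_camera]
    return train, test

train, test = cross_view_split(samples, test_camera=5)
print(len(train), len(test))  # 1440 360 in this toy index
```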
Page 32 of 51