Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

Camouflaged Animal Dataset

The nine (moving camera) videos in this benchmark exhibit camouflaged animals that are difficult to see in a single frame, but can be detected based upon their motion across frames.

4 papers · 35 benchmarks · Videos

Multi-Label Classification Dataset Repository

For each dataset we provide a short description as well as some characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance across all labels: the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the partition methods used is given in

4 papers · 0 benchmarks · Audio, Biology, Images, Medical, Music, Texts, Videos
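The metrics above can be computed directly from a binary label matrix. Below is a minimal sketch with a tiny hypothetical matrix; it assumes the common definition of the per-label Imbalance Ratio (count of the most frequent label divided by each label's count) and a hypothetical attribute count d, not the repository's exact implementation.

```python
import numpy as np

# Hypothetical label matrix: m = 4 instances, q = 3 labels (1 = label applies).
Y = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])
m, q = Y.shape
d = 5  # hypothetical number of attributes in the feature matrix

card = Y.sum() / m               # Card: average number of labels per instance
dens = card / q                  # Dens: cardinality divided by the number of labels
labelsets = {tuple(row) for row in Y}
div = len(labelsets) / 2 ** q    # Div: distinct labelsets over possible labelsets (2^q)

# avgIR: mean Imbalance Ratio, where IR(label) = count of the most frequent
# label divided by that label's count (assumed standard definition).
counts = Y.sum(axis=0)
avg_ir = (counts.max() / counts).mean()

complexity = m * q * d           # complexity = m × q × d as in [Read 2010]

print(card, dens, div, avg_ir, complexity)
```

For this toy matrix, Card is 2.0 (eight label assignments over four instances) and Div is 0.5 (four of the eight possible labelsets appear).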

MA-52 (Micro-Action 52 dataset)

The MA-52 dataset provides a whole-body perspective including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body part labels, and encompasses a full array of realistic and natural micro-actions, accounting for 205 participants and 22,422 video instances collated from psychological interviews.

4 papers · 4 benchmarks · Videos

Human-Animal-Cartoon

Human-Animal-Cartoon (HAC) dataset consists of seven actions (‘sleeping’, ‘watching tv’, ‘eating’, ‘drinking’, ‘swimming’, ‘running’, and ‘opening door’) performed by humans, animals, and cartoon figures, forming three different domains. 3381 video clips are collected from the internet with around 1000 for each domain and three modalities are provided in the dataset: video, audio, and optical flow.

4 papers · 0 benchmarks · Audio, Videos

MMToM-QA (Multimodal Theory of Mind Question Answering)

MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. MMToM-QA consists of 600 questions. Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip. All questions have two choices. The questions are categorized into seven types, evaluating belief inference and goal inference in rich and diverse situations. Each belief inference type has 100 questions, totaling 300 belief questions; each goal inference type has 75 questions, totaling 300 goal questions. The questions are paired with 134 videos of a person looking for daily objects in household environments.

4 papers · 0 benchmarks · Images, RGB Video, RGB-D, Texts, Videos

MSRVTT-CTN (MSRVTT Causal-Temporal Narrative)

This dataset contains Causal-Temporal Narrative (CTN) annotations for the MSRVTT benchmark dataset in JSON format, with separate files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.

4 papers · 3 benchmarks · Texts, Videos

MSVD-CTN (MSVD Causal-Temporal Narrative)

This dataset contains Causal-Temporal Narrative (CTN) annotations for the MSVD benchmark dataset in JSON format, with separate files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.

4 papers · 3 benchmarks · Texts, Videos

RoadTextVQA

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem: textual cues typically appear only for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on RoadTextVQA, highlighting the significant potential for improvement in this domain and the usefulness of the dataset.

4 papers · 1 benchmark · Texts, Videos

HateMM

Hate speech has become one of the most significant issues in modern society, with implications in both the online and offline worlds. However, most work has focused primarily on text, with relatively little on images and even less on videos. Early-stage automated video moderation techniques are therefore needed to handle the videos being uploaded and keep platforms safe and healthy. To this end, we curated approximately 43 hours of videos from BitChute and manually annotated them as hate or non-hate, along with the frame spans that could explain the labeling decision.

4 papers · 2 benchmarks · Audio, Videos

M$^3$-VOS (Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation)

M$^3$-VOS is a new benchmark for Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation, designed to verify the ability of models to understand object phases. It consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios, with 205,181 collected masks and an average track duration of 14.27s. M$^3$-VOS covers 120+ categories of objects across 6 phases within 14 scenarios, encompassing 23 specific phase transitions.

4 papers · 2 benchmarks · Images, Texts, Videos

ETH BIWI Walking Pedestrians

The BIWI Walking Pedestrians dataset consists of walking pedestrians in busy scenarios from a bird's-eye view.

3 papers · 0 benchmarks · Images, Videos

HDM05

HDM05 is a MoCap (motion capture) dataset. It contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as in the ASF/AMC data format. HDM05 contains 2,337 sequences covering 130 motion classes performed by 5 different actors.

3 papers · 8 benchmarks · Images, Videos

DCASE 2014

DCASE2014 is an audio classification benchmark.

3 papers · 0 benchmarks · Audio, Videos

M-VAD Names (M-VAD Names Dataset)

The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the associations with characters' textual mentions, when available. The detection and annotation of the visual appearances of characters in each video clip of each movie was achieved through a semi-automatic approach. The released dataset contains more than 24k annotated video clips, including 63k visual tracks and 34k textual mentions, all associated with their character identities.

3 papers · 1 benchmark · Texts, Videos

MMDB (Multimodal Dyadic Behavior)

Multimodal Dyadic Behavior (MMDB) dataset is a unique collection of multimodal (video, audio, and physiological) recordings of the social and communicative behavior of toddlers. The MMDB contains 160 sessions of 3-5 minute semi-structured play interaction between a trained adult examiner and a child between the ages of 15 and 30 months. The MMDB dataset supports a novel problem domain for activity recognition, which consists of the decoding of dyadic social interactions between adults and children in a developmental context.

3 papers · 0 benchmarks · Audio, Videos

AFEW-VA (AFEW-VA Database for Valence and Arousal Estimation In-The-Wild)

The AFEW-VA dataset is a collection of highly accurate per-frame annotations of valence and arousal levels, along with per-frame annotations of 68 facial landmarks, for 600 challenging video clips. These clips are extracted from feature films and were also annotated with discrete emotion categories as part of the AFEW database.

3 papers · 0 benchmarks · Videos

Deep Fakes Dataset (inamibora)

The Deep Fakes Dataset is a collection of "in the wild" portrait videos for deepfake detection. The videos in the dataset are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations, totalling 142 videos, 32 minutes, and 17 GB. Synthetic videos are matched with their original counterparts when possible.

3 papers · 0 benchmarks · Videos

Composable activities dataset

The Composable Activities dataset consists of 693 videos that contain activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence is captured using a Microsoft Kinect sensor, along with estimated positions of relevant body joints.

3 papers · 0 benchmarks · RGB-D, Videos

Ford Campus Vision and Lidar Data Set

Ford Campus Vision and Lidar Data Set is a dataset collected by an autonomous ground vehicle testbed, based upon a modified Ford F-250 pickup truck. The vehicle is outfitted with a professional (Applanix POS LV) and consumer (Xsens MTI-G) Inertial Measurement Unit (IMU), a Velodyne 3D-lidar scanner, two push-broom forward-looking Riegl lidars, and a Point Grey Ladybug3 omnidirectional camera system.

3 papers · 0 benchmarks · LiDAR, Point cloud, Videos

MERL Shopping

MERL Shopping is a dataset for training and testing action detection algorithms. The MERL Shopping Dataset consists of 106 videos, each of which is a sequence about 2 minutes long. The videos are from a fixed overhead camera looking down at people shopping in a grocery store setting. Each video contains several instances of the following 5 actions: "Reach To Shelf" (reach hand into shelf), "Retract From Shelf" (retract hand from shelf), "Hand In Shelf" (extended period with hand in the shelf), "Inspect Product" (inspect product while holding it in hand), and "Inspect Shelf" (look at shelf while not touching or reaching for the shelf).

3 papers · 0 benchmarks · Videos
Page 29 of 51