Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)
Active filter: Videos

1,019 dataset results

WITS (Why Is This Sarcastic?)

This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. It consists of transcribed sarcastic dialogues from 55 episodes of the show, along with the accompanying audio and video signals. The dataset was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation that spells out the intended irony of a given sarcastic dialogue. Each data instance in WITS is associated with a corresponding video, audio track, and textual transcript in which the last utterance is sarcastic. All of the final selected explanations share a common set of attributes.

5 papers · 9 benchmarks · Audio, Texts, Videos
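
One way to picture a single WITS instance is as a record pairing the dialogue and its aligned audio/video with the target explanation. The sketch below is only an illustration; the field names and layout are my assumptions, not the dataset's actual schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class WITSInstance:
        # Illustrative fields only; the real WITS release may organise these differently.
        dialogue_id: str
        utterances: List[str]   # transcribed dialogue; the final utterance is the sarcastic one
        video_path: str         # aligned video clip
        audio_path: str         # aligned audio track
        explanation: str        # natural-language explanation of the sarcasm (the SED target)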

TRECVID-AVS16 (IACC.3)

Internet Archive videos (IACC.3) released under Creative Commons licenses. The shared test collection for the TRECVID AVS 2016-2018 tasks contains 335,944 web video clips (about 600 hours).

5 papers · 1 benchmark · Texts, Videos

TRECVID-AVS17 (IACC.3)

Internet Archive videos (IACC.3) released under Creative Commons licenses. The shared test collection for the TRECVID AVS 2016-2018 tasks contains 335,944 web video clips (about 600 hours).

5 papers · 1 benchmark · Texts, Videos

TRECVID-AVS18 (IACC.3)

Internet Archive videos (IACC.3) released under Creative Commons licenses. The shared test collection for the TRECVID AVS 2016-2018 tasks contains 335,944 web video clips (about 600 hours).

5 papers · 1 benchmark · Texts, Videos

VCSL (Video Copy Segment Localization)

VCSL (Video Copy Segment Localization) is a new, comprehensive, segment-level annotated video copy dataset. Compared with existing copy detection datasets, which are restricted by either video-level annotation or small scale, VCSL not only has two orders of magnitude more segment-level labelled data, with 160k realistic video copy pairs containing more than 280k localized copied segment pairs, but also covers a variety of video categories and a wide range of video durations. All the copied segments inside each collected video pair are manually extracted and accompanied by precisely annotated starting and ending timestamps.

5 papers · 0 benchmarks · Videos
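
Since VCSL's copied segments carry start and end timestamps in both videos of a pair, segment-level evaluation boils down to comparing predicted and annotated time intervals. The sketch below is a generic temporal-overlap check under assumed field names, not the official VCSL annotation format or evaluation protocol.

    from dataclasses import dataclass

    @dataclass
    class CopiedSegmentPair:
        # Assumed structure: one copied segment localized in both the query and the reference video.
        query_video: str
        ref_video: str
        query_start: float  # seconds
        query_end: float
        ref_start: float
        ref_end: float

    def temporal_iou(pred, gt):
        """Intersection-over-union of two (start, end) intervals in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0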

ComPhy (Compositional Physical Reasoning Dataset)

Compositional Physical Reasoning (ComPhy) is a dataset for understanding object-centric and relational physical properties hidden from visual appearances. For a given set of objects, the dataset includes a few videos of them moving and interacting under different initial conditions. A model is evaluated on its ability to unravel the compositional hidden properties, such as mass and charge, and to use this knowledge to answer a set of questions posed about one of the videos.

5 papers · 0 benchmarks · Videos

TCIA 4D-Lung

This data collection consists of images acquired during chemoradiotherapy of 20 locally advanced, non-small cell lung cancer patients. The images include four-dimensional (4D) fan-beam CT (4D-FBCT) and 4D cone-beam CT (4D-CBCT). All patients underwent concurrent radiochemotherapy to a total dose of 64.8-70 Gy using daily 1.8 or 2 Gy fractions.

5 papers · 0 benchmarks · Biomedical, Images, Videos

Vi-Fi Multi-modal Dataset

The Vi-Fi dataset is a large-scale multi-modal dataset intended to facilitate research on vision-wireless systems. It consists of vision, wireless, and smartphone motion sensor data of multiple participants and passer-by pedestrians in both indoor and outdoor scenarios. The vision modality includes RGB-D video from a mounted camera, while the wireless modality comprises smartphone data from the participants, including WiFi FTM and IMU measurements.

5 papers · 3 benchmarks · RGB Video, RGB-D, Time series, Videos
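
One way to picture a Vi-Fi sample is as a timestamped record that ties the camera view to each participant's phone readings. The structure below is a guess for illustration only, not the dataset's released format.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class PhoneReading:
        ftm_distance_m: float     # WiFi FTM range estimate, in metres (assumed field)
        imu_accel: List[float]    # 3-axis accelerometer sample
        imu_gyro: List[float]     # 3-axis gyroscope sample

    @dataclass
    class ViFiFrame:
        timestamp: float                 # seconds since the start of the sequence
        rgb_path: str                    # RGB frame from the mounted camera
        depth_path: str                  # corresponding depth frame
        phones: Dict[str, PhoneReading]  # readings keyed by participant / phone ID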

CoP3D

CoP3D is a collection of crowd-sourced videos showing around 4,200 distinct pets. It is a large-scale dataset for benchmarking non-rigid 3D reconstruction "in the wild".

5 papers · 0 benchmarks · Videos

VGGSound-Sparse

The dataset builds on VGGSound, which consists of 10-second clips collected from YouTube for 309 sound classes. A subset of ‘temporally sparse’ classes is selected using the following procedure: 5-15 videos are randomly picked from each of the 309 VGGSound classes and manually annotated as to whether audio-visual cues are only sparsely available. As a result, 12 classes (∼4%) are selected, amounting to 6.5k and 0.6k videos in the train and test sets, respectively. The classes include 'dog barking', 'chopping wood', 'lion roaring', 'skateboarding', etc.

5 papers · 0 benchmarks · Actions, Audio, Images, Videos
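
The selection procedure described above can be summarised in a few lines. The sketch below uses made-up helper names (vggsound_classes, annotate_sparsity) and is meant only to illustrate the logic, not to reproduce the authors' code.

    import random

    def select_sparse_classes(vggsound_classes, annotate_sparsity, videos_per_class=(5, 15)):
        """For each VGGSound class, inspect a handful of random clips and keep the class
        only if the manual annotation says its audio-visual cues are temporally sparse."""
        sparse_classes = []
        for cls, clips in vggsound_classes.items():            # 309 classes in total
            k = random.randint(*videos_per_class)               # 5-15 clips per class
            sample = random.sample(clips, min(k, len(clips)))
            if all(annotate_sparsity(clip) for clip in sample): # manual judgement per clip
                sparse_classes.append(cls)                      # 12 classes survive (~4%)
        return sparse_classes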

RF100 (Roboflow 100)

The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assess the degree of generalization learned by the model.

5 papers · 1 benchmark · Images, Videos

BRACE (The Breakdancing Competition Dataset for Dance Motion Synthesis)

BRACE is a dataset for audio-conditioned dance motion synthesis that challenges common assumptions made for this task.

5 papers · 30 benchmarks · Actions, Audio, Point cloud, Videos

Distress Analysis Interview Corpus/Wizard-of-Oz set (DAIC-WOZ)

The Distress Analysis Interview Corpus/Wizard-of-Oz set (DAIC-WOZ) [24, 25] comprises voice and text samples from 189 interviewed participants, together with their responses to the PHQ-8 depression screening questionnaire. The dataset is commonly used in research on text-based detection, voice-based detection, and multi-modal architectures.

5 papers · 0 benchmarks · Audio, Texts, Videos
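
Work on DAIC-WOZ typically derives labels from the PHQ-8 questionnaire; a widely used convention (an assumption stated here, not something fixed by the dataset itself) treats a total score of 10 or more as the depressed class. A minimal sketch:

    def phq8_to_label(item_scores, threshold=10):
        """Sum the eight PHQ-8 item scores (each 0-3) and binarise with a cut-off.
        The threshold of 10 is the commonly used convention, not a property of the dataset."""
        assert len(item_scores) == 8 and all(0 <= s <= 3 for s in item_scores)
        total = sum(item_scores)
        return total, int(total >= threshold)   # (severity score, binary depression label)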

WEAR (WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition)

WEAR is an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). The dataset comprises data from 22 participants performing a total of 18 different workout activities, with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 11 different outdoor locations. Unlike previous egocentric datasets, WEAR provides a challenging prediction scenario marked by purposely introduced activity variations as well as an overall small information overlap across modalities.

5 papers · 0 benchmarks · Time series, Videos
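
Because the inertial streams in WEAR are untrimmed, HAR models are usually fed fixed-length sliding windows. The snippet below is a generic windowing sketch with arbitrary window and stride values, not WEAR's official preprocessing.

    import numpy as np

    def sliding_windows(signal, window_len, stride):
        """Cut an untrimmed (T, channels) acceleration stream into overlapping windows."""
        windows = []
        for start in range(0, len(signal) - window_len + 1, stride):
            windows.append(signal[start:start + window_len])
        return np.stack(windows) if windows else np.empty((0, window_len, signal.shape[1]))

    # e.g. 1-second windows with 50% overlap at a hypothetical 50 Hz sampling rate:
    # batches = sliding_windows(acc_data, window_len=50, stride=25)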

RoCoG-v2 (Robot Control Gestures)

RoCoG-v2 (Robot Control Gestures) is a dataset intended to support the study of synthetic-to-real and ground-to-air video domain adaptation. It contains over 100K synthetically-generated videos of human avatars performing gestures from seven (7) classes. It also provides videos of real humans performing the same gestures from both ground and air perspectives.

5 papers · 2 benchmarks · Videos

LaRS (Lakes, Rivers and Seas Dataset)

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset.

5 papers · 27 benchmarks · Images, Videos

BUP20 (Sweet Pepper 2020 University of Bonn)

Video sequences from a glasshouse environment at Campus Kleinaltendorf (CKA), University of Bonn, captured by PATHoBot, a glasshouse monitoring robot.

5 papers · 0 benchmarks · Images, RGB Video, RGB-D, Videos

EgoPAT3D

5 papers · 0 benchmarks · 3D, Videos

VidChapters-7M

VidChapters-7M is a dataset of 817K user-chaptered videos containing 7M chapters in total. It is created automatically and at scale by scraping user-annotated chapters from online videos, without any additional manual annotation. The dataset is designed for training and evaluating models for video chapter generation (with or without ground-truth boundaries) and video chapter grounding, as well as for video-language pretraining.

5 papers · 16 benchmarks · Texts, Videos
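
A user-chaptered video in this setting is essentially an ordered list of (start time, title) pairs. The structure below is an illustrative guess at such a record and at how chapter boundaries split a video, not the dataset's released schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Chapter:
        start: float   # chapter start time in seconds
        title: str     # user-written chapter title

    @dataclass
    class ChapteredVideo:
        video_id: str
        duration: float
        chapters: List[Chapter]   # sorted by start time; each chapter ends where the next begins

        def segments(self):
            """Yield (start, end, title) triples, closing the last chapter at the video end."""
            for i, ch in enumerate(self.chapters):
                end = self.chapters[i + 1].start if i + 1 < len(self.chapters) else self.duration
                yield ch.start, end, ch.title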

Test-of-Time (Test of Time Synthetic Video Dataset)

The goal of this dataset is to probe video-language models for understanding of simple temporal relations like "before" and "after". The dataset is only meant to be an evaluation set and not a training set.

5 papers · 2 benchmarks · Texts, Videos
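
The probing idea behind Test-of-Time can be pictured as pairing each ordered event pair with a correct caption and its temporally flipped counterpart, then checking which one a video-language model scores higher. The helper below is my own illustration of that setup, not the dataset's generation code.

    def temporal_probe(event_a: str, event_b: str):
        """event_a occurs before event_b in the video; return (correct, foil) captions."""
        correct = f"{event_a} before {event_b}"
        foil = f"{event_a} after {event_b}"   # same events, reversed temporal relation
        return correct, foil

    def probe_accuracy(scored_pairs):
        """scored_pairs: iterable of (model score for correct caption, score for foil)."""
        pairs = list(scored_pairs)
        return sum(c > f for c, f in pairs) / len(pairs)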
Page 26 of 51