Datasets

3,275 machine learning datasets

3,275 dataset results

Cats and Dogs

A large set of images of cats and dogs.

SynWoodScape (Synthetic Surround-view Fisheye Camera Dataset for Autonomous Driving)

SynWoodScape is a synthetic version of the surround-view dataset covering many of its weaknesses and extending it. WoodScape comprises four surround-view cameras and nine tasks, including segmentation, depth estimation, 3D bounding box detection, and a novel soiling detection. Semantic annotation of 40 classes at the instance level is provided for over 10,000 images. With WoodScape, we would like to encourage the community to adapt computer vision models for the fisheye camera instead of using naive rectification.

13 papers0 benchmarksImages

ViQuAE

ViQuAE is a dataset for KVQAE (Knowledge-based Visual Question Answering about named Entities), a task which consists in answering questions about named entities grounded in a visual context using a Knowledge Base. It is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). We argue that KVQAE is a clear, well-defined task that can be evaluated easily, making it suitable to track the progress of multimodal entity representation’s quality. Multimodal entity representation is a central issue that will allow to make human-machine interactions more natural. For example, while watching a movie, one might wonder ‘‘Where did I already see this actress?’’ or ‘‘Did she ever win an Oscar?’’

13 papers0 benchmarksImages, Texts

Flare7K

Flare7K, the first nighttime flare removal dataset, which is generated based on the observation and statistic of real-world nighttime lens flares. It offers 5,000 scattering flare images and 2,000 reflective flare images, consisting of 25 types of scattering flares and 10 types of reflective flares. The 7,000 flare patterns can be randomly added to the flare-free images, forming the flare-corrupted and flare-free image pairs.

13 papers6 benchmarksImages

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection solely based on monocular RGB images gained popularity. In order to facilitate this task as well as to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground truth annotations of vehicles are usually obtained using lidar point clouds, which often induces errors due to imperfect calibration or synchronization between both sensors. To this end, we propose Cityscapes 3D, extending the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection in the RGB image and a higher range of annotations compared to lidar-based approaches. In order to ease multi

13 papers10 benchmarksImages

3D Ken Burns

This dataset accompanies our paper on synthesizing the 3D Ken Burns effect from a single image. It consists of 134041 captures from 32 virtual environments where each capture consists of 4 views. Each view contains color-, depth-, and normal-maps at a resolution of 512x512 pixels.

13 papers0 benchmarksImages, RGB-D, Stereo

SIMMC2.0

Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed towards virtual assistance fall short and do not situate the dialog in the user's multimodal context. To overcome, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collected using a two-phase pipeline: (1) A novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions, (2) Manual paraphrasing of the generated utterances to collect diverse referring expressions. We provide an in-depth analysis of the collected dataset, and describe in detail the four main benchmark tasks we propose. Our

13 papers3 benchmarksImages, Texts

WaterScenes

A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces description of the dataset

13 papers6 benchmarksImages, Point cloud

DNA-Rendering

DNA-Rendering is a large-scale, high-fidelity repository of human performance data for neural actor rendering. It contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames' data volume. Upon the massive collections, the authors provide human subjects with grand categories of pose actions, body shapes, clothing, accessories, hairdos, and object intersection, which ranges the geometry and appearance variances from everyday life to professional occasions. Second, they provide rich assets for each subject -- 2D/3D human body keypoints, foreground masks, SMPLX models, cloth/accessory materials, multi-view images, and videos. These assets boost the current method's accuracy on downstream rendering tasks. Third, they construct a professional multi-View system to capture data, which contains 60 synchronous cameras with max 4096×3000 resolution, 15 fps speed, and stern camera calibration steps, ensuring high-quality resources for task training and evaluation.

13 papers0 benchmarksImages, Videos

The COLOSSEUM (The COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation)

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.

13 papers1 benchmarksImages, Texts

Touch and Go

This dataset encompasses a diverse range of tactile features that are instrumental in bifurcating various material properties. Three downstream tasks are considered: 1) categorization of materials, 2) distinction between hard and soft surfaces, and 3) distinction between smooth and textured surfaces.

13 papers0 benchmarksImages, Videos

Amazon Baby (Amazon Baby 5-core)

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

13 papers4 benchmarksImages, Texts

HRSC2016 (High resolution ship collections 2016)

High-resolution ship collections 2016 (HRSC2016) is a data set used for scientific research. Currently, all of the images in HRSC2016 were collected from Google Earth.

12 papers10 benchmarksImages

PeopleArt

People-Art is an object detection dataset which consists of people in 43 different styles. People contained in this dataset are quite different from those in common photographs. There are 42 categories of art styles and movements including Naturalism, Cubism, Socialist Realism, Impressionism, and Suprematism

12 papers20 benchmarksImages

TrajNet

The TrajNet Challenge represents a large multi-scenario forecasting benchmark. The challenge consists on predicting 3161 human trajectories, observing for each trajectory 8 consecutive ground-truth values (3.2 seconds) i.e., t−7,t−6,…,t, in world plane coordinates (the so-called world plane Human-Human protocol) and forecasting the following 12 (4.8 seconds), i.e., t+1,…,t+12. The 8-12-value protocol is consistent with the most trajectory forecasting approaches, usually focused on the 5-dataset ETH-univ + ETH-hotel + UCY-zara01 + UCY-zara02 + UCY-univ. Trajnet extends substantially the 5-dataset scenario by diversifying the training data, thus stressing the flexibility and generalization one approach has to exhibit when it comes to unseen scenery/situations. In fact, TrajNet is a superset of diverse datasets that requires to train on four families of trajectories, namely 1) BIWI Hotel (orthogonal bird’s eye flight view, moving people), 2) Crowds UCY (3 datasets, tilted bird’s eye view

12 papers0 benchmarksImages

HIC (Hands in Action)

The Hands in action dataset (HIC) dataset has RGB-D sequences of hands interacting with objects.

12 papers0 benchmarksImages, RGB-D, Videos

Partial-iLIDS

Partial iLIDS is a dataset for occluded person person re-identification. It contains a total of 476 images of 119 people captured by 4 non-overlapping cameras. Some images contain people occluded by other individuals or luggage.

12 papers0 benchmarksImages

TrashCan

The TrashCan dataset is an instance-segmentation dataset of underwater trash. It is comprised of annotated images (7,212 images) which contain observations of trash, ROVs, and a wide variety of undersea flora and fauna. The annotations in this dataset take the format of instance segmentation annotations: bitmaps containing a mask marking which pixels in the image contain each object. The imagery in TrashCan is sourced from the J-EDI (JAMSTEC E-Library of Deep-sea Images) dataset, curated by the Japan Agency of Marine Earth Science and Technology (JAMSTEC).

12 papers0 benchmarksImages

VOT2014 (Visual Object Tracking Challenge 2014)

The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool of sequences using a methodology based on clustering visual features of object and background so that those 25 sequences sample evenly well the existing pool.

12 papers2 benchmarksImages, Tracking, Videos

FLAME (Fire Luminosity Airborne-based Machine learning Evaluation)

FLAME is a fire image dataset collected by drones during a prescribed burning piled detritus in an Arizona pine forest. The dataset includes video recordings and thermal heatmaps captured by infrared cameras. The captured videos and images are annotated and labeled frame-wise to help researchers easily apply their fire detection and modeling algorithms.

12 papers6 benchmarksImages

PreviousPage 46 of 164Next