Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,275 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,275 dataset results

Kaleidoscope (Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation)

The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and number of languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages.

2 papers · 0 benchmarks · Images, Texts
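
For orientation, here is a minimal sketch of the multiple-choice evaluation loop a benchmark like this implies; the item fields and the `predict` callable are illustrative assumptions, not Kaleidoscope's actual schema or API.

```python
# Hypothetical sketch of scoring a multiple-choice VLM benchmark:
# per-language accuracy over items. Field names are assumptions.
from collections import defaultdict

def per_language_accuracy(items, predict):
    """items: iterable of dicts with 'language', 'image', 'question',
    'choices', 'answer'; predict: callable returning a choice index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = predict(item["image"], item["question"], item["choices"])
        total[item["language"]] += 1
        if pred == item["answer"]:
            correct[item["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```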

CUHK01 (CUHK Person Re-identification)

This dataset contains 971 identities from two disjoint camera views. Each identity has two samples per camera view. It is used for Person Re-identification.

1 paper · 0 benchmarks · Images

IRMA (15,363 IRMA images of 193 categories for ImageCLEFmed 2009)

This collection compiles anonymous radiographs, which have been arbitrarily selected from routine practice at the Department of Diagnostic Radiology, Aachen University of Technology (RWTH), Aachen, Germany. The imagery represents different ages, genders, view positions and pathologies; image quality therefore varies significantly. All images were downscaled to fit into a 512 x 512 bounding box while maintaining the original aspect ratio. All images were classified according to the IRMA code, and based on this code, 193 categories were defined. For 12,677 images, these categories are provided. The remaining 1,733 images without a code are used as test data for the ImageCLEFmed 2009 competition.

1 paper · 0 benchmarks · Images
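
The 512 x 512 bounding-box downscaling described above is a standard aspect-preserving resize; a minimal sketch using Pillow (the function name is ours, not part of the dataset tooling):

```python
# Fit an image into a 512 x 512 bounding box, preserving aspect ratio.
from PIL import Image

def fit_into_box(path, box=512):
    img = Image.open(path)
    scale = min(box / img.width, box / img.height)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```

Pillow's built-in `Image.thumbnail((512, 512))` performs the same fit in place.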

DroneDeploy

From DroneDeploy:

1 paper · 4 benchmarks · Images

Pan+ChiPhoto

The Pan+ChiPhoto dataset is a Chinese character dataset built by combining two datasets: ChiPhoto and the Pan_Chinese_Character dataset. The images were mainly captured outdoors in Beijing and Shanghai, China, and involve various scenes such as signs, boards, advertisements, banners, and objects with text printed on their surfaces.

1 paper · 0 benchmarks · Images, Texts

INRIA DLFD (INRIA Dense Light Field)

The INRIA Dense Light Field Dataset (DLFD) is a dataset for testing depth estimation methods on light fields. DLFD contains 39 scenes with a disparity range of [-4, 4] pixels. The light fields have a spatial resolution of 512 x 512 and an angular resolution of 9 x 9.

1 paper · 0 benchmarks · Images

INRIA SLFD (INRIA Sparse Light Field)

The INRIA Sparse Light Field Dataset (SLFD) is a dataset for testing depth estimation methods on light fields. SLFD contains 53 scenes with a disparity range of [-20, 20] pixels. The light fields have a spatial resolution of 512 x 512 and an angular resolution of 9 x 9.

1 paper · 0 benchmarks · Images
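
Both INRIA sets share the same layout: 9 x 9 angular views of 512 x 512 pixels. A sketch of one common in-memory convention for such a light field follows; this is an assumption about typical usage, not the datasets' official file format.

```python
# A 4D light field (plus color) as a dense array: angular (u, v),
# spatial (y, x), channels. A common processing convention.
import numpy as np

U, V, H, W = 9, 9, 512, 512          # angular and spatial resolution
lf = np.zeros((U, V, H, W, 3), dtype=np.float32)

center_view = lf[U // 2, V // 2]      # the central sub-aperture image
horizontal_epi = lf[U // 2, :, 256]   # an epipolar-plane image (v vs. x)
```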

Daimler Monocular Pedestrian Detection

The Daimler Monocular Pedestrian Detection dataset is a dataset for pedestrian detection in urban environments. The training set contains 15,560 pedestrian samples (image cut-outs at 48×96 resolution) and 6,744 additional full images without pedestrians for extracting negative samples. The test set is an independent sequence of more than 21,790 images with 56,492 pedestrian labels (fully visible or partially occluded), captured from a vehicle during a 27-minute drive through urban traffic.

1 paper · 0 benchmarks · Images, Interactive, Videos
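
Negative samples are typically harvested as random crops from the pedestrian-free images, at the same 48×96 resolution as the positive cut-outs; a minimal sketch under that assumption:

```python
# Random 48 x 96 negative crops from a pedestrian-free full image.
# The crop size comes from the description; the sampling is generic.
import random
from PIL import Image

def random_negative_crops(image_path, n=10, w=48, h=96):
    img = Image.open(image_path)
    crops = []
    for _ in range(n):
        x = random.randint(0, img.width - w)
        y = random.randint(0, img.height - h)
        crops.append(img.crop((x, y, x + w, y + h)))
    return crops
```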

ETHZ-Shape

The ETHZ Shape dataset contains images of five diverse shape-based classes, collected from Flickr and Google Images. The main challenges it offers are clutter, intra-class shape variability, and scale changes. The authors deliberately selected several images where the object comprises only a rather small portion of the image, and made an effort to include objects appearing at a wide range of scales. The objects are mostly unoccluded and are all taken from approximately the same viewpoint (the side).

1 paper · 0 benchmarks · Images

Short BBC Pose

Short BBC Pose contains five one-hour videos of sign language signers, each with a different sleeve length (in contrast to BBC Pose and Extended BBC Pose, which only contain signers with moderately long sleeves). Each of the five videos has 200 test frames that have been manually annotated with joint locations, amounting to 1,000 test frames in total. The test frames were selected by the authors to cover a diverse range of poses.

1 paper · 0 benchmarks · Images

VGG Cell

The VGG Cell dataset (made up entirely of synthetic images) is the main public benchmark used to compare cell counting techniques.

1 paper · 0 benchmarks · Images, Medical

SceneNet RGB-D

SceneNet RGB-D is a synthetic dataset containing large-scale photorealistic renderings of indoor scene trajectories with pixel-level annotations. Random sampling permits virtually unlimited scene configurations, and the dataset creators provide a set of 5M rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses. Each layout also has random lighting, camera trajectories, and textures. The scale of this dataset is well suited for pre-training data-driven computer vision techniques from scratch with RGB-D inputs, which has previously been limited by the relatively small labelled datasets NYUv2 and SUN RGB-D. It also provides a basis for investigating 3D scene labelling tasks by supplying perfect camera poses and depth data as a proxy for a SLAM system.

1 paper · 0 benchmarks · Images, RGB-D
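
Perfect depth and poses act as a SLAM proxy because every pixel can be back-projected to an exact world-space point. A hedged sketch of that computation follows; the intrinsics K, the z-depth convention, and the pose layout are assumptions, as SceneNet RGB-D documents its own calibration.

```python
# Back-project a depth map to world-space points using ground-truth
# intrinsics and camera pose. Assumes z-depth and a camera-to-world
# 4x4 pose; check the dataset's documented conventions before use.
import numpy as np

def backproject(depth, K, cam_to_world):
    """depth: (H, W) metric depth; K: (3, 3) intrinsics;
    cam_to_world: (4, 4) camera pose. Returns (H, W, 3) world points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays, z = 1
    pts_cam = rays * depth[..., None]        # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    return (pts_h @ cam_to_world.T)[..., :3]
```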

Freiburg Street Crossing

The Freiburg Street Crossing dataset consists of data collected at three different street crossings in Freiburg, Germany: two traffic-light-regulated intersections and one zebra crossing without traffic lights. The data can be used to train agents to cross roads autonomously.

1 paper · 0 benchmarks · Images

Freiburg Terrains

Freiburg Terrains consists of three parts: 3.7 hours of audio recordings from a microphone pointed at the robot's wheels, 24K RGB images from a camera mounted on top of the robot, and SLAM poses for each data collection run. The dataset can be used for terrain classification, which is useful for agent navigation tasks.

1 paper · 0 benchmarks · Audio, Images

DeepLocCross

DeepLocCross is a localization dataset that contains RGB-D stereo images captured at 1280 x 720 pixels at a rate of 20 Hz. The ground-truth pose labels are generated using a LiDAR-based SLAM system. In addition to the 6-DoF localization poses of the robot, the dataset contains tracked detections of the observable dynamic objects. Each tracked object is identified by a unique track ID, spatial coordinates, velocity and orientation angle. Furthermore, as the dataset covers multiple pedestrian crossings, labels indicating whether each intersection is safe to cross are provided. The dataset consists of seven training sequences with a total of 2,264 images and three testing sequences with a total of 930 images. The dynamic nature of the environment in which the dataset was captured makes localization and visual odometry estimation extremely challenging due to varying weather conditions, the presence of shadows, and motion blur caused by moving objects.

1 paper · 0 benchmarks · Images, RGB-D
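
A minimal sketch of the per-frame annotation structure the description lists; the field names mirror the text but are ours, not the dataset's actual schema.

```python
# Hypothetical record types mirroring the annotations described above.
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    track_id: int
    position: tuple        # spatial coordinates
    velocity: tuple
    orientation: float     # heading angle

@dataclass
class Frame:
    pose_6dof: tuple                       # ground-truth robot pose
    objects: list = field(default_factory=list)  # TrackedObject list
    safe_to_cross: bool = False            # per-intersection label
```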

NIST SD 19 (NIST Special Dataset 19)

NIST Special Database 19 contains NIST's entire corpus of training materials for handprinted document and character recognition. It includes handprinted sample forms from 3,600 writers, 810,000 character images isolated from those forms, ground-truth classifications for the images, reference forms for further data collection, and software utilities for image management and handling.

1 paper · 0 benchmarks · Images

PRImA

The PRImA head pose dataset consists of 2,790 images of 15 persons, each recorded twice. Pitch values lie in the interval [−60°, 60°] and yaw values in the interval [−90°, 90°], sampled with a 15° step; thus, 93 poses are available for each person. All recordings were made against the same background. One interesting feature of this dataset is that the pose space is uniformly sampled. Each sample is annotated with a manually annotated face bounding box and the corresponding yaw and pitch angle values.

1 paper · 1 benchmark · Images

KTH Multiview Football II

KTH Multiview Football II consists of images of professional footballers during a match of the Allsvenskan league. It has two parts: one with ground-truth 2D poses and one with ground-truth poses in both 2D and 3D. The 3D part has 800 time frames captured from 3 calibrated and synchronized views (2,400 images). A 3D ground-truth pose and orthographic camera matrices are provided for each frame, with 14 annotated joints. Lastly, there are 2 different players and two sequences per player.

1 paper · 0 benchmarks · Images
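
Because orthographic camera matrices and 3D ground truth are provided per frame, projecting the 14 joints to 2D is a single linear map; a sketch under the assumption of a homogeneous 2 x 4 or 3 x 4 matrix:

```python
# Orthographic projection of 3D joints: x = M @ X, no perspective divide.
# The matrix shape is an assumption based on the description.
import numpy as np

def project_orthographic(M, joints_3d):
    """M: (2, 4) or affine (3, 4) orthographic camera matrix;
    joints_3d: (14, 3) joint positions. Returns (14, 2) image points."""
    X = np.hstack([joints_3d, np.ones((joints_3d.shape[0], 1))])  # (14, 4)
    x = X @ M.T
    return x[:, :2]
```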

RMRC 2014

The RMRC 2014 indoor dataset is a dataset for indoor semantic segmentation. It uses the NYU Depth V2 and SUN3D datasets to define the training set; the test data consists of newly acquired images.

1 paper · 0 benchmarks · Images, RGB-D

NIH-LN (NIH-Lymph Node)

NIH-Lymph Node (NIH-LN) contains 388 mediastinal lymph nodes in 90 CT scans and 595 abdominal lymph nodes in 86 scans.

1 paper · 0 benchmarks · Images, Medical
Page 108 of 164