Datasets

3,275 machine learning datasets

3,275 dataset results

FlickrLogos-32

Object detection benchmark for logo detection.

VALSE (VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena)

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

26 papers4 benchmarksImages, Texts

UrbanCars

UrbanCars facilitates multi-shortcut learning under the controlled setting with two shortcuts—background and co-occurring object. The task is classifying the car body type into two categories: urban car and country car. The dataset contains three splits: training, validation, and testing. In the training set, two shortcuts spuriously correlate with the car body type. Both validation and testing sets are balanced, i.e., no spurious correlations. The validation set is used for model selection, and the testing set evaluates the mitigation of two shortcuts.

26 papers0 benchmarksImages

DUDE (Document UnderstanDing of Everything)

DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, if they can navigate and reason over the layout, and if they can generalize these skills to different document types and domains. Since we cannot provide question-answer pairs about, e.g., ticked checkboxes, on each document instance or document type, the challenge presented by DUDE is characterized equally as a Multi-Domain Long-Tailed Recognition problem

26 papers0 benchmarksImages, Texts

CrisisMMD

CrisisMMD is a large multi-modal dataset collected from Twitter during different natural disasters. It consists of several thousands of manually annotated tweets and images collected during seven major natural disasters including earthquakes, hurricanes, wildfires, and floods that happened in the year 2017 across different parts of the World. The provided datasets include three types of annotations.

25 papers0 benchmarksImages, Texts

TinyPerson

TinyPerson is a benchmark for tiny object detection in a long distance and with massive backgrounds. The images in TinyPerson are collected from the Internet. First, videos with a high resolution are collected from different websites. Second, images from the video are sampled every 50 frames. Then images with a certain repetition (homogeneity) are deleted, and the resulting images are annotated with 72,651 objects with bounding boxes by hand.

25 papers0 benchmarksImages

ROSE (Retinal OCTA SEgmentation dataset)

Retinal OCTA SEgmentation dataset (ROSE) consists of 229 OCTA images with vessel annotations at either centerline-level or pixel level.

25 papers0 benchmarksImages, Medical

DND (Darmstadt Noise Dataset)

Benchmarking Denoising Algorithms with Real Photographs

25 papers8 benchmarksImages

BAM! (Behance Artistic Media)

The Behance Artistic Media dataset (BAM!) is a large-scale dataset of contemporary artwork from Behance, a website containing millions of portfolios from professional and commercial artists. We annotate Behance imagery with rich attribute labels for content, emotions, and artistic media. We believe our Behance Artistic Media dataset will be a good starting point for researchers wishing to study artistic imagery and relevant problems.

25 papers0 benchmarksImages

MSU Video Super Resolution Benchmark: Detail Restoration

This is a dataset for a video super-resolution task. The dataset contains the most complex content for the restoration task: faces, text, QR-codes, car numbers, unpatterned textures, small details. Videos include different types of motion and different types of degradation: bicubic interpolation (BI) and Gaussian blurring and downsampling (BD). The resolution of all input video sequences is 480x320. Source: https://videoprocessing.ai/benchmarks/video-super-resolution.html Image Source: https://videoprocessing.ai/benchmarks/video-super-resolution.html

25 papers77 benchmarksImages, Videos

Chest ImaGenome

Chest ImaGenome is a dataset with a scene graph data structure to describe 242,072 images. Local annotations are automatically produced using a joint rule-based natural language processing (NLP) and atlas-based bounding box detection pipeline. Through a radiologist constructed CXR ontology, the annotations for each CXR are connected as an anatomy-centered scene graph, useful for image-level reasoning and multimodal fusion applications. Overall, the following are provided: i) 1256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, ii) over 670,000 localized comparison relations (for improved, worsened, or no change) between the anatomical locations across sequential exams, as well as ii) a manually annotated gold standard scene graph dataset from 500 unique patients.

25 papers0 benchmarksImages

ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer)

The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection. ELEVATER consists of:

25 papers5 benchmarksImages, Texts

DIOR-RSVG

DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.

25 papers0 benchmarksImages

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

24 papers1 benchmarksImages, Texts

DHF1K

DHF1K is a video saliency dataset which contains a ground-truth map of binary pixel-wise gaze fixation points and a continuous map of the fixation points after being blurred by a gaussian filter. DHF1K contains 1000 videos in total. 700 of the videos are annotated, 600 of which are used for training and 100 for validation. The remaining 300 are the testing set which are to be evaluated on a public server.

24 papers5 benchmarksImages, Videos

INRIA Person

The INRIA Person dataset is a dataset of images of persons used for pedestrian detection. It consists of 614 person detections for training and 288 for testing.

24 papers0 benchmarksImages

CCPD (Chinese City Parking Dataset)

The Chinese City Parking Dataset (CCPD) is a dataset for license plate detection and recognition. It contains over 250k unique car images, with license plate location annotations.

24 papers0 benchmarksImages

RADIATE (RAdar Dataset In Adverse weaThEr)

RADIATE (RAdar Dataset In Adverse weaThEr) is new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera and GPS/IMU. The data is collected in different weather scenarios (sunny, overcast, night, fog, rain and snow) to help the research community to develop new methods of vehicle perception. The radar images are annotated in 7 different scenarios: Sunny (Parked), Sunny/Overcast (Urban), Overcast (Motorway), Night (Motorway), Rain (Suburban), Fog (Suburban) and Snow (Suburban). The dataset contains 8 different types of objects (car, van, truck, bus, motorbike, bicycle, pedestrian and group of pedestrians).

24 papers4 benchmarksImages

Google Landmarks

The Google Landmarks dataset contains 1,060,709 images from 12,894 landmarks, and 111,036 additional query images. The images in the dataset are captured at various locations in the world, and each image is associated with a GPS coordinate. This dataset is used to train and evaluate large-scale image retrieval models.

24 papers0 benchmarksImages

FlickrStyle10K

FlickrStyle10K is collected and built on Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and stylized captions including humorous and romantic styles. However, only 7,000 pairs from the ofﬁcial training set are now publicly accessible. The dataset can be downloaded via https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip

24 papers3 benchmarksImages, Texts

PreviousPage 34 of 164Next