Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable Benchmarks · All SotA · Datasets · Papers · Methods

Community

Submit Results · About

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

MiniHack

MiniHack is a sandbox framework for easily designing rich and diverse environments for Reinforcement Learning (RL). MiniHack includes a collection of example environments that can be used to test various capabilities of RL agents, as well as serve as building blocks for researchers wishing to develop their own environments. MiniHack's navigation tasks challenge the agent to reach the goal position by overcoming various difficulties along the way, such as fighting monsters in corridors, crossing a river by pushing boulders into it, or navigating complex, procedurally generated mazes. MiniHack's skill acquisition tasks enable utilising the rich diversity of NetHack objects, monsters and dungeon features, and the interactions between them. The skill acquisition tasks feature a large action space (75 actions), where the actions are instantiated differently depending on which object they act on.
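A minimal usage sketch, assuming the `minihack` and `gym` packages are installed; "MiniHack-River-v0" is one of MiniHack's registered example navigation tasks, and the 4-tuple `step` return below assumes the older Gym API:

```python
import gym
import minihack  # noqa: F401  # importing registers the MiniHack environments with Gym

env = gym.make("MiniHack-River-v0")  # river-crossing navigation task
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, purely for illustration
    obs, reward, done, info = env.step(action)  # newer gymnasium returns a 5-tuple instead
env.close()
```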

26 papers · 0 benchmarks

M2DGR (a Multi-modal and Multi-scenario SLAM Dataset for Ground Robots)

We collected long-term, challenging sequences for ground robots both indoors and outdoors with a complete sensor suite, which includes six surround-view fish-eye cameras, a sky-pointing fish-eye camera, a perspective color camera, an event camera, an infrared camera, a 32-beam LiDAR, two GNSS receivers, and two IMUs. To our knowledge, this is the first SLAM dataset focusing on ground-robot navigation with such rich sensory information. We recorded trajectories in challenging scenarios such as lifts and complete darkness, which can easily cause existing localization solutions to fail. These situations are commonly faced in ground-robot applications, yet they are seldom discussed in previous datasets. We also launched a comprehensive benchmark for ground-robot navigation, on which we evaluated existing state-of-the-art SLAM algorithms of various designs and analyzed their characteristics and defects individually.

26 papers · 0 benchmarks

CVSS

CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.

26 papers · 2 benchmarks · Audio, Speech, Texts

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension for kindergarten to eighth-grade students. Annotated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 child-friendly story narratives, covering seven types of narrative elements or relations. It can support both narrative Question Generation (QG) and narrative Question Answering (QA) tasks.

26 papers · 3 benchmarks · Texts

Chart-to-text

Chart-to-text is a large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types.

26 papers · 0 benchmarks

Animal Kingdom

Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage in the dataset was recorded at different times of day in an extensive range of environments, with variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated video for localizing relevant animal behavior segments in long videos (the video grounding task), 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, covering a diverse range of animals: 850 species across 6 major animal classes.

26 papers · 6 benchmarks · Images, Videos

FlickrLogos-32

An object detection benchmark for logo detection, covering 32 logo classes in real-world Flickr images.

26 papers · 7 benchmarks · Images

VALSE (VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena)

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

26 papers · 4 benchmarks · Images, Texts

RSITMD


26 papers · 9 benchmarks

UrbanCars

UrbanCars facilitates multi-shortcut learning in a controlled setting with two shortcuts: background and co-occurring object. The task is to classify the car body type into two categories, urban car and country car. The dataset contains three splits: training, validation, and testing. In the training set, the two shortcuts spuriously correlate with the car body type. Both the validation and testing sets are balanced, i.e., free of spurious correlations; the validation set is used for model selection, and the testing set evaluates how well the two shortcuts are mitigated.

26 papers · 0 benchmarks · Images

DIV2KRK (DIV2K Random Kernel)

Using the validation set (100 images) from the widely used DIV2K dataset, we blurred and subsampled each image with a different, randomly generated kernel. Kernels were 11×11 anisotropic Gaussians with random lengths λ1, λ2 ∼ U(0.6, 5), independently distributed for each axis, and rotated by a random angle θ ∼ U[−π, π].
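A minimal sketch of this kernel-generation procedure, under the assumption that the lengths λ1, λ2 act as the standard deviations of the two Gaussian axes (the function name and grid construction are illustrative, not taken from the original code):

```python
import numpy as np

def random_anisotropic_gaussian_kernel(size=11, rng=None):
    """Sample one DIV2KRK-style kernel: 11x11 anisotropic Gaussian, random rotation."""
    rng = np.random.default_rng(rng)
    lam1, lam2 = rng.uniform(0.6, 5.0, size=2)   # per-axis widths, lam ~ U(0.6, 5)
    theta = rng.uniform(-np.pi, np.pi)           # rotation angle, theta ~ U[-pi, pi]
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Covariance of the rotated Gaussian: R diag(lam1^2, lam2^2) R^T
    cov = rot @ np.diag([lam1**2, lam2**2]) @ rot.T
    inv_cov = np.linalg.inv(cov)
    # Evaluate the Gaussian density on a grid centered at the kernel's middle
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    pts = np.stack([xx, yy], axis=-1)
    k = np.exp(-0.5 * np.einsum("...i,ij,...j->...", pts, inv_cov, pts))
    return k / k.sum()                           # normalize so the kernel sums to 1

kernel = random_anisotropic_gaussian_kernel()
```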

26 papers · 0 benchmarks

SemanticKITTI-C

Robo3D: the SemanticKITTI-C benchmark. SemanticKITTI-C is an evaluation benchmark aimed at robust and reliable 3D semantic segmentation in autonomous driving. It probes the robustness of 3D segmentors under out-of-distribution (OoD) scenarios, against natural corruptions that occur in real-world environments.
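For illustration only, two generic point-cloud corruptions in the spirit of such robustness probing; these are hypothetical examples, not the benchmark's actual corruption taxonomy:

```python
import numpy as np

def jitter(points: np.ndarray, sigma: float = 0.02) -> np.ndarray:
    """Add Gaussian noise to XYZ coordinates (points has shape [N, 3])."""
    return points + np.random.normal(0.0, sigma, size=points.shape)

def random_dropout(points: np.ndarray, drop_ratio: float = 0.3) -> np.ndarray:
    """Randomly discard a fraction of points, crudely simulating sparse LiDAR returns."""
    keep = np.random.rand(len(points)) >= drop_ratio
    return points[keep]
```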

26 papers · 3 benchmarks · Point cloud

DALES (DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation)

We present the Dayton Annotated LiDAR Earth Scan (DALES) data set, a new large-scale aerial LiDAR data set with over a half-billion hand-labeled points spanning 10 square kilometers of area and eight object categories. Large annotated point cloud data sets have become the standard for evaluating deep learning methods. However, most of the existing data sets focus on data collected from a mobile or terrestrial scanner, with few focusing on aerial data. Point cloud data collected from an Aerial Laser Scanner (ALS) presents a new set of challenges and applications in areas such as 3D urban modeling and large-scale surveillance. DALES is the most extensive publicly available ALS data set, with over 400 times the number of points and six times the resolution of other currently available annotated aerial point cloud data sets. This data set gives a critical number of expert-verified, hand-labeled points for the evaluation of new 3D deep learning algorithms, helping to expand the focus of current research toward aerial data.

26 papers · 21 benchmarks · 3D, LiDAR, Point cloud

XStoryCloze

XStoryCloze consists of professional translations of the English StoryCloze dataset (Spring 2016 version) into 10 non-English languages. The dataset is intended for evaluating the zero- and few-shot learning capabilities of multilingual language models. It is released by Meta AI.
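A common zero-shot recipe for StoryCloze-style tasks is to score each candidate ending by its likelihood under the language model and pick the more probable one. A minimal sketch, where the model name is a stand-in (XStoryCloze targets multilingual LMs) and scoring via the model's built-in loss is one simple choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in; use a multilingual LM in practice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_loss(context: str, ending: str) -> float:
    """Average negative log-likelihood of context + ending under the LM.
    For simplicity the context tokens are included in the loss; masking
    them (labels = -100) is a common refinement."""
    ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def choose_ending(context: str, ending1: str, ending2: str) -> str:
    # Lower average NLL means the ending is more likely under the model
    return ending1 if ending_loss(context, ending1) < ending_loss(context, ending2) else ending2
```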

26 papers · 0 benchmarks · Texts

HumanEvalPack

HumanEvalPack is an extension of OpenAI's HumanEval covering a total of 6 programming languages across 3 tasks. The evaluation suite was created entirely by humans.
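HumanEval-style suites are typically reported with the unbiased pass@k estimator from the original HumanEval paper, computed from n generated samples per problem of which c pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples (drawn
    without replacement from n generations, c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15, which reduces to c/n for k=1
```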

26 papers · 2 benchmarks

DUDE (Document UnderstanDing of Everything)

DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, whether they can navigate and reason over the layout, and whether they can generalize these skills to different document types and domains. Since question-answer pairs about, e.g., ticked checkboxes cannot be provided for every document instance or document type, the challenge presented by DUDE is equally characterized as a Multi-Domain Long-Tailed Recognition problem.

26 papers · 0 benchmarks · Images, Texts

ImageNet-1k vs iNaturalist

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while iNaturalist is out-of-distribution.
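A minimal sketch of a common evaluation recipe for such a benchmark: score each sample with the maximum softmax probability (MSP) of a classifier trained on ImageNet-1k, then measure how well the score separates in-distribution from out-of-distribution samples via AUROC. MSP is a baseline choice of score, not something the benchmark mandates:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def msp_scores(logits: np.ndarray) -> np.ndarray:
    """Max softmax probability per sample; higher = more in-distribution."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def ood_auroc(id_logits: np.ndarray, ood_logits: np.ndarray) -> float:
    """AUROC of the MSP score; 1.0 means perfect ID/OOD separation."""
    scores = np.concatenate([msp_scores(id_logits), msp_scores(ood_logits)])
    labels = np.concatenate([np.ones(len(id_logits)), np.zeros(len(ood_logits))])
    return roc_auc_score(labels, scores)
```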

26 papers · 3 benchmarks

Q-Bench

Q-Bench covers three realms of low-level vision: perception (A1), description (A2), and assessment (A3).

  • For perception (A1) and description (A2), two benchmark datasets are collected: LLVisionQA and LLDescribe.
  • For assessment (A3), public datasets are used, and abstract evaluation code is provided so that anyone can test arbitrary MLLMs.

26 papers · 0 benchmarks

MSU SR-QA Dataset (MSU Super-Resolution Quality Assessment Dataset)

The dataset was assembled from videos in the MSU Video Upscalers Benchmark Dataset, the MSU Video Super-Resolution Benchmark Dataset, and the MSU Super-Resolution for Video Compression Benchmark Dataset. It consists of real videos (filmed with two cameras), video game footage, movies, cartoons, and dynamic ads.

26 papers · 12 benchmarks · Videos

AlignBench

AlignBench is a comprehensive benchmark designed specifically for evaluating the alignment performance of Chinese large language models (LLMs). It focuses on assessing how well these models align with human intent across multiple dimensions.

26 papers · 0 benchmarks

Previous · Page 89 of 1000 · Next