UPLight is an underwater RGB-Polarization multimodal semantic segmentation dataset with 12 typical underwater semantic classes.
T2Ranking is a large-scale Chinese benchmark for passage ranking. It comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators were recruited to provide 4-level graded relevance scores (fine-grained) for query-passage pairs instead of binary relevance judgments (coarse-grained). To ease the false negative issue, more passages with higher diversity are considered during relevance annotation, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, auxiliary resources are also provided, such as query types and XML files of the documents from which passages are generated, to facilitate further studies.
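Graded 4-level relevance labels like these pair naturally with rank-aware metrics such as nDCG. Below is a minimal sketch of nDCG@k over one query's ranked graded labels; the metric choice is an illustrative assumption, not something stated above.

```python
import math

def ndcg_at_k(ranked_labels, k=10):
    """nDCG@k for one query, where `ranked_labels` are the graded
    relevance labels (0-3) of the returned passages in ranked order."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

# A ranking whose passages carry graded labels [3, 0, 2, 1]:
print(ndcg_at_k([3, 0, 2, 1]))  # ~0.95
```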
SummEdits is a benchmark designed to measure the ability of Large Language Models (LLMs) to reason about facts and detect inconsistencies. It was built following a newly proposed protocol for creating inconsistency detection benchmarks.
tStoryCloze ("Topic StoryCloze") is a spoken version of the textual StoryCloze benchmark, in which the goal is to evaluate continuation coherence given a spoken prompt. It is designed to be easier than its companion spoken benchmark, sStoryCloze, which evaluates fine-grained causal and temporal commonsense relations. tStoryCloze is used to assess the ability of SpeechLMs to maintain coherence in spoken language given a prompt; together with sStoryCloze, it provides valuable insight into how well SpeechLMs capture different aspects of spoken content.
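A common way to score such continuation-coherence benchmarks is likelihood comparison: the model is counted correct when it assigns higher log-likelihood to the true continuation than to a distractor. A minimal sketch follows; `score_loglik` is a hypothetical scoring hook for whatever SpeechLM is under test, and the protocol itself is an assumption rather than something stated above.

```python
def cloze_accuracy(examples, score_loglik):
    """Fraction of prompts for which the model prefers the true continuation.

    `examples` is a list of (prompt, true_continuation, distractor) triples;
    `score_loglik(prompt, continuation)` is a hypothetical hook returning the
    SpeechLM's log-likelihood of `continuation` given `prompt`.
    """
    correct = sum(
        score_loglik(p, true_cont) > score_loglik(p, distractor)
        for p, true_cont, distractor in examples
    )
    return correct / len(examples)
```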
The image aesthetic benchmark [18] consists of 10,800 Flickr photos of four categories, i.e., “animals”, “urban”, “people” and “nature”, and was originally constructed to retrieve beautiful yet unpopular images in social networks. The ground truths of the photos in the benchmark are five aesthetic grades: “Unacceptable” (images with extremely low quality, out of focus or underexposed), “Flawed” (images with some technical flaws and without any artistic value), “Ordinary” (standard-quality images without technical flaws), “Professional” (professional-quality images with some artistic value), and “Exceptional” (very appealing images showing both outstanding professional quality and high artistic value).
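If the five grades are used as training targets, they form a natural ordinal scale. The mapping below is a hypothetical encoding for ordinal classification or regression; the numeric values are illustrative, not part of the benchmark.

```python
# Hypothetical ordinal encoding of the five aesthetic grades,
# ordered from lowest to highest perceived quality.
GRADES = {
    "Unacceptable": 0,
    "Flawed": 1,
    "Ordinary": 2,
    "Professional": 3,
    "Exceptional": 4,
}
```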
Cancer in the region of the head and neck (HaN) is one of the most prominent cancers, for which radiotherapy represents an important treatment modality that aims to deliver a high radiation dose to the targeted cancerous cells while sparing the nearby healthy organs-at-risk (OARs). A precise three-dimensional spatial description, i.e. segmentation, of the target volumes as well as OARs is required for optimal radiation dose distribution calculation, which is primarily performed using computed tomography (CT) images. However, the HaN region contains many OARs that are poorly visible in CT, but better visible in magnetic resonance (MR) images. Although attempts have been made towards the segmentation of OARs from MR images, so far there has been no evaluation of the impact the combined analysis of CT and MR images has on the segmentation of OARs in the HaN region. The Head and Neck Organ-at-Risk Multi-Modal Segmentation Challenge aims to promote the development of new, and the application of existing, segmentation methods that exploit combined CT and MR image information for OAR segmentation in the HaN region.
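Segmentation challenges of this kind are commonly scored with overlap metrics such as the Dice similarity coefficient; the sketch below assumes binary OAR masks of identical shape, and the metric choice is an assumption since the challenge's exact evaluation suite is not stated here.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary segmentation masks
    (e.g. 3D OAR masks of identical shape)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom > 0 else 1.0
```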
MatSynth is a Physically Based Rendering (PBR) materials dataset designed for modern AI applications. The dataset consists of over 4,000 ultra-high-resolution materials, offering unparalleled scale, diversity, and detail.
Understanding what makes a video memorable has a very broad range of current applications, e.g., education and learning, content retrieval and search, content summarization, storytelling, targeted advertising, and content recommendation and filtering. This task requires participants to automatically predict memorability scores for videos, reflecting the probability that a video will be remembered over both the short and long term. Participants are provided with an extensive dataset of videos with memorability annotations, related information, pre-extracted state-of-the-art visual features, and Electroencephalography (EEG) recordings.
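Memorability prediction systems are often ranked by the Spearman rank correlation between predicted and annotated scores; treating that as the metric here is an assumption. A minimal sketch with toy values:

```python
from scipy.stats import spearmanr

# Hypothetical predicted vs. annotated memorability scores (one per video).
predicted    = [0.81, 0.64, 0.92, 0.55, 0.73]
ground_truth = [0.85, 0.60, 0.88, 0.62, 0.79]

rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman rho = {rho:.3f}")  # 0.900 for these toy values
```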
The ToolE dataset encompasses various types of user queries in the form of prompts. These prompts trigger LLMs to utilize tools, including both single-tool and multi-tool scenarios. The dataset serves as a valuable resource for assessing LLMs’ understanding of tool functionality and their ability to select appropriate tools for specific tasks.
The DREAM dataset was introduced in the paper "Camera-to-Robot Pose Estimation from a Single Image" (ICRA 2020). It consists of synthetic images (both with and without domain randomization) of three robot manipulators (Franka Emika’s Panda, Kuka’s LBR iiwa 7 R800, and Rethink Robotics’ Baxter), as well as real-world images of Franka Emika’s Panda taken with various RGBD cameras (XBox 360 Kinect (XK), RealSense (RS), and Azure Kinect (AK)). Each instance contains an RGB image, 3D/2D keypoint coordinates, the global camera-to-robot transformation, and the joint state configuration of the robot (covering both revolute and prismatic joints). Tasks such as estimating the robot (camera) pose from a single RGB image and camera-to-robot calibration can be conducted and evaluated on this dataset.
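As a reading aid, here is a hypothetical container mirroring the per-instance fields listed above; the field names and array shapes are illustrative and do not reflect the dataset's actual on-disk schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DreamInstance:
    """Illustrative container for one DREAM sample (names are hypothetical)."""
    rgb: np.ndarray            # (H, W, 3) image
    keypoints_3d: np.ndarray   # (K, 3) keypoint coordinates in the robot frame
    keypoints_2d: np.ndarray   # (K, 2) keypoint projections in pixel coordinates
    cam_T_robot: np.ndarray    # (4, 4) global camera-to-robot transformation
    joint_state: np.ndarray    # joint positions (revolute and prismatic)
```

Given the 3D keypoints, the camera-to-robot transform, and the camera intrinsics, the 2D keypoints follow by perspective projection; single-image pose estimation amounts to inverting that mapping.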
Climate models are critical tools for analyzing climate change and projecting its future impact. The machine learning (ML) community has taken an increased interest in supporting climate scientists’ efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. However, traditional datasets based on single climate models are limiting. We thus present ClimateSet — a comprehensive collection of inputs and outputs from 36 climate models sourced from the Input4MIPs and CMIP6 archives, designed for large-scale ML applications.
MUSIC-AVQA v2.0 balances the original MUSIC-AVQA dataset in each QA category and sub-category. This balance results in a more reliable benchmark than the original MUSIC-AVQA dataset. See the paper "Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering" (WACV 2024) for more details.
This is a GPS trajectory dataset collected in the (Microsoft Research Asia) GeoLife project by 182 users over a period of more than five years (from April 2007 to August 2012). A GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains latitude, longitude and altitude. The dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.
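Given a trajectory as a time-ordered list of (latitude, longitude) points, its length can be accumulated pairwise with the haversine great-circle distance, which is presumably how aggregate totals like the 1.2 million kilometers above are derived. A minimal sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def trajectory_length_km(points):
    """`points` is a time-ordered list of (lat, lon) pairs for one trajectory."""
    return sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))
```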
BLUEX is a valuable benchmark dataset designed to evaluate language models in Portuguese. Despite Portuguese being the fifth most widely spoken language, there is a scarcity of freely available resources for assessing language models in this language. The BLUEX dataset addresses this gap by providing a multimodal collection of questions from the two leading university entrance exams conducted in Brazil: Comvest (Unicamp) and Fuvest (USP). These exams span from 2018 to 2024 and cover a total of 1,260 questions, 724 of which do not have accompanying images.
The HateBR dataset is a significant resource for studying offensive language and hate speech detection in Brazilian Portuguese.
The TweetSentBR dataset is a valuable resource for sentiment analysis in Brazilian Portuguese.
Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has three subtasks: Algorithmic Block Completion, Control-Flow Expression Completion, and API Function Call Completion. SAFIM is sourced from code submitted from April 2022 to January 2023 to minimize the impact of data contamination on evaluation results.
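FIM models are typically prompted in a prefix-suffix-middle (PSM) format delimited by sentinel tokens, with the model generating the missing middle. The sketch below uses illustrative sentinel strings; actual sentinels vary by model, and SAFIM's exact prompt formats are not stated here.

```python
def build_fim_prompt(prefix, suffix,
                     pre_tok="<|fim_prefix|>", suf_tok="<|fim_suffix|>",
                     mid_tok="<|fim_middle|>"):
    """Assemble a prefix-suffix-middle (PSM) prompt; the model is expected
    to generate the missing middle after `mid_tok`. The default sentinel
    tokens here are illustrative only."""
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    return ",
    suffix=" / len(xs)\n",
)
# A completion model would be expected to fill in something like `sum(xs)`.
```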
Person re-ID matches persons across multiple non-overlapping cameras. Despite the increasing deployment of airborne platforms in surveillance, existing person re-ID benchmarks focus on ground-to-ground matching, with very limited effort devoted to aerial-to-aerial matching. We propose a new benchmark dataset, AG-ReID, which performs person re-ID matching in a new setting: across aerial and ground cameras. Our dataset contains 21,983 images of 388 identities, with 15 soft attributes for each identity. The data was collected by a UAV flying at altitudes between 15 and 45 meters and by a ground-based CCTV camera on a university campus. The dataset presents a novel elevated-viewpoint challenge for person re-ID due to the significant difference in person appearance across these cameras.
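Re-ID evaluation in settings like this typically ranks gallery images by embedding similarity to a query image. The sketch below assumes precomputed feature vectors (e.g. from a re-ID network) and uses cosine similarity; both choices are illustrative rather than AG-ReID's stated protocol.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by cosine similarity to the query.

    `query_feat` is a 1-D embedding; `gallery_feats` stacks one embedding
    per gallery image as rows."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)  # best match first
```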
RARE consists of pairs of English AMRs (Abstract Meaning Representations) with similarity scores that reflect the structural differences between them.
In the "Learning to Summarize from Human Feedback" paper, a reward model was trained from human feedback and then used to train a summarization model to align with human preferences. This is the dataset of human feedback that was released for reward modelling. The dataset has two parts: comparisons and axis. In the comparisons part, human annotators were asked to choose the better of two summaries. In the axis part, human annotators rated the quality of a summary on a Likert scale. The comparisons part has only train and validation splits, and the axis part has only test and validation splits.
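Assuming the release mirrored on the Hugging Face Hub as openai/summarize_from_feedback (with "comparisons" and "axis" as its two configurations), the comparisons part can be loaded roughly as follows; the field names in the comments reflect that copy and should be verified against the actual release.

```python
from datasets import load_dataset

# Load the pairwise-preference part of the dataset.
comparisons = load_dataset("openai/summarize_from_feedback", "comparisons")
print(comparisons)  # expected: train and validation splits

ex = comparisons["train"][0]
# In the mirrored copy, each example carries two candidate summaries and
# the index of the one the annotator preferred.
print(ex["summaries"], ex["choice"])
```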