The COCO-MIG benchmark (Common Objects in Context Multi-Instance Generation) evaluates a generator's ability to follow text prompts that assign attributes to multiple object instances. The benchmark consists of 800 example sets sampled from the COCO dataset. Following each COCO layout, every instance is assigned random color information, and a corresponding global image description is constructed from templates. COCO-MIG also provides a complete pipeline for resampling and evaluation. For relevant tools and specific details, please refer to our project's homepage.
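For illustration, here is a minimal sketch of how a template-based prompt with per-instance colors might be constructed from a COCO-style layout; the template wording, color list, and function names are hypothetical, not the benchmark's actual pipeline:

```python
import random

# Hypothetical template-based prompt construction for a COCO-MIG-style
# benchmark; the template and names below are illustrative assumptions.
COLORS = ["red", "blue", "green", "yellow", "black", "white"]

def build_prompt(instances):
    """instances: (category_name, bbox) pairs sampled from a COCO layout."""
    parts = []
    for category, _bbox in instances:
        color = random.choice(COLORS)  # assign each instance a random color
        parts.append(f"a {color} {category}")
    # Assemble the global image description from a fixed template.
    return "A photo of " + " and ".join(parts) + "."

layout = [("dog", (10, 20, 120, 160)), ("car", (200, 50, 400, 220))]
print(build_prompt(layout))  # e.g. "A photo of a red dog and a blue car."
```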
Serving as an out-of-domain xMR test dataset, MSVAMP allows for a more comprehensive evaluation of a model's multilingual mathematical capabilities, enabling researchers to assess performance beyond the training-data domain.
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline some of the bottlenecks to be addressed for democratizing the collection of datasets like this one.
The SB10k dataset is a corpus of roughly 10,000 German tweets annotated with sentiment labels, making it a valuable resource for sentiment analysis in German.
The OSM dataset, sourced from OpenStreetMap, comprises rasterized semantic maps and height fields of 80 cities worldwide, spanning an area of more than 6,000 km². During rasterization, vectorized geometry is converted into images by translating longitude and latitude into the EPSG:3857 coordinate system at zoom level 18, approximately 0.597 meters per pixel.
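The zoom-18 resolution quoted above follows from the standard Web-Mercator tiling math; here is a minimal sketch (not the dataset's own code) of the pixel mapping and ground resolution:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by EPSG:3857
TILE_SIZE = 256

def lonlat_to_pixel(lon_deg, lat_deg, zoom=18):
    """Map longitude/latitude (degrees) to global pixel coordinates."""
    scale = TILE_SIZE * (2 ** zoom)
    x = (lon_deg + 180.0) / 360.0 * scale
    lat = math.radians(lat_deg)
    y = (1.0 - math.log(math.tan(lat) + 1.0 / math.cos(lat)) / math.pi) / 2.0 * scale
    return x, y

def meters_per_pixel(zoom=18):
    """Ground resolution at the equator for a given zoom level."""
    return 2 * math.pi * EARTH_RADIUS_M / (TILE_SIZE * (2 ** zoom))

print(round(meters_per_pixel(18), 3))  # 0.597, matching the figure above
```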
VideoXum is an enriched large-scale dataset for cross-modal video summarization, built on ActivityNet Captions. The dataset includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).
AesBench is an expert benchmark designed to comprehensively evaluate the image aesthetics perception capacities of Multimodal Large Language Models (MLLMs).
Vibe-Eval is an open benchmark and framework for evaluating multimodal chat models. Introduced by Reka, it is designed to rigorously test these models' visual understanding capabilities.
Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This article provides a summary of the ASVspoof 2021 challenge and the results of 54 participating teams that submitted to the evaluation phase. For the logical access (LA) task, results indicate that countermeasures are robust to newly introduced encoding and transmission effects. Results for the physical access (PA) task indicate the potential to detect replay attacks in real, as opposed to simulated, physical spaces, but a lack of robustness to variations between simulated and real acoustic environments. The deepfake (DF) task, new to the 2021 edition, targets solutions to the detection of manipulated speech.
SafeBench is a benchmarking platform designed for the safety evaluation of autonomous vehicles (AVs) in safety-critical scenarios. It aims to provide a unified platform that integrates various types of safety-critical testing scenarios, scenario generation algorithms, and other variations such as driving routes and environments. The platform implements four deep reinforcement learning-based AV algorithms with four types of input to perform fair comparisons on SafeBench.
The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Large Language Models (LLMs). It was developed to provide a cheaper and faster approximation of human preference.
Zero-shot 3D classification performance on ModelNet40 for models pretrained only on ShapeNet.
MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, are designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations.
A benchmark to evaluate the tool-use capabilities of LLM-based agents in real-world scenarios.
The HInt dataset is frequently used as a generalizability benchmark for 3D hand reconstruction. It comprises three subsets, HInt-NewDays, HInt-VISOR, and HInt-Ego4D, and aims to complement existing datasets used for training and evaluating 3D hand pose estimation. HInt annotates 2D locations and occlusion labels for 21 hand keypoints. It is built on three existing datasets (Hands23, Epic-Kitchens VISOR, and Ego4D), providing new annotations for their images.
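As an illustration of what keypoint-plus-occlusion annotations can look like, here is a minimal sketch; the class and field names are hypothetical assumptions, not HInt's actual schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HandAnnotation:
    """Hypothetical record: 21 hand keypoints with per-keypoint occlusion flags."""
    image_path: str
    keypoints: List[Tuple[float, float]]  # 21 (x, y) locations in pixels
    occluded: List[bool]                  # True where a joint is occluded

    def visible_keypoints(self):
        # Keep only the keypoints not marked as occluded.
        return [kp for kp, occ in zip(self.keypoints, self.occluded) if not occ]
```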
These are 10 synthetic genomics datasets generated with NEAT v3 (based on the TP53 gene of Homo sapiens) for the use case of benchmarking somatic variant callers. To learn more about our generation framework, please visit the synth4bench GitHub repository.
RFUND is a relabeled version of the FUNSD and XFUND datasets, addressing several issues in their original annotations.
English subset of RFUND
Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
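A minimal sketch of streaming such review records, assuming a gzipped JSON-lines release; the file name and field names (overall, reviewText, helpful) are assumptions rather than a confirmed schema:

```python
import gzip
import json

def iter_reviews(path):
    """Yield one review record per line from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical file name; the field names below are assumptions.
for review in iter_reviews("reviews.json.gz"):
    rating = review.get("overall")    # star rating
    text = review.get("reviewText")   # review body
    votes = review.get("helpful")     # helpfulness votes, e.g. [2, 3]
```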