19,997 machine learning datasets
Due to the lack of training data for video waterdrop removal, we propose a large-scale synthetic dataset with simulated waterdrops in complex driving scenes on rainy days.
We collect, organize, and open-source Infinity-MM, a large-scale multimodal instruction dataset consisting of tens of millions of samples. Through quality filtering and deduplication, the dataset achieves high quality and diversity. We also propose a synthetic data generation method based on open-source models and a labeling system, using detailed image annotations and diverse question generation.
We introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which uses a hybrid framework combining LLM and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into 7 other languages, resulting in a dataset of 4501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. Every model tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 average accuracy over all languages at the first turn to 0.707 at the third turn. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
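As a rough illustration of how such a turn-wise score can be computed, here is a minimal Python sketch (not the official Multi-IF evaluation code; the record fields and values are hypothetical) that averages per-language accuracy at each turn:

```python
from collections import defaultdict

# Hypothetical per-turn outcomes: one record per (conversation, turn).
results = [
    {"language": "en", "turn": 1, "correct": True},
    {"language": "hi", "turn": 1, "correct": False},
    {"language": "en", "turn": 3, "correct": False},
    # ... one entry per evaluated turn
]

per_turn = defaultdict(lambda: defaultdict(list))  # turn -> language -> list of 0/1
for r in results:
    per_turn[r["turn"]][r["language"]].append(int(r["correct"]))

for turn in sorted(per_turn):
    # Average within each language first, then across languages, so that no
    # single language dominates the turn-level score.
    lang_acc = [sum(v) / len(v) for v in per_turn[turn].values()]
    print(f"turn {turn}: mean accuracy over languages = {sum(lang_acc) / len(lang_acc):.3f}")
```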
ComplexCodeEval is an evaluation benchmark designed to accommodate multiple downstream tasks, accurately reflect different programming environments, and deliberately avoid data leakage issues. This benchmark includes a diverse set of samples from real-world projects, aiming to closely mirror actual development scenarios.
A synthetically generated QA dataset for text-based reasoning. Each sample consists of a True/False question and the two pieces of information required to answer it (the context); for each sample we create multiple versions of different lengths by embedding the context parts within longer, irrelevant texts. To ensure that models utilize their entire input, the dataset is composed of tasks in which both pieces of information must be reasoned over together in order to answer the question correctly. At the same time, we keep the tasks simple enough that models answer most of them correctly when the information pieces are presented on their own, with no additional padding.
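A minimal sketch of this padding idea, assuming a simple sentence-level filler pool; all names, the filler source, and the target lengths are illustrative placeholders, not the authors' generation pipeline:

```python
import random

def pad_sample(question, fact_a, fact_b, filler_sentences, target_lengths):
    """Return one padded version of the sample per target length (in sentences)."""
    versions = []
    for n in target_lengths:
        # Draw n irrelevant sentences, then hide the two required facts at
        # random positions inside them.
        body = random.sample(filler_sentences, k=min(n, len(filler_sentences)))
        for fact in (fact_a, fact_b):
            body.insert(random.randrange(len(body) + 1), fact)
        versions.append({"question": question, "context": " ".join(body), "length": n})
    return versions

versions = pad_sample(
    question="Is X older than Y? (True/False)",
    fact_a="X was born in 1970.",
    fact_b="Y was born in 1985.",
    filler_sentences=[f"Irrelevant sentence {i}." for i in range(1000)],
    target_lengths=[0, 50, 200, 800],
)
```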
Introduced in the paper Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text.
The Helvipad dataset is a real-world stereo dataset designed for omnidirectional depth estimation. It comprises 39,553 paired equirectangular images captured using a top-bottom 360° camera setup and corresponding pixel-wise depth and disparity labels derived from LiDAR point clouds. The dataset spans diverse indoor and outdoor scenes under varying lighting conditions, including night-time environments.
This is a dataset for high dynamic range novel view synthesis. It was collected by HDR-NeRF and recalibrated by HDR-GS for research on 3DGS-based algorithms. The dataset contains 8 synthetic scenes and 4 real scenes.
We introduce HourVideo, a benchmark dataset for hour-long video-language understanding. HourVideo consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. We hope to establish HourVideo as a benchmark challenge to spur the development of advanced multimodal models capable of truly understanding endless streams of visual data.
Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning: the question contains a referring query that references related video contexts, called the referred context, and the model is required to reason over relevant video details from that referred context.
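To make the task format concrete, here is a rough sketch of what a single referring-reasoning item could look like; the field names, timestamps, and example content are illustrative guesses, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ReferringReasoningItem:
    video_id: str            # web-collected video with subtitles
    question: str            # contains the referring query
    referred_context: tuple  # (start_sec, end_sec) span the query points to
    options: list            # multiple-choice candidates
    answer_index: int        # index of the correct option

# Hypothetical example item.
item = ReferringReasoningItem(
    video_id="example_video",
    question="After the person in the red jacket opens the door, what does she pick up?",
    referred_context=(1520.0, 1555.0),
    options=["A mug", "A phone", "Keys", "A book"],
    answer_index=2,
)
```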
A large-scale and diverse duet interactive dance dataset, recording about 117 minutes of professional dancers' performances.
The PECAN dataset provides structural data for antibody-antigen interactions, specifically curated for paratope and epitope binding site prediction. It includes a diverse set of antibody-antigen complexes, ensuring a well-balanced and representative dataset for training and evaluating deep learning models in protein-protein interaction (PPI) tasks.
A dense-text image benchmark for evaluating large generative models' ability at text generation.
We introduce a challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks, termed Open6DOR.
A multidimensional abstract visual reasoning (AVR) benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
StreamingBench evaluates Multimodal Large Language Models (MLLMs) in real-time, streaming video understanding tasks. 🌟
RefRef is a synthetic dataset and benchmark designed for the task of reconstructing scenes with complex refractive and reflective objects. Our dataset consists of 50 objects categorized based on their geometric and material complexity: single-material convex objects, single-material non-convex objects, and multi-material non-convex objects, where the materials have different colors, opacities, and refractive indices. Each object is placed in three distinct bounded environments and one unbounded environment, resulting in 150 unique scenes with diverse geometries, material properties, and backgrounds. Our dataset provides a controlled setting for evaluating and developing 3D reconstruction and novel view synthesis methods that handle complex optical effects.
The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset, which provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization.
The ImplicitQA dataset was introduced in the paper ImplicitQA: Going Beyond Frames Towards Implicit Video Reasoning.
The PCBA dataset is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening (HTS) assays.
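As a rough illustration (not an official loader), a multitask HTS benchmark of this kind amounts to a sparse molecule-by-assay label matrix in which most entries were never measured; the molecule count below is a placeholder, and only the 128-assay figure comes from the description above:

```python
import numpy as np

n_molecules, n_assays = 10_000, 128
labels = np.full((n_molecules, n_assays), np.nan)  # NaN marks untested (molecule, assay) pairs
labels[0, 5] = 1.0                                 # example: molecule 0 active in assay 5
labels[0, 7] = 0.0                                 # example: molecule 0 inactive in assay 7

measured = ~np.isnan(labels)                       # mask of measured pairs
# A multitask model shares one molecular encoder across all 128 assay heads,
# and the training loss is evaluated only where `measured` is True.
```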