3,148 machine learning datasets
RUFF is a large-scale dataset to measure pronoun fidelity in English.
TwinViews-13k is a dataset of 13,855 pairs of left-leaning and right-leaning political statements, each pair matched by topic. It was created to study political bias in reward and language models, with a focus on understanding the interaction between model alignment to truthfulness and the emergence of political bias. The dataset was generated using GPT-3.5 Turbo, with extensive auditing to ensure ideological balance and topical relevance. This dataset can be used for various tasks related to political bias, natural language processing, and model alignment, particularly in studies examining how political orientation impacts model outputs.
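As a rough illustration of how such a paired dataset might be consumed, the sketch below loads it with the Hugging Face `datasets` library; the hub ID and the `topic`/`left`/`right` field names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Hypothetical hub ID and field names -- consult the dataset card for the real ones.
ds = load_dataset("twinviews-13k", split="train")

for pair in ds.select(range(3)):
    # Each record is assumed to hold one topic-matched statement pair.
    print(pair["topic"])
    print("  left :", pair["left"])
    print("  right:", pair["right"])
```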
We introduce the Chinese Image Implication Understanding Benchmark (CII-Bench), a new benchmark measuring the higher-order perception, reasoning, and comprehension abilities of MLLMs when presented with complex Chinese implication images. These images, including abstract artworks, comics, and posters, carry visual implications that require an understanding of visual details and strong reasoning ability. CII-Bench reveals whether current MLLMs, leveraging their inherent comprehension abilities, can accurately decode the metaphors embedded in the complex and abstract information these images present.
The COFAR (COmmonsense and FActual Reasoning) dataset is a collection of images and text queries specifically designed to challenge and evaluate image search models that aim to go beyond simple visual matching. It focuses on the ability of these models to perform commonsense and factual reasoning, a capability currently lacking in most existing image search technology.
HalluEditBench is a comprehensive benchmark for evaluating how effectively knowledge editing methods correct real-world hallucinations. It features a rigorously constructed dataset spanning 9 domains and 26 topics, and evaluates methods across five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness.
Welcome to InfiniteBench, a cutting-edge benchmark tailored for evaluating the ability of language models to process, understand, and reason over super-long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against context lengths of 100k+ tokens, roughly 10 times longer than traditional datasets.
Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class.
The dataset has two parts. Part 1 (raw, unlabeled): 2 million replies from 17 Telegram channels. Part 2 (raw, labeled): 15,076 replies from the same 17 channels, each categorized as no threat, judicial threat, or non-judicial threat.
LLMs' lateral thinking capabilities remain under-explored and hard to measure, owing to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking in LLMs. The benchmark contains 975 graded situation puzzles across three difficulty levels.
WildDESED extends the original DESED dataset to reflect a variety of domestic scenarios by incorporating complex and unpredictable background noises. These additions make WildDESED a powerful resource for developing and evaluating noise-robust sound event detection (SED) systems.
SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed on top of Spider. The original repository can be found at https://github.com/defog-ai/sql-eval. Our evaluation methodology is more stringent, as it compares the execution results of each predicted SQL query against a single ground-truth SQL query.
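The execution-based comparison can be pictured as: run both queries and check that they return the same rows. Below is a minimal sketch of that idea, not Defog's actual harness; it assumes a local PostgreSQL database named `sql_eval` and compares rows as unordered multisets.

```python
from collections import Counter

import psycopg2

def execution_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows."""
    with conn.cursor() as cur:
        try:
            cur.execute(predicted_sql)
            pred_rows = cur.fetchall()
        except psycopg2.Error:
            conn.rollback()  # the predicted query failed to execute
            return False
        cur.execute(gold_sql)
        gold_rows = cur.fetchall()
    # Counter makes the comparison order-insensitive.
    return Counter(pred_rows) == Counter(gold_rows)

conn = psycopg2.connect("dbname=sql_eval")  # assumed connection string
print(execution_match(conn, "SELECT 1;", "SELECT 1;"))
```

A production harness would also need to normalize column order and handle timeouts; the sketch only captures the core row-set comparison.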
We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that accounts for the distribution of object classes within a single image at test time and uses visual referring prompts to eliminate ambiguity. ROPE defines four instruction settings. In a single turn of prompting without format enforcement, the model is probed to recognize the 5 objects referred to by the visual prompts (a) one at a time in the single-object setting, or (b) concurrently in the multi-object setting. With format enforcement, the model must follow a template and decode only the object tokens for each of the five objects, (c) without output manipulation in student forcing, or (d) with all previously generated object tokens replaced by the ground-truth classes in teacher forcing. A rough sketch of how these settings differ appears below.
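The sketch below illustrates how the four settings differ in prompt construction; the prompt wording, the `<obj_k>` markers, and the `decode_object_token` helper are all hypothetical, not ROPE's actual interface.

```python
# Visual referring prompts are abbreviated here as textual markers.
objects = [f"<obj_{k}>" for k in range(1, 6)]

# (a) single-object: one question per referred object
single_object = [f"What is the object marked {o}? Answer with its class name." for o in objects]

# (b) multi-object: all five objects queried in one turn
multi_object = "List the class of each object marked " + ", ".join(objects) + "."

def forced_decoding(model, image, gold_classes=None):
    """(c) student forcing: each slot keeps the model's own prediction;
    (d) teacher forcing: each slot is overwritten with the ground-truth
    class before the next object token is decoded."""
    answers = []
    for k in range(5):
        pred = model.decode_object_token(image, answers)  # hypothetical helper
        answers.append(gold_classes[k] if gold_classes is not None else pred)
    return answers
```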
MolParser-7M is a large-scale OCSR (optical chemical structure recognition) dataset proposed in the paper “MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild”. It contains nearly 8 million paired image-SMILES samples. Note that each image caption uses the extended-SMILES format proposed in the paper.
DISC-Law-SFT comprises two subsets, DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet. The former aims to introduce legal reasoning abilities to the LLM, while the latter helps enhance the model's capability to utilize external legal knowledge.
40 personalized concepts
GEOBench-VLM is a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. The benchmark features over 10,000 manually verified instructions and covers diverse variations in visual conditions, object type, and scale.
The Underwater Trash Detection Dataset is a custom-annotated dataset designed to address the challenges of underwater trash detection caused by varying environmental features. Publicly available datasets alone are insufficient for training deep learning models due to domain-specific variation in underwater conditions. This dataset offers a cumulative, self-annotated collection of underwater images for detecting and classifying trash, providing a strong foundation for deep learning research and benchmark testing.
V2VBench is a comprehensive benchmark designed to evaluate video editing methods. It consists of:
- 50 standardized videos across 5 categories,
- 3 editing prompts per video, encompassing 4 editing tasks, and
- 8 evaluation metrics to assess the quality of edited videos.
In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task contains three subtasks, each comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that measures both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations of, and potential areas for improvement in, text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction-tuning dataset proposed to address the challenges raised by TOMG-Bench, Llama3.1-8B outperforms all open-source general LLMs, even surpassing GPT-3.5-turbo.
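TOMG-Bench's automated evaluation system is not reproduced here, but one typical ingredient of quality scoring for generated molecules is an RDKit validity check. A minimal sketch, assuming the model outputs are plain SMILES strings:

```python
from rdkit import Chem

def validity_rate(smiles_list):
    """Fraction of generated SMILES that RDKit can parse into a molecule."""
    valid = sum(Chem.MolFromSmiles(s) is not None for s in smiles_list)
    return valid / len(smiles_list)

# Toy examples: three parseable molecules and one invalid string.
generated = ["CCO", "c1ccccc1", "C1=CC=CN1", "not-a-molecule"]
print(f"validity: {validity_rate(generated):.2f}")  # 0.75
```

Measuring accuracy against a target specification (e.g., whether a MolCustom request was actually satisfied) would require task-specific checks on top of this.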