3,148 machine learning datasets
RUFF is a large-scale dataset to measure pronoun fidelity in English.
TwinViews-13k is a dataset of 13,855 pairs of left-leaning and right-leaning political statements, each pair matched by topic. It was created to study political bias in reward and language models, with a focus on understanding the interaction between model alignment to truthfulness and the emergence of political bias. The dataset was generated using GPT-3.5 Turbo, with extensive auditing to ensure ideological balance and topical relevance. This dataset can be used for various tasks related to political bias, natural language processing, and model alignment, particularly in studies examining how political orientation impacts model outputs.
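As a rough illustration of how such a paired dataset might be consumed, the sketch below loads it with the Hugging Face `datasets` library; the hub ID and the `topic`/`left`/`right` field names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Hypothetical hub ID and field names -- consult the dataset card for the real ones.
ds = load_dataset("twinviews-13k", split="train")

for pair in ds.select(range(3)):
    # Each record is assumed to hold one topic-matched statement pair.
    print(pair["topic"])
    print("  left :", pair["left"])
    print("  right:", pair["right"])
```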
We introduce the Chinese Image Implication Understanding Benchmark (CII-Bench), a new benchmark measuring the higher-order perception, reasoning, and comprehension abilities of MLLMs when presented with complex Chinese implication images. These images, including abstract artworks, comics, and posters, carry visual implications that require an understanding of visual details and strong reasoning ability. CII-Bench reveals whether current MLLMs, leveraging their inherent comprehension abilities, can accurately decode the metaphors embedded in the complex and abstract information these images present.
The COFAR (COmmonsense and FActual Reasoning) dataset is a collection of images and text queries specifically designed to challenge and evaluate image search models that aim to go beyond simple visual matching. It focuses on the ability of these models to perform commonsense and factual reasoning, a capability currently lacking in most existing image search technology.
HalluEditBench is a comprehensive benchmark for evaluating how effectively knowledge editing methods correct real-world hallucinations. It features a rigorously constructed dataset spanning 9 domains and 26 topics, and evaluates methods across five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness.
Welcome to InfiniteBench, a cutting-edge benchmark tailored for evaluating the ability of language models to process, understand, and reason over super-long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against context lengths of 100k+ tokens, roughly 10 times longer than traditional datasets.
Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class.
The dataset has two parts. Part 1 (raw, unlabeled): 2 million replies from 17 Telegram channels. Part 2 (raw, labeled): 15,076 replies from the same 17 channels, each categorized as no threat, judicial threat, or non-judicial threat.
LLMs' lateral thinking capabilities remain under-explored and hard to measure, owing to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking in LLMs. The benchmark contains 975 graded situation puzzles across three difficulty levels.
WildDESED extends the original DESED dataset to reflect a variety of domestic scenarios by incorporating complex and unpredictable background noises. These additions make WildDESED a powerful resource for developing and evaluating noise-robust sound event detection (SED) systems.
SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed on top of Spider. The original repository can be found at https://github.com/defog-ai/sql-eval. Our evaluation methodology is more stringent, as it compares the execution results of each predicted SQL query against a single ground-truth SQL query.
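The execution-based comparison can be pictured as: run both queries and check that they return the same rows. Below is a minimal sketch of that idea, not Defog's actual harness; it assumes a local PostgreSQL database named `sql_eval` and compares rows as unordered multisets.

```python
from collections import Counter

import psycopg2

def execution_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows."""
    with conn.cursor() as cur:
        try:
            cur.execute(predicted_sql)
            pred_rows = cur.fetchall()
        except psycopg2.Error:
            conn.rollback()  # the predicted query failed to execute
            return False
        cur.execute(gold_sql)
        gold_rows = cur.fetchall()
    # Counter makes the comparison order-insensitive.
    return Counter(pred_rows) == Counter(gold_rows)

conn = psycopg2.connect("dbname=sql_eval")  # assumed connection string
print(execution_match(conn, "SELECT 1;", "SELECT 1;"))
```

A production harness would also need to normalize column order and handle timeouts; the sketch only captures the core row-set comparison.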
We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that accounts for the distribution of object classes within a single image at test time and uses visual referring prompts to eliminate ambiguity. ROPE defines four instruction settings. In a single turn of prompting without format enforcement, the model is probed to recognize the 5 objects referred to by the visual prompts (a) one at a time in the single-object setting, or (b) concurrently in the multi-object setting. With format enforcement, the model must follow a template and decode only the object tokens for each of the five objects, (c) without output manipulation in student forcing, or (d) with all previously generated object tokens replaced by the ground-truth classes in teacher forcing. A rough sketch of how these settings differ appears below.
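The sketch below illustrates how the four settings differ in prompt construction; the prompt wording, the `<obj_k>` markers, and the `decode_object_token` helper are all hypothetical, not ROPE's actual interface.

```python
# Visual referring prompts are abbreviated here as textual markers.
objects = [f"<obj_{k}>" for k in range(1, 6)]

# (a) single-object: one question per referred object
single_object = [f"What is the object marked {o}? Answer with its class name." for o in objects]

# (b) multi-object: all five objects queried in one turn
multi_object = "List the class of each object marked " + ", ".join(objects) + "."

def forced_decoding(model, image, gold_classes=None):
    """(c) student forcing: each slot keeps the model's own prediction;
    (d) teacher forcing: each slot is overwritten with the ground-truth
    class before the next object token is decoded."""
    answers = []
    for k in range(5):
        pred = model.decode_object_token(image, answers)  # hypothetical helper
        answers.append(gold_classes[k] if gold_classes is not None else pred)
    return answers
```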
MolParser-7M is a large-scale OCSR (optical chemical structure recognition) dataset proposed in the paper “MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild”. It contains nearly 8 million paired image-SMILES samples. Note that each image caption uses the extended-SMILES format proposed in the paper.
DISC-Law-SFT comprises two subsets, DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet. The former aims to introduce legal reasoning abilities to the LLM, while the latter helps enhance the model's capability to utilize external legal knowledge.
40 personalized concepts
GEOBench-VLM is a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. The benchmark features over 10,000 manually verified instructions and covers diverse variations in visual conditions, object type, and scale.
The Underwater Trash Detection Dataset is a custom-annotated dataset designed to address the challenges of underwater trash detection caused by varying environmental features. Publicly available datasets alone are insufficient for training deep learning models due to domain-specific variation in underwater conditions. This dataset offers a cumulative, self-annotated collection of underwater images for detecting and classifying trash, providing a strong foundation for deep learning research and benchmark testing.
V2VBench is a comprehensive benchmark designed to evaluate video editing methods. It consists of:
- 50 standardized videos across 5 categories,
- 3 editing prompts per video, encompassing 4 editing tasks, and
- 8 evaluation metrics to assess the quality of edited videos.
In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task contains three subtasks, each comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that measures both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations of, and potential areas for improvement in, text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction-tuning dataset proposed to address the challenges raised by TOMG-Bench, Llama3.1-8B outperforms all open-source general LLMs, even surpassing GPT-3.5-turbo.
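TOMG-Bench's automated evaluation system is not reproduced here, but one typical ingredient of quality scoring for generated molecules is an RDKit validity check. A minimal sketch, assuming the model outputs are plain SMILES strings:

```python
from rdkit import Chem

def validity_rate(smiles_list):
    """Fraction of generated SMILES that RDKit can parse into a molecule."""
    valid = sum(Chem.MolFromSmiles(s) is not None for s in smiles_list)
    return valid / len(smiles_list)

# Toy examples: three parseable molecules and one invalid string.
generated = ["CCO", "c1ccccc1", "C1=CC=CN1", "not-a-molecule"]
print(f"validity: {validity_rate(generated):.2f}")  # 0.75
```

Measuring accuracy against a target specification (e.g., whether a MolCustom request was actually satisfied) would require task-specific checks on top of this.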