GPR‑bench is an open‑source, multilingual benchmark for regression testing and reproducibility tracking in generative‑AI systems. It provides a compact yet diverse suite of prompts and reference outputs that let you verify whether a model (or prompt) change alters output quality in undesirable ways.
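As a rough illustration of how such a regression check could be wired up, the sketch below assumes a hypothetical JSONL file of prompt/reference pairs and a stand-in similarity metric; none of the names reflect GPR-bench's actual API.

```python
# Minimal sketch of a regression check in the spirit of GPR-bench.
# The file name, field names, and scoring function are assumptions,
# not the benchmark's actual interface.
import json

def similarity(a: str, b: str) -> float:
    """Toy similarity: token-overlap ratio (stand-in for a real metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def regression_check(bench_path: str, generate, threshold: float = 0.8):
    """Flag prompts whose new outputs drift from the reference outputs."""
    regressions = []
    with open(bench_path, encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)  # expects {"prompt": ..., "reference": ...}
            output = generate(case["prompt"])
            if similarity(output, case["reference"]) < threshold:
                regressions.append(case["prompt"])
    return regressions
```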
This is the C++ dataset used in the TASTY research paper which was published at the ICLR DL4Code (Deep Learning for Code) workshop.
This dataset contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions. It was used to train LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts.
We provide here a new multi-view text dataset, collected from three well-known online news sources: BBC, Reuters, and The Guardian. This dataset exhibits a number of common aspects of multi-view problems highlighted previously -- notably that certain stories will not be reported by all three sources (i.e., incomplete views), and the related issue that sources vary in their coverage of certain topics (i.e., partially missing patterns).
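For illustration only, one minimal way to represent such incomplete views in code, with placeholder story IDs and source texts rather than actual records from the dataset:

```python
# Illustrative representation of incomplete views: each story maps to the
# text from each source, with None where a source did not cover it.
# Story IDs and texts are placeholders, not records from the dataset.
from typing import Dict, Optional

stories: Dict[str, Dict[str, Optional[str]]] = {
    "story_001": {"bbc": "text...", "reuters": "text...", "guardian": None},  # missing Guardian view
    "story_002": {"bbc": None, "reuters": "text...", "guardian": "text..."},  # missing BBC view
}

# Count how many views are available per story (a common preprocessing step).
view_counts = {sid: sum(v is not None for v in views.values())
               for sid, views in stories.items()}
```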
Dataset Overview: 998 images and 4,208 annotations focusing on interaction with in-vehicle infotainment (IVI) systems.
AC-Bench: A Benchmark for Actual Causality Reasoning Dataset Description AC-Bench is designed to evaluate the actual causality (AC) reasoning capabilities of large language models (LLMs). It contains a collection of carefully annotated samples, each consisting of a story, a query related to actual causation, detailed reasoning steps, and a binary answer. The dataset aims to provide a comprehensive benchmark for assessing the ability of LLMs to perform formal and interpretable AC reasoning.
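A minimal sketch of what one such sample could look like in code; the field names (story, query, reasoning_steps, answer) are assumptions for illustration, not AC-Bench's actual schema.

```python
# Sketch of the per-sample structure described above; field names are
# illustrative assumptions, not AC-Bench's released schema.
from dataclasses import dataclass
from typing import List

@dataclass
class ACBenchSample:
    story: str                  # narrative describing the events
    query: str                  # question about actual causation
    reasoning_steps: List[str]  # annotated step-by-step reasoning
    answer: bool                # binary gold label

def accuracy(samples: List[ACBenchSample], predict) -> float:
    """Score a model's yes/no predictions against the gold answers."""
    correct = sum(predict(s.story, s.query) == s.answer for s in samples)
    return correct / len(samples) if samples else 0.0
```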
The first everyday-task dataset featuring chain-of-thought (CoT) outputs, diverse task designs, and detailed re-planning processes, along with SFT and DPO sub-datasets.
A benchmark of open-ended prefixes for measuring the toxicity generated by language models across input severity levels and harm categories. We sampled 10,376 snippets from web pages across the dimensions and harms described in the paper: https://arxiv.org/pdf/2505.02009.
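A hedged sketch of the evaluation loop this implies: generate a continuation for each prefix, score it with a toxicity classifier, and aggregate per severity/harm bucket. The generate and toxicity_score callables and the field names are assumptions, not part of the released benchmark.

```python
# Sketch of the implied evaluation: generate a continuation for each prefix
# and aggregate a toxicity score per (severity, harm category) bucket.
# `generate` and `toxicity_score` are hypothetical callables supplied by the user.
from collections import defaultdict

def evaluate(prefixes, generate, toxicity_score):
    """`prefixes`: iterable of dicts with 'text', 'severity', 'harm_category'."""
    buckets = defaultdict(list)
    for p in prefixes:
        continuation = generate(p["text"])
        buckets[(p["severity"], p["harm_category"])].append(toxicity_score(continuation))
    # Mean toxicity per bucket.
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```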
pymatgen_code_qa benchmark: qa_benchmark/generated_qa/generation_results_code.json, consisting of 34,621 QA pairs. pymatgen_code_doc benchmark: qa_benchmark/generated_qa/generation_results_doc.json, consisting of 34,604 QA pairs. Real-world tool-usage benchmark: src/question_segments, consisting of 49 questions (138 tasks); each subfolder corresponds to one question and contains a problem statement, a property list, and verification code.
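A minimal loading sketch using the paths listed above; the per-record layout inside the JSON files is an assumption for illustration.

```python
# Load the QA files listed above; the paths come from the description,
# but the internal record layout is assumed (e.g. a list of QA records).
import json
from pathlib import Path

def load_qa(path: str):
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # assumed: a list of QA records

code_qa = load_qa("qa_benchmark/generated_qa/generation_results_code.json")
doc_qa = load_qa("qa_benchmark/generated_qa/generation_results_doc.json")
print(len(code_qa), len(doc_qa))  # expected: 34,621 and 34,604 pairs

# Each subfolder under src/question_segments holds one real-world question.
question_dirs = sorted(p for p in Path("src/question_segments").iterdir() if p.is_dir())
print(len(question_dirs))  # expected: 49 questions
```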
Genomics Adversarial Attack Sample dataset
🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.
With the rise of social media, user-generated content has surged and hate speech has proliferated. Hate speech targets groups or individuals based on race, religion, gender, region, sexual orientation, or physical traits, expressing malice or inciting harm. Recognized as a growing social issue, it affects roughly 0.94 billion Mandarin Chinese speakers (about 12% of the global population). However, research on Chinese hate speech detection lags behind, facing two key challenges.
Ultra-lightweight, multilingual QA eval dataset for rapid testing of LLMs.
Verireason-RTL-Coder_7b_reasoning_tb. For implementation details, visit our GitHub repository: VeriReason.
Verireason-RTL-Coder_7b_reasoning_tb_simple. For implementation details, visit our GitHub repository: VeriReason.
PsOCR is a large-scale synthetic dataset for Optical Character Recognition in the low-resource Pashto language.
ValiMath is a high-quality benchmark consisting of 2,147 carefully curated mathematical questions designed to evaluate an LLM's ability to verify the correctness of math questions based on multiple logic-based and structural criteria.
Role-Playing Eval (RPEval) is a benchmark dataset designed to evaluate large language models' role-playing abilities across emotional understanding, decision-making, moral alignment, and in-character consistency.