Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

GPR-bench (General‑Purpose Reproducibility Benchmark)

GPR‑bench is an open‑source, multilingual benchmark for regression testing and reproducibility tracking in generative‑AI systems. It provides a compact yet diverse suite of prompts and reference outputs that let you verify whether a model (or prompt) change alters output quality in undesirable ways.

1 paper · 0 benchmarks · Texts
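The regression-testing workflow GPR-bench supports can be sketched as follows. This is a hypothetical illustration, not the actual GPR-bench harness or schema: the case format, the similarity scorer, and the threshold are all assumptions made for the example.

```python
# Hypothetical sketch of benchmark-driven regression testing: each case
# pairs a prompt with a reference output, and a model change "fails" a
# case when its new output drifts too far from the reference.
from difflib import SequenceMatcher


def score(candidate: str, reference: str) -> float:
    """Crude character-level similarity in [0, 1] between two outputs."""
    return SequenceMatcher(None, candidate, reference).ratio()


def regression_check(cases, generate, threshold=0.8):
    """Return the prompts whose generated output drifts from the reference."""
    failures = []
    for case in cases:
        if score(generate(case["prompt"]), case["reference"]) < threshold:
            failures.append(case["prompt"])
    return failures


# Usage with a tiny stand-in "model" that answers one case wrong:
cases = [
    {"prompt": "2+2?", "reference": "4"},
    {"prompt": "Capital of France?", "reference": "Paris"},
]


def stand_in_model(prompt):
    return {"2+2?": "4", "Capital of France?": "Lyon"}[prompt]


print(regression_check(cases, stand_in_model))  # → ['Capital of France?']
```

A real harness would use a task-appropriate scorer (e.g. embedding similarity or an LLM judge) rather than character overlap, but the pass/fail gating logic is the same.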

C++ Codes and Complexities (TASTY)

This is the C++ dataset used in the TASTY research paper which was published at the ICLR DL4Code (Deep Learning for Code) workshop.

1 paper · 0 benchmarks · Texts

StableText2Lego

This dataset contains over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions. It was used to train LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts.

1 paper · 0 benchmarks · 3D, Physics, Texts

3 Sources (3 Sources Dataset)

We provide a new multi-view text dataset collected from three well-known online news sources: BBC, Reuters, and The Guardian. It exhibits several common aspects of multi-view problems highlighted previously, notably that certain stories are not reported by all three sources (i.e., incomplete views), and the related issue that sources vary in their coverage of certain topics (i.e., partially missing patterns).

1 paper · 0 benchmarks · Texts

AutomotiveUI-Bench-4K

998 images and 4,208 annotations focusing on interaction with in-vehicle infotainment (IVI) systems.

1 paper · 0 benchmarks · Images, Texts

AC-Bench

AC-Bench is designed to evaluate the actual causality (AC) reasoning capabilities of large language models (LLMs). It contains a collection of carefully annotated samples, each consisting of a story, a query related to actual causation, detailed reasoning steps, and a binary answer. The dataset aims to provide a comprehensive benchmark for assessing the ability of LLMs to perform formal and interpretable AC reasoning.

1 paper · 0 benchmarks · Texts

EMMOE-100

The first everyday-task dataset featuring chain-of-thought (CoT) outputs, diverse task designs, and detailed re-planning processes, along with SFT and DPO sub-datasets.

1 paper · 0 benchmarks · Images, Texts

HAVOC (Harmful Abstractions and Violations in Open Completions Benchmark)

HAVOC measures the toxicity generated by language models across input severity levels and harm categories via a new benchmark of open-ended prefixes. It comprises 10,376 snippets sampled from web pages across the dimensions and harms described in the paper: https://arxiv.org/pdf/2505.02009.

1 paper · 0 benchmarks · Texts

MatTools

MatTools comprises three benchmarks:

  • pymatgen_code_qa (qa_benchmark/generated_qa/generation_results_code.json): 34,621 QA pairs.
  • pymatgen_code_doc (qa_benchmark/generated_qa/generation_results_doc.json): 34,604 QA pairs.
  • Real-world tool usage (src/question_segments): 49 questions (138 tasks); each subfolder contains one question with a problem statement, property list, and verification code.

1 paper · 0 benchmarks · Texts
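The QA-pair files listed above could be consumed along these lines. This is a hedged sketch: the JSON layout assumed here (a list of records with "question" and "answer" fields) is an illustrative guess, not the documented MatTools schema, so check the repository before relying on it.

```python
# Hypothetical loader for a QA-benchmark JSON file such as
# qa_benchmark/generated_qa/generation_results_code.json.
# Assumed (unverified) schema: [{"question": str, "answer": str}, ...]
import json
import os
import tempfile


def load_qa_pairs(path):
    """Load (question, answer) tuples from a benchmark JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [(item["question"], item["answer"]) for item in data]


# Demo on a tiny stand-in file shaped like the assumed schema:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(
        [{"question": "What is pymatgen?", "answer": "A materials library."}], f
    )
    demo_path = f.name

pairs = load_qa_pairs(demo_path)
print(len(pairs))  # → 1
os.unlink(demo_path)
```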

GenoAdv

A dataset of adversarial attack samples for genomics models.

1 paper · 0 benchmarks · Texts

migration-bench-java-full

🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.

1 paper · 0 benchmarks · Texts

migration-bench-java-selected

🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.

1 paper · 0 benchmarks · Texts

migration-bench-java-utg

🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.

1 paper · 0 benchmarks · Texts

STATE ToxiCN

With the rise of social media, user-generated content has surged and hate speech has proliferated. Hate speech targets groups or individuals based on race, religion, gender, region, sexual orientation, or physical traits, expressing malice or inciting harm. Recognized as a growing social issue, it affects roughly 941 million Mandarin Chinese speakers (12% of the global population). However, research on Chinese hate speech detection lags behind, facing two key challenges.

1 paper · 0 benchmarks · Texts

TQB++ (Tiny QA Benchmark++)

Ultra-lightweight, multilingual QA evaluation dataset for rapid testing of LLMs.

1 paper · 0 benchmarks · Texts

Verireason-RTL-Coder_7b_reasoning_tb (VeriReason Verilog Dataset with Reasoning, Testbench, and Simulation Results)

For implementation details, visit the GitHub repository: VeriReason.

1 paper · 0 benchmarks · Texts

Verireason-RTL-Coder_7b_reasoning_tb_simple (Simple Problems of VeriReason Verilog Dataset with Reasoning, Testbench, and Simulation Results)

For implementation details, visit the GitHub repository: VeriReason.

1 paper · 0 benchmarks · Texts

PsOCR (Pashto OCR Dataset)

PsOCR is a large-scale synthetic dataset for Optical Character Recognition in the low-resource Pashto language.

1 paper · 0 benchmarks · Images, Tabular, Texts

ValiMath

ValiMath is a high-quality benchmark consisting of 2,147 carefully curated mathematical questions designed to evaluate an LLM's ability to verify the correctness of math questions based on multiple logic-based and structural criteria.

1 paper · 0 benchmarks · Texts

RPEval (Role-Playing Evaluation Dataset)

Role-Playing Eval (RPEval) is a benchmark dataset designed to evaluate large language models' role-playing abilities across emotional understanding, decision-making, moral alignment, and in-character consistency.

1 paper · 0 benchmarks · Texts
Page 150 of 158