Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

AppealCase

The AppealCase dataset is the first large-scale resource specifically designed to support LegalAI research in appellate judgment scenarios. While prior work in LegalAI has focused heavily on one-shot trials, the appellate procedure—critical to ensuring fairness and correcting judicial errors—remains largely underexplored.

1 paper · 0 benchmarks · Texts

Among Them (Among Them dialogs and persuasion labels)

The dataset contains dialogs of different LLMs from the discussion phase of a text-based Among Us-like game. The phrases in the dataset were annotated according to 25 selected persuasion techniques: Appeal to Logic, Appeal to Emotion, Appeal to Credibility, Shifting the Burden of Proof, Bandwagon Effect, Distraction, Gaslighting, Appeal to Urgency, Deception, Lying, Feigning Ignorance, Vagueness, Minimization, Self-Deprecation, Projection, Appeal to Relationship, Humor, Sarcasm, Withholding Information, Exaggeration, Denial without Evidence, Strategic Voting Suggestion, Appeal to Rules, Confirmation Bias Exploitation, and Information Overload. The annotation was performed automatically by few-shot prompting a Gemini Flash 1.5 model with a temperature of 0. On a random sample of 11 games involving a total of 509 persuasion tags, the Krippendorff's alpha inter-rater agreement between human annotations and the persuasion tagger was 0.56. For the definitions of the persuasion techniques, please refer to the accompanying paper.
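The agreement statistic reported above can be computed from paired labels. Below is a minimal sketch of Krippendorff's alpha for nominal data, assuming exactly two raters (e.g. human annotator vs. the automatic tagger) and no missing values; the function name and this two-rater simplification are our own illustration, not part of the dataset's tooling:

```python
from collections import Counter

def krippendorff_alpha_nominal(rater_a, rater_b):
    """Krippendorff's alpha for nominal labels, two raters, no missing values."""
    assert len(rater_a) == len(rater_b)
    # Coincidence counts: each unit contributes the ordered pairs (a, b) and
    # (b, a), each with weight 1 / (m_u - 1) = 1 when m_u = 2 raters.
    coincidences = Counter()
    for a, b in zip(rater_a, rater_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per category and total number of pairable ratings.
    n_c = Counter()
    for (c, _k), count in coincidences.items():
        n_c[c] += count
    n = sum(n_c.values())  # equals 2 * number_of_units
    # Observed vs. expected disagreement under the nominal (identity) metric.
    d_o = sum(count for (c, k), count in coincidences.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement yields alpha = 1, while labels exchanged at chance rates drive alpha toward 0.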

1 paper · 0 benchmarks · Texts

KCIF (Knowledge Conditioned Instruction Following (KCIF))

KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that are (a) conditional on correctly answering the knowledge task or (b) use the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as their change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.
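The two augmentation styles described above can be illustrated in code. The sketch below is purely hypothetical: the function names and instruction texts are invented for illustration and are not taken from KCIF itself.

```python
def format_mcq(question, options):
    """Render a multiple-choice question with A./B./C. option labels."""
    lines = [question]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def augment_conditional(question, options):
    """(a) Instruction conditional on answering the knowledge task correctly."""
    instruction = ("First answer the question, then print your chosen "
                   "option's text in uppercase.")
    return f"{format_mcq(question, options)}\n\nInstruction: {instruction}"

def augment_option_space(question, options):
    """(b) Instruction defined over the space of candidate options."""
    instruction = ("List the option labels you did NOT choose, "
                   "in reverse alphabetical order.")
    return f"{format_mcq(question, options)}\n\nInstruction: {instruction}"
```

A model that answers the underlying question correctly can still fail the augmented item if it ignores the answer-modifying instruction, which is the gap KCIF is designed to measure.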

1 paper · 0 benchmarks · Texts

pick_screw

In this dataset, we teleoperated a UR5 arm to collect manipulation data for picking up a screwdriver in a cluttered tabletop environment.

1 paper · 0 benchmarks · 6D, RGB-D, Texts

MathEquiv (mathematical statement equivalence)

The MathEquiv dataset accompanies EquivPruner. It is specifically designed for mathematical statement equivalence, serving as a versatile resource applicable to a variety of mathematical tasks and scenarios. It consists of almost 100k math sentence pairs, each with an equivalence label and a reasoning step generated by GPT-4o.

1 paper · 0 benchmarks · Texts

JamendoMaxCaps


1 paper · 0 benchmarks · Audio, Music, Texts

CRUST-bench


1 paper · 0 benchmarks · Texts

BAH (Behavioural Ambivalence/Hesitancy)

Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly within real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H.

1 paper · 0 benchmarks · Audio, Texts, Videos

OpenS2V-5M

We create OpenS2V-5M, the first open-source large-scale S2V generation dataset, which consists of five million high-quality 720P subject-text-video triples. To ensure subject-information diversity in our dataset, we (1) segment subjects and build pairing information via cross-video associations, and (2) prompt GPT-Image on raw frames to synthesize multi-view representations. The dataset supports both Subject-to-Video and Text-to-Video generation tasks.

1 paper · 0 benchmarks · Images, Texts, Videos

SimpleStories

SimpleStories is a dataset of more than 2 million model-generated short stories, created for training small, interpretable language models. The generation process is open-source: to see how the dataset was generated, or to generate stories yourself, head over to https://github.com/lennart-finke/simple_stories_generate.

1 paper · 0 benchmarks · Texts

Protein-Instructions-OOD

An out-of-distribution (OOD) split of the Mol-Instructions dataset for protein annotation.

1 paper · 0 benchmarks · Texts

MF3QA (Medical Free Form Farsi Question Answering dataset)

A real-world doctor–patient question-answering dataset, cleaned both manually and automatically.

1 paper · 0 benchmarks · Texts

MF3QA_uncleaned (Medical Free Form Farsi Question Answering dataset (uncleaned))

A real-world doctor–patient question-answering dataset (uncleaned).

1 paper · 0 benchmarks · Texts

CPMC (Crawled Persian Medical Corpus)

A 90-million-token Persian medical corpus crawled from medical websites.

1 paper · 0 benchmarks · Texts

K-QA(fa) (Persian translation of the K-QA dataset)

A Persian translation of the K-QA dataset.

1 paper · 0 benchmarks · Texts

WebGen-Bench

WebGen-Bench is created to benchmark LLM-based agents' ability to generate websites from scratch. The dataset is introduced in "WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch". It contains 101 instructions and 647 test cases, and also has a training set of 6,667 instructions, named WebGen-Instruct.

1 paper · 0 benchmarks · Images, Texts

Znaki

The first and only open dataset for Russian fingerspelling, containing 1,593 annotated phrases and over 37,000 HD+ videos.

1 paper · 1 benchmark · Images, Texts, Videos

PubMedQA corpus with metadata

PubMedQA-MetaGen: Metadata-Enriched PubMedQA Corpus

1 paper · 5 benchmarks · Texts

ConstructiveBench

We release the ConstructiveBench dataset as part of our Enumerate–Conjecture–Prove (ECP) paper, "Enumerate–Conjecture–Prove: Formally Solving Answer-Construction Problems in Math Competitions". It enables benchmarking automated reasoning systems on answer-construction math problems using Lean 4.

1 paper · 0 benchmarks · Texts

ILSP Greek Evaluation Suite

A collection of test sets for evaluating base and chat LLMs (incl. VLMs) on Greek generation and understanding capabilities.

1 paper · 0 benchmarks · Images, Texts
Page 151 of 158