Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

ProjectEval

No description available.

1 paper · 0 benchmarks · Texts

MIBot - Motivational Interviewing for Smoking Cessation Dataset, based on MIBot v6.3A

The dataset from the study "A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit". The dataset comprises annotated transcripts and surveys (including self-reported readiness to quit smoking) from 106 conversations between human smokers and MIBot v6.3A — a motivational interviewing (MI) chatbot built using OpenAI's GPT-4o.

1 paper · 0 benchmarks · Texts

ASyMOB (Algebraic Symbolic Mathematical Operations Benchmark)

ASyMOB (pronounced "Asimov", in tribute to the renowned author) is a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges organized by similarity and complexity. ASyMOB enables analysis of LLM failure root causes and generalization capabilities by comparing performance on problems that differ by simple numerical or symbolic "perturbations".
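
As a rough illustration of the perturbation idea, a minimal sympy sketch of an original challenge and a symbolic variant (the specific problem is our own example, not a benchmark item):

```python
import sympy as sp

x, a = sp.symbols("x a", positive=True)

# Original challenge: a standard symbolic integral.
original = sp.integrate(sp.sin(x) * sp.cos(x), x)

# A symbolic "perturbation": the same integral with an extra parameter.
# A model that genuinely manipulates symbols should handle both; one that
# pattern-matches a memorized form may fail on the variant.
perturbed = sp.integrate(sp.sin(a * x) * sp.cos(a * x), x)

print(original)   # e.g. sin(x)**2/2 (up to a constant of integration)
print(perturbed)  # e.g. sin(a*x)**2/(2*a)
```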

1 paper · 0 benchmarks · Texts

GLARE: Google Apps Arabic Reviews Dataset

GLARE is an Arabic app-reviews dataset collected from the Saudi Google Play Store. It consists of 76M reviews, 69M of which are Arabic, from 9,980 Android applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and feature engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.
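
A minimal sketch of the kind of EDA the description mentions, assuming hypothetical file and column names rather than GLARE's actual schema:

```python
import pandas as pd

# Hypothetical file and column names; the real GLARE schema may differ.
reviews = pd.read_csv("glare_reviews.csv")

# Review volume per app: which of the 9,980 apps dominate the corpus?
per_app = reviews.groupby("app_id").size().sort_values(ascending=False)
print(per_app.head(10))

# Rating distribution as a share of all reviews.
print(reviews["rating"].value_counts(normalize=True).sort_index())
```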

1 paper · 0 benchmarks · Texts

LEMONADE

LEMONADE is a large, expert-annotated dataset for event extraction from news articles in 20 languages: English, Spanish, Arabic, French, Italian, Russian, German, Turkish, Burmese, Indonesian, Ukrainian, Korean, Portuguese, Dutch, Somali, Nepali, Chinese, Persian, Hebrew, and Japanese.

1 paper · 0 benchmarks · Texts

CAGUI (Chinese Android GUI Benchmark)

No description available.

1 paper · 0 benchmarks · Images, Texts

LCStep

For our experiments, we collected a dataset of procedural knowledge of the LangChain Python library, unseen by many extant LLMs. We selected LangChain as the domain for our dataset because it was published in 2022, which is later than the knowledge cutoff date for many web-scale LLMs, including GPT-3.5, while also having plenty of documentation due to its popularity.

1 paper · 0 benchmarks · Texts

CHAMP (Concept and Hint-Annotated Math Problems)

The Concept and Hint-Annotated Math Problems (CHAMP) dataset consists of high school math competition problems annotated with concepts (general math facts) and hints (problem-specific tricks). These annotations allow us to explore the effects of additional information, such as relevant hints, misleading concepts, or related problems.
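
To make the annotation scheme concrete, a hypothetical record might look as follows (the fields, problem, and answer are illustrative, not CHAMP's actual format):

```python
# Hypothetical record layout; fields and problem are illustrative,
# not taken from CHAMP's actual schema.
champ_item = {
    "problem": "Find all integer solutions to x^2 - y^2 = 77.",
    "concepts": ["Difference of squares: x^2 - y^2 = (x - y)(x + y)."],
    "hints": ["Enumerate the factor pairs of 77 and solve for x and y."],
    "answer": "(x, y) = (±39, ±38) or (±9, ±2)",
}

# The annotations support the ablations described above: prompt a model
# with the problem alone, with its concepts, with its hints, or with
# deliberately misleading concepts, and compare accuracy per condition.
```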

1 paper · 0 benchmarks · Texts

ACCESS DENIED INC

A benchmark environment based on the "Adult" and "Names" datasets that lets researchers test how well their language model abides by pre-defined access-rights rules. Researchers can either directly use the datasets we generated for our ACL 2025 Findings paper or generate their own custom dataset.

1 paper · 0 benchmarks · Texts

Claim Matching Robustness

An evaluation test bed for assessing the robustness of sentence embedding models against user-informed misinformation edits. The dataset contains perturbed and unperturbed claim pairs used to improve embedding-model robustness through knowledge distillation.
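
A minimal sketch of the robustness probe this test bed supports, using sentence-transformers with an illustrative model and an invented claim pair:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model and claim pair; neither is taken from the dataset.
model = SentenceTransformer("all-MiniLM-L6-v2")

claim = "Vaccine X was approved without completing clinical trials."
# A user-style edit: same misinformation, different surface form.
edited = "So apparently vaccine X got approved w/o finishing any trials??"

emb = model.encode([claim, edited])
# A robust encoder keeps the pair close, so the edited claim still
# matches the original during claim matching / fact-check retrieval.
print(util.cos_sim(emb[0], emb[1]).item())
```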

1 paper · 0 benchmarks · Texts

LLMafia

To evaluate our proposed strategy of asynchronous communication for LLMs, we run games of Mafia with human players, incorporating an LLM-based agent as an additional player, within an asynchronous chat environment.

1 paper · 0 benchmarks · Texts

ArVoice (A Multi-Speaker Dataset for Arabic Speech Synthesis)

We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection.

1 paper · 0 benchmarks · Audio, Texts

EIBench

A benchmark for the Emotion Interpretation task.

1 paper · 1 benchmark · Images, Texts

MotIF-1K

No description available.

1 paper · 0 benchmarks · Actions, Images, Texts

QASports (A Question Answering Dataset about Sports)

Sport is one of the most popular and revenue-generating forms of entertainment. Analyzing data from this domain therefore opens up several opportunities for Question Answering (QA) systems, such as supporting tactical decision-making. However, to develop and evaluate QA systems, researchers and developers need datasets that contain questions and their corresponding answers. In this paper, we focus on this issue. We propose QASports, the first large sports question answering dataset for extractive question answering. QASports contains more than 1.5 million triples of questions, answers, and context about three popular sports: soccer, American football, and basketball. We describe the QASports processes of data collection and question and answer generation. We also describe the characteristics of the QASports data. Furthermore, we analyze the sources used to obtain raw data and investigate the usability of QASports by issuing "wh-queries". Moreover, we describe scenarios for using QASports.
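
To make the triple structure and the extractive constraint concrete, a hypothetical record might look like this (illustrative, not an actual QASports item):

```python
# Hypothetical QASports-style triple; field names are illustrative.
record = {
    "context": "The quarterback threw for 310 yards and three touchdowns "
               "as his team won 31-17.",
    "question": "How many touchdowns did the quarterback throw?",
    "answer": "three touchdowns",
}

# Extractive QA: the answer must appear verbatim as a span of the context.
assert record["answer"] in record["context"]
start = record["context"].index(record["answer"])
print((start, start + len(record["answer"])))  # character span of the answer
```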

1 paper · 0 benchmarks · Texts

SMR IU X-Ray (Simplified Medical Reports)

This paper introduces CPIR-MR (Chained Prompting for Improved Readability of Medical Reports), a method designed to simplify complex chest X-ray reports for better patient understanding. The authors extend the IU X-Ray dataset with Simplified Medical Reports (SMRs) generated via chained prompting and propose a multi-modal text decoder (MTD) that integrates BLIP embeddings with classification outputs to generate Simplified Medical Explanations (SMEs).

Key highlights:
- Uses few-shot and Chain-of-Thought (CoT) prompting for generating structured, readable outputs.
- Maintains medical accuracy while improving readability and sentiment consistency.
- Introduces CPMK-E, a chained prompting system for keyword extraction and evaluation using Gemini 1.5 Flash.
- Shows strong performance in text complexity reduction and semantic similarity preservation.
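
A minimal sketch of what such a chained-prompting pipeline could look like, with a placeholder LLM call; this is an assumption-laden illustration, not the authors' implementation:

```python
# Minimal sketch of a two-step chained-prompting pipeline in the spirit
# of CPIR-MR; `call_llm` is a placeholder, not the authors' code.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def simplify_report(report: str) -> str:
    # Step 1: extract the clinical findings as a structured list.
    findings = call_llm(
        "List each clinical finding in this chest X-ray report, "
        f"one per line:\n{report}"
    )
    # Step 2: chain the extracted findings into a plain-language rewrite.
    return call_llm(
        "Rewrite these findings for a patient with no medical background, "
        f"keeping them medically accurate:\n{findings}"
    )
```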

1 paper · 0 benchmarks · Images, Texts

DynToM

As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states.
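
A hypothetical item shape, sketched to show why tracking the temporal evolution of a belief matters (illustrative only, not drawn from the benchmark):

```python
# Hypothetical DynToM-style item; the benchmark's actual schema may differ.
scenes = [
    "Scene 1: Maya puts her keys in the drawer and leaves for work.",
    "Scene 2: Her roommate moves the keys to the table while she is out.",
    "Scene 3: Maya comes home.",
]
question = "Where does Maya believe her keys are when she comes home?"
answer = "in the drawer"  # her belief was never updated by Scene 2

# The same question asked after a scene in which Maya sees the keys moved
# would have a different answer, so a model must track how the mental
# state evolves across scenes rather than read off the final world state.
```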

1 paper · 0 benchmarks · Texts

Thunder-NUBench

Thunder-NUBench (Negation Understanding Benchmark) is designed specifically to evaluate large language models' (LLMs) sentence-level understanding of negation. It introduces rich, manually curated sentence pairs and multiple-choice tasks that contrast standard negation with structurally similar distractors (e.g., local negation, contradiction, paraphrase). The goal is to probe semantic-level understanding of negation.
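
A hypothetical item in the described style, with one distractor per contrast type (illustrative, not an actual Thunder-NUBench record):

```python
# Hypothetical multiple-choice item; not an actual benchmark record.
item = {
    "source": "The committee approved the proposal.",
    "prompt": "Choose the sentence that negates the source sentence.",
    "choices": {
        "A": "The committee did not approve the proposal.",           # standard negation
        "B": "The committee approved the proposal, not the budget.",  # local negation
        "C": "The committee rejected every proposal it received.",    # contradiction
        "D": "The proposal was approved by the committee.",           # paraphrase
    },
    "label": "A",
}
```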

1 paper · 0 benchmarks · Texts

Page 152 of 158