3,148 machine learning datasets
This causal reasoning dataset is generated with the Causal Reasoning in Closed Daily Activities (COLD) framework, which evaluates large language models (LLMs) on causal reasoning within real-world, everyday activities. The dataset provides causal questions that simulate common activities such as shopping, baking a cake, riding a bus, planting a tree, and going on a train ride. With approximately 9 million causal queries, COLD challenges LLMs to understand and reason about causal relationships between events that are familiar and grounded in human experience.
PlainFact is a high-quality human-annotated dataset with fine-grained explanation (i.e., added information) annotations.
A collection of datasets and benchmarks for large-scale Performance Modeling with LLMs.
ELTEX-Blockchain: A Domain-Specific Dataset for Cybersecurity. 12k Synthetic Social Media Messages for Early Cyberattack Detection on Blockchain
https://arxiv.org/abs/2503.15222
The NCSE v2.0 is a digitized collection of six 19th-century English periodicals.
A publicly available corpus of nineteenth-century newspaper text focused on crime in London, derived from the Gale British Library Newspapers corpus parts 1 and 2. The corpus comprises 600 newspaper excerpts; for each excerpt it contains the original source image, the machine transcription of that image as found in the BLN, and a gold-standard manual transcription.
This dataset contains synthetically generated discussions and annotations produced exclusively by Large Language Model (LLM) agents. Discussions are conducted between randomly selected users, with an LLM moderator/facilitator following various facilitation strategies.
GroundCap is a novel grounded image captioning dataset derived from MovieNet, containing 52,350 movie frames with detailed grounded captions. The dataset uniquely features an ID-based system that maintains object identity throughout captions, enables tracking of object interactions, and grounds not only objects but also actions and locations in the scene.
LLM Health Benchmarks Dataset: a specialized resource for evaluating large language models (LLMs) across different medical specialties. It provides structured question-answer pairs designed to test the performance of AI models in understanding and generating domain-specific knowledge.
A benchmark that focuses on the sampling dilemma in long-video tasks. The LSDBench dataset is designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs based on hour-long videos, focusing on dense and short-duration actions with high Necessary Sampling Density (NSD).
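To build intuition for why dense, short-duration actions demand a high sampling density, here is a rough illustrative sketch (the function name and formula are my own, not LSDBench's exact definition of NSD): with uniform frame sampling, the gap between sampled frames must not exceed the action's duration for the action to be guaranteed a hit.

```python
import math

# Illustrative only: minimum number of uniformly sampled frames so that
# at least one frame lands inside any window of length `action_seconds`
# within a video of length `video_seconds`. Not LSDBench's actual metric.
def min_frames_to_cover(video_seconds: float, action_seconds: float) -> int:
    # Uniform sampling of N frames gives an inter-frame gap of video/N;
    # guaranteeing a hit requires gap <= action duration.
    return math.ceil(video_seconds / action_seconds)

# A 2-second action in a 1-hour video needs at least 1800 uniform samples.
print(min_frames_to_cover(3600, 2))  # -> 1800
```

This is why hour-long videos with short actions are expensive to sample naively, which is the dilemma the benchmark probes.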
GeoJEPAD is a multimodal dataset combining OpenStreetMap (OSM) data (attributes and geometries) with high-resolution aerial imagery from diverse urban areas. Sourced from NAIP and OSM, then processed, tiled, and cropped. Geometries and relations are represented as graphs with optional visibility edges.
Votranh DREAM_LOG is a poetic, philosophical dataset generated by the self-evolving AI system Votranh V8. Rather than typical NLP benchmarks, this dataset contains narrative dreams, emotional reflections, and meditative monologues written autonomously by the AI.
LLaVA-Rad MIMIC-CXR features more accurate section extractions from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression. However, these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement enabled the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, achieving improved performance on report generation tasks.
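For context, a minimal rule-based baseline of the kind the passage says often breaks looks like the sketch below (the regex, section names, and sample report are illustrative, not the actual MIMIC-CXR extraction pipeline); any report whose headers deviate from the expected pattern silently yields nothing, which is the brittleness GPT-4-based extraction addresses.

```python
import re

# Naive rule-based extractor: split a report on ALL-CAPS section headers.
# Illustrative only; real reports vary far more than this pattern allows.
SECTION_RE = re.compile(r"^(FINDINGS|IMPRESSION):\s*", re.MULTILINE)

def extract_sections(report: str) -> dict:
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        # Each section runs until the next header (or end of report).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1).lower()] = report[m.end():end].strip()
    return sections

report = "FINDINGS: No acute cardiopulmonary process.\nIMPRESSION: Normal chest radiograph."
print(extract_sections(report))
```

A report that writes "Impression -" or merges sections into one paragraph defeats this heuristic entirely, motivating the LLM-based approach.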
SUDO is a benchmark of 50 real-world malicious tasks designed to evaluate LLM-based computer agents in live desktop and web environments. It covers critical risk domains, including system security, content safety, societal harms, and privacy violations, based on the AirBench taxonomy. The dataset supports fine-grained evaluation using task-specific checklists and can be used to assess model misuse potential, build safer agents, or guide alignment research.
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
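In the spirit of comparing the time efficiency of two functionally equivalent candidate solutions (this harness is a generic sketch, not COFFE's actual evaluation protocol), one might time them like this:

```python
import time
import statistics

# Median wall-clock time over several runs; illustrative harness only.
def time_fn(fn, arg, repeats=5):
    runs = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        runs.append(time.perf_counter() - t0)
    return statistics.median(runs)

def sum_loop(n):          # naive candidate: O(n) additions
    total = 0
    for i in range(n):
        total += i
    return total

def sum_formula(n):       # efficient candidate: closed form
    return n * (n - 1) // 2

# Correctness check first, then efficiency comparison.
assert sum_loop(10_000) == sum_formula(10_000)
print(time_fn(sum_loop, 100_000) > time_fn(sum_formula, 100_000))
```

A real efficiency benchmark additionally needs controlled hardware, stressful inputs, and correctness filtering before timing, which is the kind of infrastructure COFFE provides.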
We propose the first standardized benchmark in multimodal continual learning for video data, defining protocols for training and metrics for evaluation. This standardized framework allows researchers to effectively compare models, driving advancements in AI systems that can continuously learn from diverse data sources.
Molecules represent tokens of the language of chemistry, which underlies not only chemistry itself, but also scientific fields that use chemical information such as pharmacy, material science, and molecular biology. Existing molecular information is distributed across text books, publications, and patents. To describe structural information (spatial arrangement of atoms), molecules are commonly drawn as 2D images in such documents, which makes Optical Chemical Structure Understanding (OCSU) play an important role in molecule-centric scientific discovery.