3,148 machine learning datasets
This causal reasoning dataset is generated with the Causal Reasoning in Closed Daily Activities (COLD) framework, which evaluates large language models (LLMs) on causal reasoning within real-world, everyday activities. The dataset provides causal questions that simulate common activities such as shopping, baking a cake, riding a bus, planting a tree, and going on a train ride. With approximately 9 million causal queries, COLD challenges LLMs to understand and reason about causal relationships between events that are familiar and grounded in human experience.
PlainFact is a high-quality human-annotated dataset with fine-grained explanation (i.e., added information) annotations.
A collection of datasets and benchmarks for large-scale Performance Modeling with LLMs.
ELTEX-Blockchain: A Domain-Specific Dataset for Cybersecurity. 12k Synthetic Social Media Messages for Early Cyberattack Detection on Blockchain
https://arxiv.org/abs/2503.15222
The NCSE v2.0 is a digitized collection of six 19th-century English periodicals.
A publicly available corpus of nineteenth-century newspaper text focused on crime in London, derived from the Gale British Library Newspapers corpus parts 1 and 2. The corpus comprises 600 newspaper excerpts; for each excerpt it contains the original source image, the machine transcription of that image as found in the BLN, and a gold-standard manual transcription.
This dataset contains synthetically generated discussions and annotations produced exclusively by Large Language Model (LLM) agents. Discussions are conducted between randomly selected users, with an LLM moderator/facilitator following various facilitation strategies.
GroundCap is a novel grounded image captioning dataset derived from MovieNet, containing 52,350 movie frames with detailed grounded captions. The dataset uniquely features an ID-based system that maintains object identity throughout captions, enables tracking of object interactions, and grounds not only objects but also actions and locations in the scene.
LLM Health Benchmarks Dataset: a specialized resource for evaluating large language models (LLMs) across different medical specialties. It provides structured question-answer pairs designed to test the performance of AI models in understanding and generating domain-specific knowledge.
A benchmark that focuses on the sampling dilemma in long-video tasks. The LSDBench dataset is designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs based on hour-long videos, focusing on dense and short-duration actions with high Necessary Sampling Density (NSD).
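To build intuition for why dense, short-duration actions demand a high sampling density, here is a rough illustrative sketch (the function name and formula are my own, not LSDBench's exact definition of NSD): with uniform frame sampling, the gap between sampled frames must not exceed the action's duration for the action to be guaranteed a hit.

```python
import math

# Illustrative only: minimum number of uniformly sampled frames so that
# at least one frame lands inside any window of length `action_seconds`
# within a video of length `video_seconds`. Not LSDBench's actual metric.
def min_frames_to_cover(video_seconds: float, action_seconds: float) -> int:
    # Uniform sampling of N frames gives an inter-frame gap of video/N;
    # guaranteeing a hit requires gap <= action duration.
    return math.ceil(video_seconds / action_seconds)

# A 2-second action in a 1-hour video needs at least 1800 uniform samples.
print(min_frames_to_cover(3600, 2))  # -> 1800
```

This is why hour-long videos with short actions are expensive to sample naively, which is the dilemma the benchmark probes.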
GeoJEPAD is a multimodal dataset combining OpenStreetMap (OSM) data (attributes and geometries) with high-resolution aerial imagery from diverse urban areas. Sourced from NAIP and OSM, then processed, tiled, and cropped. Geometries and relations are represented as graphs with optional visibility edges.
Votranh DREAM_LOG is a poetic, philosophical dataset generated by the self-evolving AI system Votranh V8. Rather than typical NLP benchmarks, this dataset contains narrative dreams, emotional reflections, and meditative monologues written autonomously by the AI.
LLaVA-Rad MIMIC-CXR features more accurate section extractions from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression. However, these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement enabled the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, achieving improved performance on report generation tasks.
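For context, a minimal rule-based baseline of the kind the passage says often breaks looks like the sketch below (the regex, section names, and sample report are illustrative, not the actual MIMIC-CXR extraction pipeline); any report whose headers deviate from the expected pattern silently yields nothing, which is the brittleness GPT-4-based extraction addresses.

```python
import re

# Naive rule-based extractor: split a report on ALL-CAPS section headers.
# Illustrative only; real reports vary far more than this pattern allows.
SECTION_RE = re.compile(r"^(FINDINGS|IMPRESSION):\s*", re.MULTILINE)

def extract_sections(report: str) -> dict:
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        # Each section runs until the next header (or end of report).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1).lower()] = report[m.end():end].strip()
    return sections

report = "FINDINGS: No acute cardiopulmonary process.\nIMPRESSION: Normal chest radiograph."
print(extract_sections(report))
```

A report that writes "Impression -" or merges sections into one paragraph defeats this heuristic entirely, motivating the LLM-based approach.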
SUDO is a benchmark of 50 real-world malicious tasks designed to evaluate LLM-based computer agents in live desktop and web environments. It covers critical risk domains, including system security, content safety, societal harms, and privacy violations, based on the AirBench taxonomy. The dataset supports fine-grained evaluation using task-specific checklists and can be used to assess model misuse potential, build safer agents, or guide alignment research.
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
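In the spirit of comparing the time efficiency of two functionally equivalent candidate solutions (this harness is a generic sketch, not COFFE's actual evaluation protocol), one might time them like this:

```python
import time
import statistics

# Median wall-clock time over several runs; illustrative harness only.
def time_fn(fn, arg, repeats=5):
    runs = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        runs.append(time.perf_counter() - t0)
    return statistics.median(runs)

def sum_loop(n):          # naive candidate: O(n) additions
    total = 0
    for i in range(n):
        total += i
    return total

def sum_formula(n):       # efficient candidate: closed form
    return n * (n - 1) // 2

# Correctness check first, then efficiency comparison.
assert sum_loop(10_000) == sum_formula(10_000)
print(time_fn(sum_loop, 100_000) > time_fn(sum_formula, 100_000))
```

A real efficiency benchmark additionally needs controlled hardware, stressful inputs, and correctness filtering before timing, which is the kind of infrastructure COFFE provides.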
We propose the first standardized benchmark in multimodal continual learning for video data, defining protocols for training and metrics for evaluation. This standardized framework allows researchers to effectively compare models, driving advancements in AI systems that can continuously learn from diverse data sources.
Molecules represent tokens of the language of chemistry, which underlies not only chemistry itself, but also scientific fields that use chemical information such as pharmacy, material science, and molecular biology. Existing molecular information is distributed across text books, publications, and patents. To describe structural information (spatial arrangement of atoms), molecules are commonly drawn as 2D images in such documents, which makes Optical Chemical Structure Understanding (OCSU) play an important role in molecule-centric scientific discovery.