This causal reasoning dataset is generated with the Causal Reasoning in Closed Daily Activities (COLD) framework, which evaluates large language models (LLMs) on their causal reasoning abilities within real-world, everyday activities. The dataset provides causal questions about five common activities: shopping, baking a cake, riding a bus, planting a tree, and going on a train ride. With approximately 9 million causal queries, the COLD dataset challenges LLMs to understand and reason about the causal relationships between events that are familiar and grounded in human experience.

Each query consists of a premise (an event) and a pair of choices representing candidate causes or effects of that event. The model must identify which of the two choices is the more plausible cause or effect of the premise, which tests its understanding of cause-and-effect relationships.

Key Features:
Activity Types: The dataset covers various everyday activities: shopping, cake baking, train ride, tree planting, and bus ride.
Causal Queries: Each query includes a premise and two possible causal events (choices). The model must decide which of the two choices is the more likely cause or effect.
Multiple-Choice Format: The queries can be formatted as multiple-choice questions (MCQA), where the model must choose between two options.
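
The query structure and MCQA formatting described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual schema or the authors' evaluation code: the `CausalQuery` class, its field names, the prompt wording, and the helpers `format_mcqa`, `parse_answer`, and `accuracy` are all assumptions introduced here, and the sample events are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class CausalQuery:
    """Hypothetical container for one query; field names are assumptions."""
    premise: str              # the given event
    choices: Tuple[str, str]  # two candidate causes or effects
    question: str             # "cause" or "effect"
    label: int                # index (0 or 1) of the more plausible choice


def format_mcqa(query: CausalQuery) -> str:
    """Render a query as a two-option multiple-choice prompt."""
    return (
        f"Premise: {query.premise}\n"
        f"Which of the following is the more plausible {query.question}?\n"
        f"A. {query.choices[0]}\n"
        f"B. {query.choices[1]}\n"
        "Answer:"
    )


def parse_answer(text: str) -> Optional[int]:
    """Map a model reply such as 'A', 'b.' or '(B)' to a choice index."""
    match = re.match(r"\s*\(?([AB])\)?(?:[\s.:)]|$)", text, re.IGNORECASE)
    if match is None:
        return None  # reply did not start with a recognizable option letter
    return 0 if match.group(1).upper() == "A" else 1


def accuracy(queries: Sequence[CausalQuery], replies: Sequence[str]) -> float:
    """Fraction of queries answered correctly; unparseable replies count as wrong."""
    if not queries:
        return 0.0
    correct = sum(
        parse_answer(reply) == query.label
        for query, reply in zip(queries, replies)
    )
    return correct / len(queries)


if __name__ == "__main__":
    # Made-up example in the spirit of the "train ride" activity.
    example = CausalQuery(
        premise="The passenger bought a ticket.",
        choices=("The passenger boarded the train.", "The passenger planted a tree."),
        question="effect",
        label=0,
    )
    print(format_mcqa(example))
    print(accuracy([example], ["A"]))
```

Counting unparseable replies as wrong is one possible design choice; an evaluation could instead report them separately.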

The dataset serves as a test of causal reasoning in NLP models, focusing on realistic, daily-life scenarios.

COLD: Causal Reasoning in Closed Daily Activities