
ToT is a benchmark for evaluating LLMs on temporal reasoning.

ToT is a dataset designed to assess the temporal reasoning capabilities of AI models. It comprises two key sections:

ToT-semantic: Measures the semantics and logic of time understanding.
ToT-arithmetic: Measures the ability to carry out time arithmetic operations.

Data Format

The ToT-semantic and ToT-semantic-large datasets contain the following fields:

question: Contains the text of the question.
graph_gen_algorithm: Contains the name of the graph generator algorithm used to generate the graph.
question_type: Corresponds to one of the 7 question types in the dataset.
sorting_type: Corresponds to the sorting type applied to the facts to order them.
prompt: Contains the full prompt text used to evaluate LLMs on the task.
label: Contains the ground truth answer to the question.

The ToT-arithmetic dataset contains the following fields:

question: Contains the text of the question.
question_type: Corresponds to one of the 7 question types in the dataset.
label: Contains the ground truth answer to the question.
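
As a rough illustration of the field layout described above, the following Python sketch checks whether a record carries the fields expected for each dataset. The field names come from the description above; the helper `missing_fields`, the constant names, and the placeholder record are hypothetical and are not part of the dataset or of any official loader.

```python
# Field names as listed in the data format description.
SEMANTIC_FIELDS = (
    "question",
    "graph_gen_algorithm",
    "question_type",
    "sorting_type",
    "prompt",
    "label",
)
ARITHMETIC_FIELDS = ("question", "question_type", "label")


def missing_fields(record, expected):
    """Return the expected field names that are absent from record, in order."""
    return [name for name in expected if name not in record]


# Placeholder record for illustration only; not an actual row from ToT.
example_arithmetic = {
    "question": "<question text>",
    "question_type": "<question type>",
    "label": "<ground truth answer>",
}

# A ToT-arithmetic style record has every arithmetic field.
print(missing_fields(example_arithmetic, ARITHMETIC_FIELDS))
# → []

# It lacks the three fields that only the semantic datasets carry.
print(missing_fields(example_arithmetic, SEMANTIC_FIELDS))
# → ['graph_gen_algorithm', 'sorting_type', 'prompt']
```

A check like this can be useful before evaluation: a ToT-semantic record supplies a ready-made prompt, whereas a ToT-arithmetic record only provides the question text.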

