Test of Time
ToT is a benchmark for evaluating LLMs on temporal reasoning.
The dataset comprises two key sections:
ToT-semantic: Measuring the semantics and logic of time understanding.
ToT-arithmetic: Measuring the ability to carry out time arithmetic operations.
Data Format
The ToT-semantic and ToT-semantic-large datasets contain the following fields:
question: Contains the text of the question.
graph_gen_algorithm: Contains the name of the graph generator algorithm used to generate the graph.
question_type: Corresponds to one of the 7 question types in the dataset.
sorting_type: Corresponds to the sorting type applied to the facts to order them.
prompt: Contains the full prompt text used to evaluate LLMs on the task.
label: Contains the ground truth answer to the question.

The ToT-arithmetic dataset contains the following fields:
question: Contains the text of the question.
question_type: Corresponds to one of the 7 question types in the dataset.
label: Contains the ground truth answer to the question.
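
As a quick sanity check on the two schemas described above, the following minimal sketch tests whether a record carries the expected fields. Only the field names come from this document. The helper `missing_fields` and the sample record values are hypothetical and shown for illustration only; they are not taken from the dataset.

```python
# Field names as listed in the Data Format section.
SEMANTIC_FIELDS = {
    "question",
    "graph_gen_algorithm",
    "question_type",
    "sorting_type",
    "prompt",
    "label",
}
ARITHMETIC_FIELDS = {"question", "question_type", "label"}


def missing_fields(record, expected):
    """Return the expected field names that are absent from the record."""
    return expected - set(record)


# Hypothetical record: the values are placeholders, not real dataset content.
example_record = {
    "question": "placeholder question text",
    "question_type": "placeholder_type",
    "label": "placeholder answer",
}

# The record satisfies the ToT-arithmetic schema but not the ToT-semantic one.
arithmetic_gaps = missing_fields(example_record, ARITHMETIC_FIELDS)
semantic_gaps = missing_fields(example_record, SEMANTIC_FIELDS)
print(sorted(arithmetic_gaps))
print(sorted(semantic_gaps))
```

A check like this can be run over each loaded record before evaluation, so that a record from the wrong subset is caught early rather than failing later on a missing `prompt` field.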