OlympicArena

Introduced 2024-06-18

OlympicArena is a benchmark for evaluating the advanced capabilities of language models across a broad spectrum of Olympic-level challenges.

Comprehensive:

The benchmark includes a comprehensive set of 11,163 problems from 62 distinct Olympic competitions, structured with 13 answer types. It spans seven core disciplines: mathematics, physics, chemistry, biology, geography, astronomy, and computer science, encompassing a total of 34 specialized branches.

Highly Challenging:

The benchmark focuses on Olympic-level problems and covers 8 types of logical reasoning abilities and 5 types of visual reasoning abilities.

Rigorous:

Given the increasing scale of pre-training corpora, it is crucial to detect potential benchmark leakage. We employ a recently proposed instance-level leakage detection metric to verify the integrity of our benchmark.
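To make the idea of instance-level leakage detection concrete, here is a minimal illustrative sketch. It is not the metric used by OlympicArena; it simply flags a problem as potentially leaked when long n-grams from its text appear verbatim in a sample of pre-training corpus text (the function names and the n-gram length are assumptions for illustration).

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leakage_score(problem_text, corpus_text, n=8):
    """Fraction of the problem's n-grams found verbatim in the corpus sample.

    Illustrative only: a score near 1.0 suggests the instance may have
    appeared in pre-training data; a real detector would operate at the
    model level (e.g., via likelihood comparisons), not on raw text.
    """
    p = ngrams(problem_text.split(), n)
    c = ngrams(corpus_text.split(), n)
    if not p:
        return 0.0
    return len(p & c) / len(p)
```

In practice such scores are computed per instance, so contaminated problems can be identified and reported individually rather than discarding the whole benchmark.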

Fine-grained Evaluation:

We conduct comprehensive evaluations from both the answer-level and process-level perspectives. Additionally, we perform fine-grained evaluations and analyses on different types of cognitive reasoning, from both logical and visual perspectives to better interpret the current capabilities of AI.
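As a rough illustration of what answer-level evaluation involves, the sketch below grades a prediction against a gold answer for a few answer types (numeric with relative tolerance, multiple choice, and a normalized exact-match fallback). The rules and type names here are hypothetical, not the benchmark's official grading logic for its 13 answer types.

```python
def grade_answer(pred, gold, answer_type):
    """Hypothetical answer-level grader; returns True if pred matches gold."""
    if answer_type == "numeric":
        # accept predictions within 1% relative tolerance of the gold value
        try:
            return abs(float(pred) - float(gold)) <= 1e-2 * max(1.0, abs(float(gold)))
        except ValueError:
            return False
    if answer_type == "multiple_choice":
        # case-insensitive comparison of option letters
        return pred.strip().upper() == gold.strip().upper()
    # fallback: whitespace- and case-normalized exact match
    return " ".join(pred.split()).lower() == " ".join(gold.split()).lower()
```

Process-level evaluation, by contrast, would score the intermediate reasoning steps rather than only the final answer, which is why the two perspectives are reported separately.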