Papers With Code 2 | ML Benchmarks, SotA Results & Code

MathBench is an All in One math dataset for language model evaluation, with:

A Sophisticated Five-Stage Difficulty Mechanism: Unlike the usual mathematical datasets that can only evaluate a single difficulty level or have a mix of unclear difficulty levels, MathBench provides 3709 questions with a gradient difficulty division by education stages, ranging from basic arithmetic to primary, middle, high school, and college levels, allowing you to get a clear overview of the comprehensive difficulty evaluation results.

Bilingual Gradient Evaluation: Apart from the basic calculation part which is language-independent, MathBench provides questions in both Chinese and English for the four-stage difficulty datasets from primary to college.

Implementation of the Robust Circular Evaluation (CE) Method: MathBench use CE as the main evaluation method for questions. Compared to the general Accuracy evaluation method, CE requires the model to answer the same multiple-choice question multiple times, with the order of the options changing each time. The model is considered correct on this question only if all answers are correct. The results of CE can reflect the model's capabilities more realistically, providing more valuable evaluation results.

Support for Basic Theory Questions: For every stage, MathBench provides questions that cover the basic theory knowledge points of the corresponding stage, to ascertain whether the model has genuinely mastered the fundamental concepts of each stage or merely memorized the answers.