
BenBench is designed to benchmark the potential for data leakage in benchmark datasets, which can lead to biased and inequitable comparisons. In this work, we do not pursue technical contributions in system development; instead, we aim to encourage the healthy development of this field, particularly through the lens of mathematical reasoning tasks, in the following respects:

Summaries of various pre-training behaviors and challenges for detecting benchmark leakage;
Proposal of a detection pipeline for estimating pre-training behaviors;
Leakage analysis of existing models;
Recommendations for model documentation (i.e., introducing the Benchmark Transparency Card), benchmark setup, and future evaluations.