Papers With Code 2 | ML Benchmarks, SotA Results & Code

RLM25 is an evaluation benchmark containing 619 paired examples of research-level natural language mathematical statements and their corresponding Lean formalizations. The examples are drawn from six real-world formalization projects, and each entry includes essential context along with timestamp information to ensure freshness and avoid data contamination. Its primary purpose is to serve as a challenging testbed for assessing autoformalization systems, helping researchers measure improvements in translating complex, research-level mathematics into formal code.