
The MathEquiv dataset accompanies EquivPruner. It is designed specifically for mathematical statement equivalence and serves as a versatile resource for a variety of mathematical tasks and scenarios. It consists of almost 100k pairs of mathematical sentences, each with an equivalence result and reasoning steps generated by GPT-4o.

The dataset consists of three splits:

train with 77.6k samples for training.
test with 9.83k samples for testing.
valid with 9.75k samples for validation.
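
As a quick sanity check on the split sizes above, the following sketch sums the three counts and computes each split's share. The counts are the rounded figures quoted in this description (for example, 77.6k is taken as 77,600), so the exact row counts in the released files may differ slightly. The names `SPLIT_SIZES` and `split_fractions` are illustrative and not part of the dataset.

```python
# Rounded split sizes as quoted in the description (assumption: exact
# row counts in the released files may differ by a few dozen samples).
SPLIT_SIZES = {"train": 77_600, "test": 9_830, "valid": 9_750}


def split_fractions(sizes):
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} samples ({share:.1%})")
print(f"total: {total}")
```

With these rounded figures the total is about 97.2k pairs, consistent with "almost 100k", and the splits work out to roughly 80% / 10% / 10%.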

We implemented a five-tiered classification system. This granular approach was adopted to enhance the stability of the GPT model's outputs, as preliminary experiments with binary classification (equivalent/non-equivalent) revealed inconsistencies in judgments. The five-tiered system yielded significantly more consistent and reliable assessments:

Level 4 (Exactly Equivalent): The statements are mathematically interchangeable in all respects, exhibiting identical meaning and form.
Level 3 (Likely Equivalent): Minor syntactic differences may be present, but the core mathematical content and logic align.
Level 2 (Indeterminable): Insufficient information is available to make a definitive judgment regarding equivalence.
Level 1 (Unlikely Equivalent): While some partial agreement may exist, critical discrepancies in logic, definition, or mathematical structure are observed.
Level 0 (Not Equivalent): The statements are fundamentally distinct in their mathematical meaning, derivation, or resultant outcomes.
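
The five levels can be represented in code as an integer enumeration, which is a minimal sketch and not part of the dataset itself. The enum name `EquivalenceLevel` and the helper `to_binary` are hypothetical. In particular, collapsing levels 3 and 4 to "equivalent" and levels 0 and 1 to "not equivalent", while leaving level 2 undecided, is an assumption made for illustration; the description above does not prescribe how to binarize the labels.

```python
from enum import IntEnum
from typing import Optional


class EquivalenceLevel(IntEnum):
    """The five equivalence levels described above (names are illustrative)."""
    NOT_EQUIVALENT = 0
    UNLIKELY_EQUIVALENT = 1
    INDETERMINABLE = 2
    LIKELY_EQUIVALENT = 3
    EXACTLY_EQUIVALENT = 4


def to_binary(level: int) -> Optional[bool]:
    """Collapse a five-level label into equivalent / not equivalent.

    Assumption: levels 3-4 count as equivalent, levels 0-1 as not
    equivalent, and level 2 (indeterminable) maps to None.
    Raises ValueError for any value outside 0-4.
    """
    lvl = EquivalenceLevel(level)
    if lvl is EquivalenceLevel.INDETERMINABLE:
        return None
    return lvl >= EquivalenceLevel.LIKELY_EQUIVALENT


for raw in range(5):
    print(raw, EquivalenceLevel(raw).name, to_binary(raw))
```

Returning `None` for level 2 keeps the "insufficient information" case explicit, so downstream code has to decide how to treat it instead of silently counting it as a mismatch.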