NMF
Named Mathematical Formulas
Mathematical dataset based on 71 famous mathematical identities. Each entry consists of a name of the identity (name), a representation of that identity (formula), a label whether the representation belongs to the identity (label), and an id of the mathematical identity (formula_name_id).
The false pairs are intentionally challenging, e.g., a^2+2^b=c^2as falsified version of the Pythagorean Theorem. All entries have been generated by using data.json as starting point and applying the randomizing and falsifying algorithms from MathMutator (MAMUT). The formulas in the dataset are not just pure mathematical, but contain also textual descriptions of the mathematical identity. At most 400000 versions are generated per identity. There are ten times more falsified versions than true ones, such that the dataset can be used for a training with changing false examples every epoch.
This dataset might be used to train a language model based math-retrieval system with a NSP-like task.
See also see ddrg/math_formula_retrieval as a derived dataset which associates two formulas.