Mathematical Formulas
A mathematical dataset of formulas drawn from the AMPS Khan dataset and the ARQMath dataset V1.3. Starting from the retrieved LaTeX formulas, additional equivalent versions were generated by applying randomized LaTeX printing with this SymPy fork. The formulas are intended to be well suited for masked language modeling (MLM). For instance, masking a formula like (a+b)^2 = a^2 + 2ab + b^2 makes sense: in (a+[MASK])^2 = a^2 + [MASK]ab + b[MASK]2, the masked tokens are deducible from the context. In contrast, masking a formula such as f(x) = 3x+1 does not: in [MASK](x) = 3x[MASK]1, the [MASK] tokens are ambiguous.
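The masking examples above can be reproduced with a small sketch. It is only an illustration: the regex tokenizer and the helper names `tokenize` and `mask_tokens` are hypothetical and are not the tokenizer or tooling used to build this dataset. An actual MLM setup would pick mask positions at random with a model-specific tokenizer.

```python
import re

# Hypothetical, simplified tokenizer: runs of digits, single letters,
# or any other single non-whitespace character. Whitespace is dropped.
TOKEN_PATTERN = re.compile(r"\d+|[A-Za-z]|\S")


def tokenize(formula):
    """Split a formula string into a list of simple tokens."""
    return TOKEN_PATTERN.findall(formula)


def mask_tokens(tokens, positions, mask="[MASK]"):
    """Replace the tokens at the given indices with the mask string."""
    return [mask if i in positions else tok for i, tok in enumerate(tokens)]


# Binomial identity: the masked tokens can be recovered from the rest.
binomial = tokenize("(a+b)^2 = a^2 + 2ab + b^2")
masked_binomial = "".join(mask_tokens(binomial, {3, 12, 17}))
print(masked_binomial)

# Arbitrary function definition: many fillers would be equally valid.
linear = tokenize("f(x) = 3x+1")
masked_linear = "".join(mask_tokens(linear, {0, 7}))
print(masked_linear)
```

In the first case the surrounding terms pin down each masked token, while in the second any function name and several operators would produce a valid formula, so the prediction target is not well defined.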