Qwen2 (CoT + Code Interpreter)

Reported on 4 benchmarks across 4 tasks

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Knowledge Base: 2 results

Mathematical Question Answering on SVAMP
Execution Accuracy
92.3
best: 93.9 (GPT-4 (Teaching-Inspired))
Mathematical Reasoning on SVAMP
Execution Accuracy
92.3
best: 93.9 (GPT-4 (Teaching-Inspired))

Natural Language Processing: 1 result

Question Answering on SVAMP
Execution Accuracy
92.3
best: 93.9 (GPT-4 (Teaching-Inspired))

Reasoning: 1 result

Math Word Problem Solving on SVAMP
Execution Accuracy
92.3
best: 93.9 (GPT-4 (Teaching-Inspired))