
The MoToMQA (Multi-Order Theory of Mind Question & Answer) benchmark is a test suite introduced to examine the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason recursively about multiple mental and emotional states¹.

The benchmark is based on a ToM test designed for human adults and involves answering true/false questions about characters in short-form stories¹. The test examines LLM ToM from orders 2-6, where the 'order of intentionality' is the number of mental states involved in a ToM reasoning process¹. For example, a third-order statement is "I think you believe that she knows"¹.
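The notion of order can be made concrete with a minimal sketch that treats a statement as a chain of nested mental states and counts them. The tuple layout and the names `order_of_intentionality`, `render`, and `MOTOMQA_ORDERS` are illustrative assumptions, not part of the benchmark or its released materials.

```python
# Hypothetical representation: one (agent, mental-state verb) pair per nested
# mental state. This layout is an assumption made for illustration only.
MOTOMQA_ORDERS = range(2, 7)  # the benchmark covers orders 2 through 6


def order_of_intentionality(chain):
    """Return the order of a statement: the number of mental states in the chain."""
    return len(chain)


def render(chain):
    """Join the nested mental states into a single readable statement."""
    return " that ".join(f"{agent} {verb}" for agent, verb in chain)


third_order = [("I", "think"), ("you", "believe"), ("she", "knows")]
order = order_of_intentionality(third_order)
in_benchmark_range = order in MOTOMQA_ORDERS
sentence = render(third_order)
```

Here the three-element chain corresponds to the third-order example above; adding one more pair, such as a further agent who "suspects", would make it fourth-order.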

The MoToMQA benchmark is used to compare the performance of LLMs to a newly gathered adult human benchmark¹. It assesses how ToM order affects LLM performance, how LLM performance compares to human performance, and how LLM performance on ToM tasks compares to performance on factual tasks of equivalent syntactic complexity¹.
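A comparison of this kind could be tabulated roughly as follows: group true/false answers by statement type and level, then compute the fraction answered correctly in each group. This is only a sketch under stated assumptions. The record layout, the `"tom"` and `"factual"` labels, and the function name `accuracy_by_level` are hypothetical, and the toy records are placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute accuracy per (kind, level) group.

    records: iterable of (kind, level, expected, answered) tuples, where
    expected and answered are booleans for a true/false question.
    This record layout is an assumption, not the benchmark's actual format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for kind, level, expected, answered in records:
        key = (kind, level)
        totals[key] += 1
        correct[key] += int(expected == answered)
    return {key: correct[key] / totals[key] for key in totals}


# Placeholder records for illustration only; these are NOT real benchmark data.
toy_records = [
    ("tom", 2, True, True),
    ("tom", 2, False, True),
    ("factual", 2, True, True),
    ("tom", 3, False, False),
]
scores = accuracy_by_level(toy_records)
```

Running the same function over model answers and over human answers would give two tables that can be compared group by group, which covers the three questions listed above.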

(1) LLMs achieve adult human performance on higher-order theory of mind tasks. https://arxiv.org/pdf/2405.18870
(2) UserBenchmark: PC Speed Test Tool - Compare Your PC. https://www.userbenchmark.com/Software
(3) MotionMark - browser bench. https://browserbench.org/MotionMark/