OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models

2402.06044

Published 6/4/2024 by Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, Yulan He

💬

Abstract

Neural Theory-of-Mind (N-ToM), machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including the presence of ambiguous and artificial narratives, absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.

Create account to get full access

Overview

The paper focuses on a new benchmark called OpenToM for assessing large language models' (LLMs) ability to understand and keep track of others' mental states, known as Neural Theory-of-Mind (N-ToM).
Prevalent N-ToM benchmarks have several shortcomings, such as ambiguous narratives, lack of character personality traits, and limited diversity in the questions posed.
OpenToM aims to address these issues by providing longer and clearer narrative stories, characters with explicit personality traits, and questions designed to challenge LLMs' capabilities in modeling both physical and psychological mental states.
The paper reveals that state-of-the-art LLMs excel at modeling certain aspects of mental states in the physical world but struggle when it comes to tracking characters' mental states in the psychological world.

Plain English Explanation

The paper discusses a new benchmark called OpenToM that aims to better assess a machine's ability to understand and keep track of the mental states of others, which is known as Neural Theory-of-Mind (N-ToM). This skill is crucial for developing socially intelligent agents.

The researchers found that existing N-ToM benchmarks have some problems, such as using ambiguous or artificial stories, not giving characters clear personality traits, and not asking enough questions about the characters' psychological mental states. To address these issues, the researchers created OpenToM, which has longer and clearer stories, characters with explicit personalities, and a wider range of questions to challenge language models' ability to understand both the physical and psychological aspects of the characters' mental states.

When they tested state-of-the-art language models using OpenToM, the researchers found that the models were good at understanding the characters' mental states related to the physical world, but struggled more when it came to tracking the characters' psychological mental states.

Technical Explanation

The paper introduces a new benchmark called OpenToM for assessing the ability of large language models (LLMs) to understand and keep track of others' mental states, a capability known as Neural Theory-of-Mind (N-ToM).

The researchers identify several shortcomings in prevalent N-ToM benchmarks, including:

Ambiguous and artificial narratives
Lack of explicit personality traits for characters
Absence of questions addressing characters' psychological mental states
Limited diversity in the types of questions posed

To address these issues, the researchers constructed OpenToM with the following key features:

Longer and clearer narrative stories
Characters with explicit personality traits
Actions triggered by character intentions
Questions designed to challenge LLMs' capabilities in modeling both physical and psychological mental states

Using OpenToM, the researchers evaluated state-of-the-art LLMs and found that while the models excel at modeling certain aspects of mental states in the physical world, they fall short when it comes to tracking characters' mental states in the psychological world. This suggests that current LLMs still have limitations in their ability to represent beliefs about the self and others and delegating theory-of-mind reasoning to language models.

Critical Analysis

The paper makes a valuable contribution by introducing OpenToM, a new benchmark that addresses several shortcomings of existing N-ToM assessments. The researchers have thoughtfully designed the benchmark to better capture the nuances of character mental states, which is an important step towards developing more aligned theory-of-mind capabilities in language models.

However, the paper does not delve into potential limitations of the OpenToM benchmark itself. For example, it would be helpful to understand how the narratives, character traits, and question types were selected and validated, and whether there are any biases or constraints inherent in the design choices.

Additionally, the paper could have explored in more depth the specific areas where state-of-the-art LLMs struggle in modeling psychological mental states. Understanding the underlying reasons for these limitations could inform future research and model development efforts.

Conclusion

The paper presents a new benchmark called OpenToM that aims to more comprehensively assess the ability of large language models to understand and keep track of others' mental states, a crucial capability for developing socially intelligent agents. By addressing several shortcomings of existing N-ToM benchmarks, OpenToM provides a more robust and challenging test of LLMs' theory-of-mind reasoning skills.

The results revealed that while current state-of-the-art LLMs can excel at modeling certain aspects of mental states in the physical world, they struggle when it comes to tracking characters' mental states in the psychological realm. This suggests that there is still significant room for improvement in the field of machine theory-of-mind, and the OpenToM benchmark can serve as a valuable tool for driving progress in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

Chunkit Chan, Cheng Jiayang, Yauwai Yim, Zheye Deng, Wei Fan, Haoran Li, Xin Liu, Hongming Zhang, Weiqi Wang, Yangqiu Song

Large Language Models (LLMs) have sparked substantial interest and debate concerning their potential emergence of Theory of Mind (ToM) ability. Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, which lacks evaluation of machine ToM ability in real-world human interaction scenarios. This poses a pressing demand to develop new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.

4/23/2024

cs.CL cs.AI

MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

6/18/2024

cs.AI cs.CL cs.CV cs.LG

A Notion of Complexity for Theory of Mind via Discrete World Models

X. Angelo Huang, Emanuele La Malfa, Samuele Marro, Andrea Asperti, Anthony Cohn, Michael Wooldridge

Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios where social reasoning is required. While the research community has proposed many ToM benchmarks, their hardness varies greatly, and their complexity is not well defined. This work proposes a framework to measure the complexity of ToM tasks. We quantify a problem's complexity as the number of states necessary to solve it correctly. Our complexity measure also accounts for spurious states of a ToM problem designed to make it apparently harder. We use our method to assess the complexity of five widely adopted ToM benchmarks. On top of this framework, we design a prompting technique that augments the information available to a model with a description of how the environment changes with the agents' interactions. We name this technique Discrete World Models (DWM) and show how it elicits superior performance on ToM tasks.

6/19/2024

cs.AI cs.CL cs.LG

Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses

Maryam Amirizaniani, Elias Martin, Maryna Sivachenko, Afra Mashhadi, Chirag Shah

Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts, which is vital for guiding one's own thought processes. Although large language models (LLMs) excel in tasks such as summarization, question answering, and translation, they still face challenges with ToM reasoning, especially in open-ended questions. Despite advancements, the extent to which LLMs truly understand ToM reasoning and how closely it aligns with human ToM reasoning remains inadequately explored in open-ended scenarios. Motivated by this gap, we assess the abilities of LLMs to perceive and integrate human intentions and emotions into their ToM reasoning processes within open-ended questions. Our study utilizes posts from Reddit's ChangeMyView platform, which demands nuanced social reasoning to craft persuasive responses. Our analysis, comparing semantic similarity and lexical overlap metrics between responses generated by humans and LLMs, reveals clear disparities in ToM reasoning capabilities in open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt tuning method that incorporates human intentions and emotions, resulting in improvements in ToM reasoning performance. However, despite these improvements, the enhancement still falls short of fully achieving human-like reasoning. This research highlights the deficiencies in LLMs' social reasoning and demonstrates how integrating human intentions and emotions can boost their effectiveness.

6/11/2024

cs.CL cs.AI