NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

Read original: arXiv:2404.13627 - Published 7/8/2024 by Chunkit Chan, Cheng Jiayang, Yauwai Yim, Zheye Deng, Wei Fan, Haoran Li, Xin Liu, Hongming Zhang, Weiqi Wang, Yangqiu Song
Total Score

0

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • This paper introduces NegotiationToM, a new benchmark for evaluating the ability of AI systems to reason about the mental states of negotiation partners.
  • The benchmark involves a series of simulated negotiation scenarios where AI agents must infer the beliefs, desires, and intentions of their counterparts in order to reach optimal outcomes.
  • The goal is to stress-test the "theory of mind" capabilities of AI systems and better understand their limitations in complex social reasoning tasks.

Plain English Explanation

The researchers have created a new test called NegotiationToM to see how well AI systems can understand the thoughts and feelings of other negotiators. In a negotiation, you have to think about what the other person wants, what they believe, and what they intend to do. This is called "theory of mind" - the ability to imagine what's going on in someone else's mind.

The NegotiationToM benchmark puts AI agents through a series of simulated negotiation scenarios. The agents have to try to figure out what their negotiation partners are thinking and feeling in order to reach the best deals. This is a challenging task that tests the limits of an AI's social reasoning capabilities.

The goal is to better understand the strengths and weaknesses of current AI systems when it comes to this type of complex social interaction. By seeing how well AIs perform on the NegotiationToM benchmark, researchers can identify areas where AI theory of mind abilities need to be improved.

Technical Explanation

The paper introduces the NegotiationToM benchmark, which is designed to evaluate the "theory of mind" (ToM) capabilities of AI systems in the context of negotiation scenarios. Theory of mind is the ability to attribute mental states like beliefs, desires, and intentions to others.

The benchmark consists of a set of simulated negotiation tasks where AI agents must reason about the mental states of their negotiation partners in order to reach optimal outcomes. The scenarios involve information asymmetry, conflicting interests, and other factors that require sophisticated social reasoning.

To assess the AI agents' ToM abilities, the researchers measure metrics like the agents' ability to predict their partners' actions, the quality of the negotiation outcomes, and the efficiency of the negotiation process. The benchmark is designed to be challenging, with the goal of stress-testing the limits of current AI systems' social intelligence.

The authors also provide a set of baseline AI agents trained using reinforcement learning techniques. These agents demonstrate varying degrees of ToM capabilities, establishing a benchmark for future AI systems to be compared against.

Critical Analysis

The NegotiationToM benchmark represents an important step forward in evaluating the social intelligence of AI systems. Assessing an AI's ability to reason about the mental states of others is crucial for developing systems that can engage in natural, human-like interactions.

However, the paper acknowledges several limitations of the benchmark. First, the scenarios are relatively narrow in scope, focusing only on negotiation tasks. Expanding the benchmark to include a wider range of social interactions would provide a more comprehensive evaluation of ToM abilities.

Additionally, the paper does not address the challenge of how to interpret the benchmark results. Differences in performance between AI agents may reflect underlying differences in their ToM capabilities, but could also be influenced by factors like training data or architectural choices. Further research is needed to establish a clear link between benchmark scores and genuine social intelligence.

Finally, the paper does not explore the potential impacts, both positive and negative, of developing AI systems with advanced ToM abilities. As these systems become more capable of understanding and manipulating human behavior, it will be important to consider the ethical implications and potential societal implications.

Conclusion

The NegotiationToM benchmark represents an important step forward in the effort to develop AI systems that can engage in natural, human-like social interactions. By stress-testing the theory of mind capabilities of AI agents in simulated negotiation scenarios, the benchmark provides valuable insights into the current limitations of AI social reasoning.

While the benchmark has some limitations, it lays the groundwork for future research in this critical area of AI development. As the field of artificial intelligence continues to advance, the ability to understand and reason about the mental states of others will become increasingly important for creating AI systems that can effectively cooperate and collaborate with humans.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
Total Score

0

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

Chunkit Chan, Cheng Jiayang, Yauwai Yim, Zheye Deng, Wei Fan, Haoran Li, Xin Liu, Hongming Zhang, Weiqi Wang, Yangqiu Song

Large Language Models (LLMs) have sparked substantial interest and debate concerning their potential emergence of Theory of Mind (ToM) ability. Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, which lacks evaluation of machine ToM ability in real-world human interaction scenarios. This poses a pressing demand to develop new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.

Read more

7/8/2024

💬

Total Score

0

OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models

Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, Yulan He

Neural Theory-of-Mind (N-ToM), machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including the presence of ambiguous and artificial narratives, absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.

Read more

6/4/2024

MMToM-QA: Multimodal Theory of Mind Question Answering
Total Score

0

MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Read more

6/18/2024

🤷

Total Score

0

Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground

Adil Soubki, John Murzaku, Arash Yousefi Jordehi, Peter Zeng, Magdalena Markowska, Seyed Abolghasem Mirroshandel, Owen Rambow

Evaluating the theory of mind (ToM) capabilities of language models (LMs) has recently received a great deal of attention. However, many existing benchmarks rely on synthetic data, which risks misaligning the resulting experiments with human behavior. We introduce the first ToM dataset based on naturally occurring spoken dialogs, Common-ToM, and show that LMs struggle to demonstrate ToM. We then show that integrating a simple, explicit representation of beliefs improves LM performance on Common-ToM.

Read more

6/7/2024