Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

2405.11841

Published 5/21/2024 by Junqi Wang, Chunhui Zhang, Jiapeng Li, Yuxi Ma, Lixing Niu, Jiaheng Han, Yujia Peng, Yixin Zhu, Lifeng Fan

cs.AI

Abstract

Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence. Our codes, dataset, appendix and human data are released at https://github.com/bigai-ai/Evaluate-n-Model-Social-Intelligence.

Overview

The study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition, and applies it to both humans and large language models (LLMs).
The researchers developed a theoretical framework for social dynamics and two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP).
They also created a computational model based on recursive Bayesian inference to study diverse human behavioral patterns.
The experiments showed that humans outperformed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities.
The study finds that GPT models demonstrate social intelligence only at the most basic order (order = 0), whereas humans reason at higher orders (order >= 2).

Plain English Explanation

The researchers wanted to understand how well large language models (LLMs), such as GPT, can display social intelligence, that is, the ability to understand and navigate social situations, which is a crucial aspect of human cognition. They developed a framework for evaluating social intelligence and two specific tasks, Inverse Reasoning (IR) and Inverse Inverse Planning (IIP), to test it. They also created a computational model that accounts for diverse patterns in human behavior.

When they tested the latest GPT models against humans, they found that humans performed better overall, did better when given no worked examples (zero-shot learning), generalized better from a single example (one-shot generalization), and adapted better to different input formats (multi-modalities). Notably, the GPT models showed only the most basic order of social intelligence (order = 0), while humans reasoned at higher orders (order >= 2).

The researchers suggest that LLMs tend to rely on pattern recognition as a shortcut, rather than truly understanding social situations the way humans do. This raises questions about whether these models have genuinely attained near-human intelligence, as some have claimed.

Technical Explanation

The researchers developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks to assess the social intelligence of LLMs: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). The IR task requires the model to infer the underlying social dynamics from observed behaviors, while the IIP task involves predicting how an agent would plan their actions in a social setting.

To study these tasks, the researchers created a computational model based on recursive Bayesian inference, which was able to capture diverse human behavioral patterns.
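
As a rough illustration of what recursive Bayesian inference over "orders" of social reasoning can look like, the sketch below builds a toy actor and observer that reason about each other. It is not the paper's model: the two goals, three actions, utility table, uniform prior, BETA value, and the function names actor and observer are all hypothetical choices made for this example. An order-0 actor simply prefers useful actions; an observer inverts that actor with Bayes' rule; an actor at order k additionally favors actions that an observer one order below would interpret correctly.

```python
import math

# Hypothetical toy world: an actor with a hidden goal picks one of three actions.
# UTILITY[goal][action] is how much the action helps the actor reach that goal.
UTILITY = {
    "goal_A": {"left": 1.0, "middle": 0.8, "right": 0.0},
    "goal_B": {"left": 0.0, "middle": 0.8, "right": 1.0},
}
GOALS = list(UTILITY)
ACTIONS = ["left", "middle", "right"]
PRIOR = {goal: 1.0 / len(GOALS) for goal in GOALS}  # uniform prior over goals
BETA = 3.0  # softmax rationality; an arbitrary illustrative value


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def actor(goal, order):
    """P(action | goal) for an actor reasoning at the given order."""
    weights = {}
    for action in ACTIONS:
        weight = math.exp(BETA * UTILITY[goal][action])
        if order > 0:
            # Higher orders also weigh how strongly an observer one order
            # below would read this action as evidence for the true goal.
            weight *= observer(action, order - 1)[goal]
        weights[action] = weight
    return normalize(weights)


def observer(action, order):
    """P(goal | action): Bayes' rule applied to an actor of the same order."""
    return normalize(
        {goal: PRIOR[goal] * actor(goal, order)[action] for goal in GOALS}
    )


if __name__ == "__main__":
    for order in range(3):
        print(order, actor("goal_A", order), observer("left", order))
```

In this toy setup, the "middle" action is equally useful for both goals, so an order-0 observer learns nothing from it. An order-1 actor therefore shifts probability toward "left", the action that reveals its goal most clearly, which is one simple way a higher order of reasoning can change behavior.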

When they tested the latest GPT models against humans on these tasks, humans surpassed the LLMs in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities. Notably, the GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to the higher-order social intelligence exhibited by humans (order >= 2).

Further examination indicated that LLMs tend to rely on pattern recognition for shortcuts, rather than truly understanding social dynamics. This casts doubt on whether these models possess authentic human-level social intelligence.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their study. For example, they note that their computational model may not fully capture the complexity of human social dynamics, and that additional tasks and benchmarks may be needed to more comprehensively evaluate social intelligence.

Additionally, the researchers do not address the potential biases or limitations of the human data used in their experiments, which could affect the reliability of the comparisons between human and LLM performance.

While the study raises important questions about the social intelligence capabilities of LLMs, it is worth considering how these models can be improved and leveraged to enable better social interactions. The researchers could have explored potential avenues for enhancing the social intelligence of LLMs, rather than focusing solely on their limitations.

Conclusion

This study provides a framework for evaluating the social intelligence of large language models, a distinctive aspect of human cognition that bears directly on the current debate about AI capabilities. The findings suggest that while LLMs may excel at certain tasks, they still fall short of human-level social intelligence, tending to rely on pattern recognition rather than a deeper understanding of social dynamics.

These insights highlight the need for continued research and development to improve the social intelligence of AI systems and to critically examine the claims of near-human intelligence being made about current LLMs. By pushing the boundaries of social intelligence evaluation, this study contributes to the ongoing discussion on the true capabilities and limitations of large language models.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper to verify details.

Related Papers

Assessing the nature of large language models: A caution against anthropocentrism

Ann Speed

Generative AI models garnered a large amount of public attention and speculation with the release of OpenAI's chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT 3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models' capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although their ability to respond to personality inventories is interesting. GPT 3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.

6/28/2024

cs.AI cs.CL cs.CY cs.HC

InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context

Ziyi Liu, Abhishek Anand, Pei Zhou, Jen-tse Huang, Jieyu Zhao

Large language models (LLMs) have demonstrated the potential to mimic human social intelligence. However, most studies focus on simplistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a novel framework, InterIntent, to assess LLMs' social intelligence by mapping their ability to understand and manage intentions in a game setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role of intention understanding in evaluating LLMs' social intelligence and highlight the potential of using social deduction games as a complex testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within multiplayer games.

6/19/2024

cs.AI

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Kun Gai

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning, which refreshes human impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities that LLMs should have as human-like dialogue systems. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive tests on English and Chinese DialogBench of 26 LLMs show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems. Interestingly, results also show that the positioning of assistant AI can make instruction tuning weaken the human emotional perception of LLMs and their mastery of information about human daily life.

4/1/2024

cs.CL cs.AI

A social path to human-like artificial intelligence

Edgar A. Duéñez-Guzmán, Suzanne Sadedin, Jane X. Wang, Kevin R. McKee, Joel Z. Leibo

Traditionally, cognitive and computer scientists have viewed intelligence solipsistically, as a property of unitary agents devoid of social context. Given the success of contemporary learning algorithms, we argue that the bottleneck in artificial intelligence (AI) progress is shifting from data assimilation to novel data generation. We bring together evidence showing that natural intelligence emerges at multiple scales in networks of interacting agents via collective living, social relationships and major evolutionary transitions, which contribute to novel data generation through mechanisms such as population pressures, arms races, Machiavellian selection, social learning and cumulative culture. Many breakthroughs in AI exploit some of these processes, from multi-agent structures enabling algorithms to master complex games like Capture-The-Flag and StarCraft II, to strategic communication in Diplomacy and the shaping of AI data streams by other AIs. Moving beyond a solipsistic view of agency to integrate these mechanisms suggests a path to human-like compounding innovation through ongoing novel data generation.

5/28/2024

cs.AI cs.LG