Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

2405.07248

Published 5/14/2024 by Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow

Abstract

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs when using specific demographic profiles show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.

Overview

Researchers are investigating whether large language models (LLMs) can simulate human participants in experiments, opinion polls, and surveys.
A key focus is mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires.
However, translating LLMs' text responses into underlying personality traits is challenging.
This study uses psychometrics, the science of psychological measurement, to evaluate the ability of GPT-3.5 and GPT-4 to assume different personas and respond to personality tests.

Plain English Explanation

Researchers are exploring whether large language models like GPT-3.5 and GPT-4 can be used to stand in for human participants in experiments, surveys, and opinion polls. A key part of this research is trying to map out the "psychological profiles" of these AI models, that is, to work out which personality traits their questionnaire answers appear to express.

This is challenging because translating an AI's text responses into underlying personality traits is not straightforward. To address this, the researchers in this study used psychometrics, the science of psychological measurement. They prompted the AI models to assume different personas, either generic ones built from a few random person descriptions or specific ones based mostly on the demographics of real people, and then had them respond to standard personality tests.

The results showed that responses from GPT-4, but not GPT-3.5, using generic personas had promising, though imperfect, psychometric properties, meaning the statistical patterns in its answers were similar to human norms. However, when using specific demographic profiles, the responses from both AI models showed poor psychometric properties.

In other words, while GPT-4 may be able to somewhat realistically simulate generic human personas, it struggles to accurately reflect the individual-level behavior of specific people across multiple-choice personality tests. This casts doubt on the ability of current language models to truly simulate human-level behavior and decision-making.

Technical Explanation

This study investigated whether large language models (LLMs) such as GPT-3.5 and GPT-4 can be used to simulate human participants in experiments, opinion polls, and surveys. The researchers used psychometrics, the science of psychological measurement, to evaluate the ability of these LLMs to assume different personas and respond to standardized personality questionnaires.

The researchers prompted the LLMs to use two types of persona descriptions: generic (four or five random person descriptions) and specific (mostly demographics of actual humans from a large-scale dataset). They then had the LLMs respond to a range of standardized measures of personality constructs.
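
The persona-prompting setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `build_prompt` and `parse_rating`, the prompt wording, the five-point scale labels, and the example persona and item are hypothetical and are not taken from the paper. The sketch only builds the prompt text and parses a reply; it does not call any model API.

```python
import re

# Hypothetical five-point agreement scale (labels are illustrative, not the paper's).
LIKERT_SCALE = {
    1: "Disagree strongly",
    2: "Disagree a little",
    3: "Neither agree nor disagree",
    4: "Agree a little",
    5: "Agree strongly",
}


def build_prompt(persona_descriptions, item):
    """Combine a persona and one questionnaire item into a single prompt string."""
    persona = " ".join(persona_descriptions)
    options = "\n".join(f"{score}. {label}" for score, label in LIKERT_SCALE.items())
    return (
        f"For the following task, respond as this persona: {persona}\n"
        f'Rate how well this statement describes you: "{item}"\n'
        f"{options}\n"
        "Answer with a single number from 1 to 5."
    )


def parse_rating(reply):
    """Return the first standalone digit 1-5 in the reply, or None if there is none."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical generic persona made of a few short person descriptions.
    persona = ["I like hiking.", "I work as a nurse.", "I have two cats."]
    print(build_prompt(persona, "I see myself as someone who is talkative."))
    print(parse_rating("4. Agree a little"))
```

Returning `None` for unparseable replies, rather than guessing a rating, keeps refusals and off-format answers out of the score matrix so they can be counted separately.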

The results showed that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions exhibited promising, albeit not perfect, psychometric properties that were similar to human norms. However, the data from both LLMs when using specific demographic profiles showed poor psychometric properties.
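
To make "psychometric properties" more concrete, the sketch below computes Cronbach's alpha, a widely used internal-consistency reliability statistic. This is an illustration only: this summary does not state which statistics the paper reports, so the choice of alpha, the function name `cronbach_alpha`, and the toy response matrices are all assumptions. The formula is alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix of ratings.

    responses: list of rows, one row per (simulated) respondent,
    each row holding that respondent's ratings on the k items of one scale.
    """
    n_respondents = len(responses)
    k = len(responses[0])
    if n_respondents < 2 or k < 2:
        raise ValueError("need at least 2 respondents and 2 items")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's total scale score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Toy data (hypothetical): each persona answers all three items consistently.
    consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
    # Toy data (hypothetical): answers to the items do not hang together.
    inconsistent = [[1, 5, 2], [5, 1, 3], [3, 3, 5], [2, 4, 1]]
    print(cronbach_alpha(consistent))
    print(cronbach_alpha(inconsistent))
```

A high alpha indicates that items intended to measure the same trait vary together across respondents, while a low or negative alpha indicates that the answers are a poor signal of a single underlying trait.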

These findings suggest that when LLMs are asked to simulate silicon personas, their responses are not reliable indicators of potentially underlying latent personality traits. The researchers conclude that this casts doubt on the ability of current LLMs to accurately simulate individual-level human behavior across multiple-choice question answering tasks.

Critical Analysis

The researchers acknowledge several limitations and caveats in their study. They note that mapping out the underlying personality traits of LLMs based on their text responses is an inherently challenging task, as the relationship between language and psychological constructs is complex and not fully understood.

Additionally, the researchers used a relatively small set of persona descriptions, both generic and specific, which may not fully capture the diversity of human experiences and behaviors. There is a need for further research using a wider range of persona profiles to better understand the capabilities and limitations of LLMs in this domain.

The researchers also highlight the fact that their findings are specific to the particular LLMs and personality measures used in the study. It is possible that other LLMs or different assessment tools may yield different results, and further research is needed to evaluate the generalizability of these findings.

Overall, this study raises important questions about the ability of current large language models to accurately simulate human-level behavior and decision-making, particularly at the individual level. While these models may show some promising psychometric properties when using generic persona descriptions, their performance deteriorates when tasked with simulating specific demographic profiles. This suggests that caution is warranted when considering the use of LLMs as a stand-in for human participants in experiments, surveys, and other research contexts.

Conclusion

This study investigated the ability of large language models (LLMs) like GPT-3.5 and GPT-4 to simulate human participants in experiments, opinion polls, and surveys by mapping out their psychological profiles. Using psychometric methods, the researchers found that while GPT-4 responses using generic persona descriptions showed some promising, albeit imperfect, psychometric properties, the responses from both LLMs using specific demographic profiles exhibited poor psychometric qualities.

These findings cast doubt on the ability of current LLMs to accurately simulate individual-level human behavior and decision-making across multiple-choice question answering tasks. While further research is needed to fully understand the capabilities and limitations of these models, this study suggests that caution is warranted when considering the use of LLMs as substitutes for human participants in various research contexts.

Related Papers

Assessing the nature of large language models: A caution against anthropocentrism

Ann Speed

Generative AI models garnered a large amount of public attention and speculation with the release of OpenAI's chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT 3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models' capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although its ability to respond to personality inventories is interesting. GPT3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.

6/28/2024

cs.AI cs.CL cs.CY cs.HC

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship.

4/3/2024

cs.CL cs.AI cs.HC

Challenging the Validity of Personality Tests for Large Language Models

Tom Suhr, Florian E. Dorner, Samira Samadi, Augustin Kelava

With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items (I am introverted vs. I am extraverted) are often both answered affirmatively. Furthermore, variation across prompts designed to steer LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' personality.

6/6/2024

cs.CL cs.AI cs.LG

Is persona enough for personality? Using ChatGPT to reconstruct an agent's latent personality from simple descriptions

Yongyi Ji, Zhisheng Tang, Mayank Kejriwal

Personality, a fundamental aspect of human cognition, contains a range of traits that influence behaviors, thoughts, and emotions. This paper explores the capabilities of large language models (LLMs) in reconstructing these complex cognitive attributes based only on simple descriptions containing socio-demographic and personality type information. Utilizing the HEXACO personality framework, our study examines the consistency of LLMs in recovering and predicting underlying (latent) personality dimensions from simple descriptions. Our experiments reveal a significant degree of consistency in personality reconstruction, although some inconsistencies and biases, such as a tendency to default to positive traits in the absence of explicit information, are also observed. Additionally, socio-demographic factors like age and number of children were found to influence the reconstructed personality dimensions. These findings have implications for building sophisticated agent-based simulacra using LLMs and highlight the need for further research on robust personality generation in LLMs.

6/19/2024

cs.CL cs.AI