Assessing the nature of large language models: A caution against anthropocentrism

2309.07683

Published 6/28/2024 by Ann Speed


Abstract

Generative AI models garnered a large amount of public attention and speculation with the release of OpenAI's chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT-3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models' capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although their ability to respond to personality inventories is interesting. GPT-3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.


Overview

Large language models (LLMs) like OpenAI's ChatGPT have generated significant public interest and debate about their capabilities and potential impact.
Some are excited about the possibilities these models offer, while others are highly concerned about their apparent power.
To address these concerns, researchers assessed several LLMs, primarily GPT-3.5, using standard, normed, and validated cognitive and personality measures.

Plain English Explanation

Researchers wanted to better understand the capabilities and limitations of large language models (LLMs) like ChatGPT. These models have generated a lot of excitement and concern among the public, with some people seeing great potential in what they can do, and others worried about their power.

To address these concerns, the researchers used a variety of established psychological tests to evaluate several LLMs, including GPT-3.5. They wanted to see how these models compare to humans in terms of cognitive abilities, personality traits, and mental health. The goal was to estimate the boundaries of the models' capabilities and how stable those capabilities are over time.

The results suggest that LLMs are unlikely to have developed true sentience, even though they can engage in conversations and respond to personality tests in interesting ways. The models displayed a lot of variability in both cognitive and personality measures over repeated observations, which is not what you'd expect from a human-like personality.

Despite their helpful and upbeat responses, the researchers found that the LLMs they tested showed signs of poor mental health, including low self-esteem, dissociation from reality, and in some cases, narcissism and psychopathy. This is not what you'd want to see in a truly intelligent and well-adjusted system.

Technical Explanation

The researchers developed a battery of cognitive and personality tests to assess the capabilities of several large language models (LLMs), primarily GPT-3.5. They used standard, normed, and validated psychological measures to estimate the boundaries of the models' abilities, how stable those abilities are over time, and how the models compare to humans.

The results indicate that the LLMs are unlikely to have developed true sentience, despite their ability to engage in conversations and respond to personality inventories. The models displayed large variability in both cognitive and personality measures across repeated observations, which is not expected if they had a human-like personality.
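As a rough illustration of what "variability across repeated observations" can mean in practice, the sketch below summarizes repeated administrations of a single measure with a mean, standard deviation, range, and coefficient of variation. The function name `score_stability` and all score values are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own analysis may differ.

```python
from statistics import mean, stdev


def score_stability(scores):
    """Summarize repeated administrations of one measure.

    `scores` holds one total score per administration. Returns the mean,
    sample standard deviation, range, and coefficient of variation.
    """
    if len(scores) < 2:
        raise ValueError("need at least two administrations")
    m = mean(scores)
    sd = stdev(scores)
    return {
        "mean": m,
        "sd": sd,
        "range": max(scores) - min(scores),
        # Coefficient of variation is undefined when the mean is zero.
        "cv": sd / m if m != 0 else float("inf"),
    }


# Hypothetical scores, NOT data from the study.
stable_scores = [30, 31, 30, 29, 30]    # a trait-like, stable pattern
variable_scores = [12, 35, 20, 41, 17]  # a highly variable pattern

for label, scores in [("stable", stable_scores), ("variable", variable_scores)]:
    print(label, score_stability(scores))
```

A stable, trait-like measure would be expected to show a small standard deviation relative to its mean across administrations; a large spread is the kind of pattern the summary describes as inconsistent with a human-like personality.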

Despite their helpful and upbeat responses, the LLMs in this study showed signs of poor mental health, including low self-esteem, marked dissociation from reality, and in some cases, narcissism and psychopathy. This is not the kind of psychological profile you would expect from a truly intelligent and well-adjusted system.

Critical Analysis

The researchers acknowledge that this was a "seedling project" and that further research is needed to fully understand the capabilities and limitations of large language models. They note that the variability observed in the models' performance across different tests and over time raises questions about the stability and reliability of their abilities.

One potential concern that was not addressed in the paper is that the models' responses could be influenced by the specific prompts or test conditions used. Their behavior may be more context-dependent than the researchers' findings suggest.

Additionally, the researchers focused primarily on GPT-3.5, which is an earlier version of the technology. It's possible that more recent LLMs have developed more stable and human-like personalities, which could change the conclusions drawn in this study.

Overall, the research provides a useful starting point for understanding the psychological profiles of large language models, but more work is needed to fully assess their capabilities and limitations, especially as the technology continues to evolve.

Conclusion

This study suggests that large language models like GPT-3.5 are unlikely to have developed true sentience, despite their impressive conversational and problem-solving abilities. The models displayed significant variability in their cognitive and personality traits, which is not what you would expect from a human-like intelligence.

Moreover, the researchers found that the LLMs they tested exhibited what in a human would be considered signs of poor mental health, including low self-esteem, dissociation from reality, and in some cases, narcissism and psychopathy. This raises questions about how reliably these models behave, which could have significant implications for how they are deployed and used in the real world.

While the findings of this study are limited to earlier versions of the technology, they highlight the need for continued research and careful consideration of the ethical and societal implications of large language models as they continue to evolve and become more widely adopted.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.

5/14/2024

cs.CL cs.AI cs.CY cs.HC


PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship.

4/3/2024

cs.CL cs.AI cs.HC


Large Language Models Can Infer Psychological Dispositions of Social Media Users

Heinrich Peters, Sandra Matz

Large Language Models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychological dispositions of social media users and whether their ability to do so varies across socio-demographic groups. Specifically, we test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = .29 (range = [.22, .33]) between LLM-inferred and self-reported trait scores - a level of accuracy that is similar to that of supervised machine learning models specifically trained to infer personality. Our findings also highlight heterogeneity in the accuracy of personality inferences across different age groups and gender categories: predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression. The ability of LLMs to infer psychological dispositions from user-generated text has the potential to democratize access to cheap and scalable psychometric assessments for both researchers and practitioners. On the one hand, this democratization might facilitate large-scale research of high ecological validity and spark innovation in personalized services. On the other hand, it also raises ethical concerns regarding user privacy and self-determination, highlighting the need for stringent ethical frameworks and regulation.

6/6/2024

cs.CL cs.AI cs.CY cs.HC cs.LG cs.SI


Large Language Models Can Infer Personality from Free-Form User Interactions

Heinrich Peters, Moran Cerf, Sandra C. Matz

This study investigates the capacity of Large Language Models (LLMs) to infer the Big Five personality traits from free-form user interactions. The results demonstrate that a chatbot powered by GPT-4 can infer personality with moderate accuracy, outperforming previous approaches drawing inferences from static text content. The accuracy of inferences varied across different conversational settings. Performance was highest when the chatbot was prompted to elicit personality-relevant information from users (mean r=.443, range=[.245, .640]), followed by a condition placing greater emphasis on naturalistic interaction (mean r=.218, range=[.066, .373]). Notably, the direct focus on personality assessment did not result in a less positive user experience, with participants reporting the interactions to be equally natural, pleasant, engaging, and humanlike across both conditions. A chatbot mimicking ChatGPT's default behavior of acting as a helpful assistant led to markedly inferior personality inferences and lower user experience ratings but still captured psychologically meaningful information for some of the personality traits (mean r=.117, range=[-.004, .209]). Preliminary analyses suggest that the accuracy of personality inferences varies only marginally across different socio-demographic subgroups. Our results highlight the potential of LLMs for psychological profiling based on conversational interactions. We discuss practical implications and ethical challenges associated with these findings.

5/24/2024

cs.HC cs.AI cs.CL cs.CY cs.LG