Revisiting the Reliability of Psychological Scales on Large Language Models

    Read original: arXiv:2305.19926 - Published 10/7/2024 by Jen-tse Huang, Wenxiang Jiao, Man Ho Lam, Eric John Li, Wenxuan Wang, Michael R. Lyu

    Overview

    • The research examines the characteristics of Large Language Models (LLMs) from a psychological perspective.
    • Administering personality tests to LLMs has emerged as an area of interest to understand their behavioral patterns.
    • There is an ongoing debate about the suitability of using psychological scales designed for humans on LLMs.
    • The study aims to determine the reliability of applying personality assessments to LLMs and investigate whether they demonstrate consistent personality traits.

    Plain English Explanation

    The study explores how well Large Language Models (LLMs) like GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1 can be evaluated using personality tests. Personality tests are typically designed for humans, so the researchers wanted to see if they would also work for these AI language models.

    The researchers analyzed 2,500 settings for each LLM and found that the models showed consistent responses on the Big Five Inventory, a common personality assessment. This suggests that the personality tests can be reliably used to understand the behavioral characteristics of these LLMs.

    Additionally, the researchers explored how the GPT-3.5 model can be instructed to emulate different personalities. This could be useful in social science research, where LLMs could potentially be used instead of human participants, reducing the costs of studies.

    Technical Explanation

    The researchers conducted an analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, to investigate the reliability of applying personality assessments to LLMs. They specifically looked at the Big Five Inventory, a widely used personality test, and found that the LLMs demonstrated consistent responses, indicating a satisfactory level of reliability.
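    As a rough illustration of what "consistent responses" can mean in practice, the sketch below scores a questionnaire under several prompt settings and looks at how much the trait score varies. It is a minimal sketch under stated assumptions, not the authors' procedure: the four-item scoring key, the response data, and the helper names `trait_score` and `consistency` are all hypothetical, and the summary does not state which reliability statistic the paper uses. The only parts taken from standard practice are the 1-5 Likert scale and the flipping of reverse-keyed items.

```python
from statistics import mean, pstdev

# Hypothetical scoring key for a toy 4-item trait scale (NOT the real BFI key).
# Items mapped to True are reverse-keyed: on a 1-5 scale, score = 6 - response.
TOY_KEY = {1: False, 2: True, 3: False, 4: True}
SCALE_MAX = 5


def trait_score(responses, key=TOY_KEY):
    """Average the item responses after flipping reverse-keyed items."""
    adjusted = [
        (SCALE_MAX + 1 - responses[item]) if reverse else responses[item]
        for item, reverse in key.items()
    ]
    return mean(adjusted)


def consistency(scores):
    """Return (mean, population standard deviation) of scores across settings."""
    return mean(scores), pstdev(scores)


# Made-up responses from three hypothetical prompt settings
# (for example, different instruction wordings or item orders).
settings = [
    {1: 4, 2: 2, 3: 5, 4: 1},
    {1: 4, 2: 2, 3: 4, 4: 2},
    {1: 5, 2: 1, 3: 4, 4: 2},
]

scores = [trait_score(r) for r in settings]
avg, spread = consistency(scores)
print(f"scores per setting: {scores}")
print(f"mean = {avg:.2f}, std dev = {spread:.2f}")
```

    A small spread across settings is one simple sign that the measurement does not depend much on how the questionnaire is presented; a large spread would suggest the opposite.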

    Furthermore, the study explored the potential of the GPT-3.5 model to emulate diverse personalities and represent various groups. This capability is increasingly sought after in social sciences, as LLMs could potentially be used as substitutes for human participants in research studies, reducing the associated costs.
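    To make the idea of steering a model with prompt instructions concrete, here is a small sketch that assembles a persona instruction and a questionnaire item into one prompt string. Everything in it is an assumption made for illustration: the template wording, the `build_persona_prompt` helper, and the example statement are hypothetical and are not taken from the paper. The sketch only builds text; sending it to a model is left out on purpose.

```python
# The five trait names of the Big Five model.
BIG_FIVE = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def build_persona_prompt(levels, statement):
    """Build a prompt asking a model to answer one item as a given persona.

    `levels` maps trait names to a level such as "high" or "low".
    The template wording below is hypothetical, not the paper's prompt.
    """
    unknown = set(levels) - set(BIG_FIVE)
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    description = ", ".join(f"{level} {trait}" for trait, level in levels.items())
    return (
        f"You are a person with {description}. "
        "Rate how much you agree with the statement on a scale from 1 "
        "(disagree strongly) to 5 (agree strongly). "
        "Reply with a single number.\n"
        f"Statement: {statement}"
    )


prompt = build_persona_prompt(
    {"extraversion": "high", "neuroticism": "low"},
    "I see myself as someone who is talkative.",
)
print(prompt)
```

    Repeating this for every item and every persona, then scoring the replies, would show whether the instructed persona shifts the trait scores in the intended direction.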

    Critical Analysis

    The research provides valuable insights into the psychological characteristics of LLMs and the potential for using personality assessments to understand their behavior. However, the authors acknowledge that the suitability of employing human-centric psychological scales on LLMs is an ongoing debate, and further research is needed to fully address this issue.

    Additionally, the study focuses on a limited set of LLMs and does not explore the potential differences in personality traits across a wider range of models. It would be beneficial to expand the research to include a more diverse set of LLMs to gain a more comprehensive understanding of their behavioral characteristics.

    Conclusion

    The study demonstrates that LLMs can exhibit consistent personality traits, as measured by the Big Five Inventory, indicating the potential for reliable personality assessments of these AI systems. This finding opens up possibilities for using LLMs as substitutes for human participants in social science research, potentially reducing the costs associated with traditional studies.

    However, the suitability of applying human-centric psychological scales to LLMs remains a topic of ongoing discussion, and further research is needed to fully explore the limitations and implications of this approach.



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

    Related Papers

    Revisiting the Reliability of Psychological Scales on Large Language Models

    Jen-tse Huang, Wenxiang Jiao, Man Ho Lam, Eric John Li, Wenxuan Wang, Michael R. Lyu

    Recent research has focused on examining Large Language Models' (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups, a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.

    10/7/2024

    Challenging the Validity of Personality Tests for Large Language Models

    Tom Suhr, Florian E. Dorner, Samira Samadi, Augustin Kelava

    With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items (I am introverted vs. I am extraverted) are often both answered affirmatively. Furthermore, variation across prompts designed to steer LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' personality.

    6/6/2024

    Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality

    Bojana Bodroza, Bojana M. Dinic, Ljubisa Bojic

    As Large Language Models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users, their societal impact inevitably expands. This leads to the rising necessity for comprehensive studies to fully understand LLMs and reveal their potential opportunities, drawbacks, and overall societal impact. With that in mind, this research conducted an extensive investigation into seven LLMs, aiming to assess the temporal stability and inter-rater agreement of their responses on personality instruments at two time points. In addition, the LLMs' personality profile was analyzed and compared to human normative data. The findings revealed varying levels of inter-rater agreement in the LLMs' responses over a short time, with some LLMs showing higher agreement (e.g., Llama3 and GPT-4o) compared to others (e.g., GPT-4 and Gemini). Furthermore, agreement depended on the instruments used as well as on domain or trait. This implies variable robustness in LLMs' ability to reliably simulate stable personality characteristics. In the case of scales which showed at least fair agreement, LLMs displayed mostly a socially desirable profile in both agentic and communal domains, as well as a prosocial personality profile reflected in higher agreeableness and conscientiousness and lower Machiavellianism. Exhibiting temporal stability and coherent responses on personality traits is crucial for AI systems due to their societal impact and AI safety concerns.

    7/30/2024

    PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

    Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

    Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship.

    4/3/2024