Quantifying the Persona Effect in LLM Simulations

2402.10811

Published 6/18/2024 by Tiancheng Hu, Nigel Collier

🤯

Abstract

Large language models (LLMs) have shown remarkable promise in simulating human language and behavior. This study investigates how integrating persona variables-demographic, social, and behavioral factors-impacts LLMs' ability to simulate diverse perspectives. We find that persona variables account for <10% variance in annotations in existing subjective NLP datasets. Nonetheless, incorporating persona variables via prompting in LLMs provides modest but statistically significant improvements. Persona prompting is most effective in samples where many annotators disagree, but their disagreements are relatively minor. Notably, we find a linear relationship in our setting: the stronger the correlation between persona variables and human annotations, the more accurate the LLM predictions are using persona prompting. In a zero-shot setting, a powerful 70b model with persona prompting captures 81% of the annotation variance achievable by linear regression trained on ground truth annotations. However, for most subjective NLP datasets, where persona variables have limited explanatory power, the benefits of persona prompting are limited.

Create account to get full access

Overview

Large language models (LLMs) have shown remarkable abilities in simulating human language and behavior
This study investigates how integrating persona variables (demographic, social, and behavioral factors) impacts LLMs' ability to simulate diverse perspectives
The researchers found that persona variables account for less than 10% of the variance in annotations in existing subjective NLP datasets
Nonetheless, incorporating persona variables via prompting in LLMs provides modest but statistically significant improvements

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text. Researchers wanted to see if giving these models information about a person's characteristics, such as their age, gender, and personality traits, could help the models better simulate diverse perspectives and views.

The study found that these persona variables only explain a small portion (less than 10%) of the differences in how people annotate or label text in existing NLP (natural language processing) datasets. However, when the researchers explicitly included persona information in the prompts given to the LLMs, it did lead to modest but noticeable improvements in the models' performance.

The persona prompting was most effective in cases where many people disagreed on the annotations, but their disagreements were relatively minor. Interestingly, the researchers also found a direct relationship: the stronger the connection between the persona variables and the human annotations, the more accurate the LLM's predictions were when using the persona prompting.

In a test where the LLM had no prior training on the task (zero-shot), a powerful 70 billion parameter model with persona prompting was able to capture 81% of the annotation variance that a linear regression model trained on the ground truth annotations could achieve. But for most subjective NLP datasets, where the persona variables don't explain much of the differences, the benefits of persona prompting are limited.

Technical Explanation

The researchers investigated how integrating persona variables, including demographic, social, and behavioral factors, impacts large language models' (LLMs') ability to simulate diverse perspectives. They found that persona variables account for less than 10% of the variance in annotations in existing subjective NLP datasets.

However, the team discovered that incorporating persona variables via prompting in LLMs provides modest but statistically significant improvements in performance. This persona prompting was most effective in samples where many annotators disagreed, but their disagreements were relatively minor.

Notably, the researchers observed a linear relationship in their setting: the stronger the correlation between persona variables and human annotations, the more accurate the LLM predictions are using persona prompting. In a zero-shot setting, a powerful 70 billion parameter model with persona prompting captured 81% of the annotation variance achievable by linear regression trained on ground truth annotations.

Yet for most subjective NLP datasets, where persona variables have limited explanatory power, the benefits of persona prompting are constrained.

Critical Analysis

The paper provides valuable insights into the limited ability of large language models to fully simulate diverse human perspectives, even with the incorporation of persona variables. While the modest improvements from persona prompting are noteworthy, the researchers acknowledge that for most subjective NLP datasets, the benefits are limited due to the persona variables' lack of explanatory power.

One potential limitation is the reliance on existing NLP datasets, which may not fully capture the nuances and complexities of real-world human perspectives. Additionally, the study focuses on a specific set of persona variables and does not explore the potential impact of other factors, such as cultural, emotional, or contextual influences, on LLM simulation capabilities.

Further research could investigate alternative approaches to modeling and incorporating diverse human characteristics, as well as exploring the generalizability of these findings across different language models, tasks, and datasets. Examining the ethical implications of LLMs' limitations in simulating human diversity is also an important area for future consideration.

Conclusion

This study highlights the challenges large language models face in fully capturing the diversity of human perspectives, even when integrating persona variables. While persona prompting can provide modest improvements, the benefits are constrained by the limited explanatory power of these variables for most subjective NLP datasets.

As LLMs continue to advance, understanding their limitations in simulating human complexity is crucial. Ongoing research in this area can inform the responsible development and deployment of these powerful AI systems, ensuring they are designed to respect and represent the diverse range of human experiences and viewpoints.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases

Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Isman, Garrick Sherman, Zachary Fried, Jo~ao Sedoc, Lyle H. Ungar, Brenda Curtis

Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences. Thus, employing LLMs (which do not have such human factors) in these tasks may result in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that LLM personas show mixed results when reproducing known human biases, but generate generally fail to demonstrate implicit biases. We conclude that LLMs lack the intrinsic cognitive mechanisms of human thought, while capturing the statistical patterns of how people speak, which may restrict their effectiveness in complex social science applications.

6/21/2024

cs.CL

🏷️

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Nikolay B Petrov, Gregory Serapio-Garc'ia, Jason Rentfrow

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs when using specific demographic profiles, show poor psychometrics properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.

5/14/2024

cs.CL cs.AI cs.CY cs.HC

💬

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship.

4/3/2024

cs.CL cs.AI cs.HC

Are Large Language Models Chameleons?

Mingmeng Geng, Sihong He, Roberto Trotta

Do large language models (LLMs) have their own worldviews and personality tendencies? Simulations in which an LLM was asked to answer subjective questions were conducted more than 1 million times. Comparison of the responses from different LLMs with real data from the European Social Survey (ESS) suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. Methods for measuring the difference between LLMs and survey data are discussed, such as calculating weighted means and a new proposed measure inspired by Jaccard similarity. We conclude that it is important to analyze the robustness and variability of prompts before using LLMs to model individual decisions or collective behavior, as their imitation abilities are approximate at best.

5/30/2024

cs.CL cs.AI cs.CY cs.LG