Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts

2406.19497

Published 7/1/2024 by Naseela Pervez, Alexander J. Titus

Abstract

Large language models (LLMs) are increasingly utilized to assist in scientific and academic writing, helping authors enhance the coherence of their articles. Previous studies have highlighted stereotypes and biases present in LLM outputs, emphasizing the need to evaluate these models for their alignment with human narrative styles and potential gender biases. In this study, we assess the alignment of three prominent LLMs - Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash - by analyzing their performance on benchmark text-generation tasks for scientific abstracts. We employ the Linguistic Inquiry and Word Count (LIWC) framework to extract lexical, psychological, and social features from the generated texts. Our findings indicate that, while these models generally produce text closely resembling human-authored content, variations in stylistic features suggest significant gender biases. This research highlights the importance of developing LLMs that maintain a diversity of writing styles to promote inclusivity in academic discourse.

Overview

The paper investigates the personality traits and gender bias exhibited in the language of scientific abstracts generated by large language models (LLMs).
It examines the ability of LLMs to generate text with inclusive and diverse personality traits, and analyzes the presence of gender biases in the generated text.
The study uses the Linguistic Inquiry and Word Count (LIWC) tool to analyze the personality traits and gender biases in the abstracts.

Plain English Explanation

This research paper explores the characteristics of the language used in scientific abstracts that are generated by large language models (LLMs) - powerful AI systems that can produce human-like text. Specifically, the researchers are interested in understanding whether these LLMs can generate text that exhibits a diverse range of personality traits, and whether there are any gender biases present in the language they produce.

To do this, the researchers use a tool called Linguistic Inquiry and Word Count (LIWC) to analyze the language in the generated abstracts. LIWC counts the words in a text that fall into predefined categories and reports scores for features such as emotional tone, analytical thinking, and social engagement. The researchers then look for patterns in these scores and compare them to see whether there are any gender-based differences.

The findings from this study can help us better understand the capabilities and limitations of large language models when it comes to producing inclusive and unbiased text. This is an important consideration as these models are increasingly being used in various applications, from text generation to content creation. By identifying potential biases, researchers and developers can work to improve the fairness and inclusivity of these powerful AI systems.

Technical Explanation

The researchers analyzed scientific abstracts generated by three LLMs (Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash) for personality traits and gender biases. They employed the Linguistic Inquiry and Word Count (LIWC) tool, widely used software for analyzing the psychological and linguistic properties of written text.
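To make the idea of dictionary-based text analysis concrete, here is a minimal Python sketch of LIWC-style scoring. The real LIWC dictionaries are proprietary and much larger, so the lexicon, the category names, and the function category_rates below are hypothetical stand-ins rather than the tool's actual contents or API.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only);
# the real LIWC dictionaries are proprietary and far more extensive.
TOY_LEXICON = {
    "positive_emotion": {"promising", "robust", "novel", "improve"},
    "cognitive": {"because", "therefore", "suggest", "indicate"},
    "social": {"we", "our", "community", "collaborate"},
}


def category_rates(text, lexicon=TOY_LEXICON):
    """Return, per category, the percentage of tokens found in that category."""
    # Lowercase the text and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in lexicon}

    counts = Counter()
    for token in tokens:
        for name, words in lexicon.items():
            if token in words:
                counts[name] += 1

    # Express each count as a percentage of all tokens in the text.
    return {name: 100.0 * counts[name] / len(tokens) for name in lexicon}


if __name__ == "__main__":
    sample = "We suggest our novel method"
    for name, rate in category_rates(sample).items():
        print(f"{name}: {rate:.1f}%")
```

Scores are reported as a share of total words so that abstracts of different lengths can be compared on the same scale.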

The LIWC analysis examined various dimensions of personality, including emotional tone, analytical thinking, clout (social status and confidence), authenticity, and linguistic style. The researchers then compared the personality trait profiles of abstracts generated by different LLMs, as well as between abstracts attributed to male and female authors.
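The summary does not say which statistical procedure the authors used for these comparisons, so the following is only an illustrative sketch of one common option: Cohen's d, a standardized difference between the mean scores of two groups. The function name cohens_d and the example scores are hypothetical placeholders, not data or methods taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two samples, using pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")

    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have no variance; effect size is undefined")

    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Placeholder scores invented for illustration; not results from the study.
    scores_group_a = [2.0, 4.0, 6.0]
    scores_group_b = [1.0, 3.0, 5.0]
    print(f"Cohen's d: {cohens_d(scores_group_a, scores_group_b):.2f}")
```

An effect size is used here instead of a raw difference because it puts features measured on different scales on a common footing.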

The results showed that the LLMs exhibited a range of personality traits in the generated abstracts, with some models producing text that was more analytical, while others generated content that was more emotional or socially engaged. However, the researchers also found evidence of gender biases, with abstracts attributed to male authors generally displaying higher levels of analytical thinking and clout, while those attributed to female authors tended to be more authentic and emotionally expressive.

These findings suggest that while LLMs can generate text with diverse personality characteristics, they may also perpetuate certain gender-based linguistic stereotypes and biases. The researchers highlight the importance of addressing these biases to ensure that the language produced by these powerful AI systems is inclusive and representative of the full range of human diversity.

Critical Analysis

The researchers provide a thorough and methodical investigation of the personality traits and gender biases present in the language generated by large language models. The Linguistic Inquiry and Word Count (LIWC) tool is a well-established means of analyzing the psychological and linguistic properties of text, which lends credibility to the study's findings.

However, the research is limited to scientific abstracts, which may not represent the full range of text that LLMs can generate. Additionally, the study does not examine the specific mechanisms within the language models that may contribute to the observed gender differences. Further research would be needed to understand the underlying causes and to develop strategies for mitigating these biases.

It's also worth considering the potential impact of these findings on the real-world applications of LLMs, such as content creation and STEM education. The presence of gender biases in the language generated by these models could have significant implications for how they are deployed and used, and highlights the need for continued efforts to ensure the fairness and inclusivity of these powerful AI systems.

Conclusion

This research paper provides valuable insights into the personality traits and gender biases exhibited in the language generated by large language models. The findings suggest that while LLMs can produce text with diverse personality characteristics, they may also perpetuate certain gender-based linguistic stereotypes and biases.

These insights are particularly important as LLMs become more widely used in various applications, from text generation to content creation. By understanding the potential biases and limitations of these models, researchers and developers can work to improve their fairness and inclusivity, ensuring that the language they produce is representative of the full range of human diversity and experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluation of Large Language Models: STEM education and Gender Stereotypes

Smilla Due, Sneha Das, Marianne Andersen, Berta Plandolit López, Sniff Andersen Nexø, Line Clemmensen

Large Language Models (LLMs) have an increasing impact on our lives with use cases such as chatbots, study support, coding support, ideation, writing assistance, and more. Previous studies have revealed linguistic biases in pronouns used to describe professions or adjectives used to describe men vs women. These issues have to some degree been addressed in updated LLM versions, at least to pass existing tests. However, biases may still be present in the models, and repeated use of gender stereotypical language may reinforce the underlying assumptions and is therefore important to examine further. This paper investigates gender biases in LLMs in relation to educational choices through an open-ended, true to user-case experimental design and a quantitative analysis. We investigate the biases in the context of four different cultures, languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important educational transition points in the different countries. We find that there are significant and large differences in the ratio of STEM to non-STEM suggested education paths provided by ChatGPT when using typical girl vs boy names to prompt lists of suggested things to become. There are generally fewer STEM suggestions in the Danish, Spanish, and Indian context compared to the English. We also find subtle differences in the suggested professions, which we categorise and report.

6/17/2024

cs.CL cs.AI

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, Jad Kabbara

Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship.

4/3/2024

cs.CL cs.AI cs.HC

Assessing the nature of large language models: A caution against anthropocentrism

Ann Speed

Generative AI models garnered a large amount of public attention and speculation with the release of OpenAI's chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT-3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models' capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although their ability to respond to personality inventories is interesting. GPT-3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.

6/28/2024

cs.AI cs.CL cs.CY cs.HC

Bias of AI-Generated Content: An Examination of News Produced by Large Language Models

Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, Xiaohang Zhao

Large language models (LLMs) have the potential to transform our lives and work through the content they generate, known as AI-Generated Content (AIGC). To harness this transformation, we need to understand the limitations of LLMs. Here, we investigate the bias of AIGC produced by seven representative LLMs, including ChatGPT and LLaMA. We collect news articles from The New York Times and Reuters, both known for their dedication to provide unbiased news. We then apply each examined LLM to generate news content with headlines of these news articles as prompts, and evaluate the gender and racial biases of the AIGC produced by the LLM by comparing the AIGC and the original news articles. We further analyze the gender bias of each LLM under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. Our study reveals that the AIGC produced by each examined LLM demonstrates substantial gender and racial biases. Moreover, the AIGC generated by each LLM exhibits notable discrimination against females and individuals of the Black race. Among the LLMs, the AIGC generated by ChatGPT demonstrates the lowest level of bias, and ChatGPT is the sole model capable of declining content generation when provided with biased prompts.

4/5/2024

cs.AI