Psychometric Predictive Power of Large Language Models

2311.07484

Published 4/16/2024 by Tatsuki Kuribayashi, Yohei Oseki, Timothy Baldwin

Abstract

Instruction tuning aligns the response of large language models (LLMs) with human preferences. Despite such efforts in human-LLM alignment, we find that instruction tuning does not always make LLMs human-like from a cognitive modeling perspective. More specifically, next-word probabilities estimated by instruction-tuned LLMs are often worse at simulating human reading behavior than those estimated by base LLMs. In addition, we explore prompting methodologies for simulating human reading behavior with LLMs. Our results show that prompts reflecting a particular linguistic hypothesis improve psychometric predictive power, but are still inferior to small base models. These findings highlight that recent advancements in LLMs, i.e., instruction tuning and prompting, do not offer better estimates than direct probability measurements from base LLMs in cognitive modeling. In other words, pure next-word probability remains a strong predictor for human reading behavior, even in the age of LLMs.

Overview

Instruction tuning aligns large language models (LLMs) with human preferences, but does not always make them more human-like
LLMs tuned with instructions often perform worse than base LLMs at simulating human reading behavior
Prompts reflecting a particular linguistic hypothesis improve psychometric predictive power, but still fall short of small base LLMs

Plain English Explanation

Large language models (LLMs) are AI systems that can generate human-like text. One common technique, called instruction tuning, adjusts these models so that their responses better match human preferences.

However, this research found that instruction tuning doesn't always achieve that goal. The next-word probabilities estimated by instruction-tuned LLMs were often worse at simulating how humans actually read and process language, compared to the base LLMs before tuning.

The researchers also explored using different prompting techniques to get the LLMs to better model human reading. But even the best prompting methods still couldn't match the performance of the smaller, base LLMs when it came to predicting human reading behavior.

These findings suggest that, despite recent advancements, plain next-word probabilities from base LLMs remain a stronger predictor of human reading behavior than those from instruction-tuned or prompted models. In other words, pure next-word probability is still a powerful tool for understanding how humans read and comprehend language.

Technical Explanation

This paper investigates the relationship between instruction tuning of large language models (LLMs) and their ability to model human cognitive processes, specifically human reading behavior.

The researchers compared the next-word probability estimates of instruction-tuned LLMs to those of base LLMs, using a metric called psychometric predictive power (PPP) to measure how well each model's probabilities predict human reading behavior. They found that instruction tuning often degraded performance on this cognitive modeling task, with instruction-tuned LLMs showing lower PPP (worse performance) than the base LLMs.
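As a rough illustration of how psychometric predictive power can be computed, the sketch below assumes a common formulation that this summary does not spell out: each word's probability is converted to surprisal, and PPP is taken as the per-token gain in log-likelihood when surprisal is added to a baseline linear regression of reading times. The function names, the choice of baseline features, and the Gaussian regression are assumptions made for this example, not the authors' exact setup.

```python
import numpy as np

def surprisal(probs):
    """Surprisal in bits: -log2 p(word | context)."""
    return -np.log2(np.asarray(probs, dtype=float))

def loglik_per_token(features, reading_times):
    """Fit ordinary least squares and return the mean Gaussian
    log-likelihood per token (in-sample, ML variance estimate)."""
    y = np.asarray(reading_times, dtype=float)
    X = np.column_stack([np.ones(len(y)), features])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)
    return -0.5 * (np.log(2 * np.pi * sigma2) + 1.0)

def ppp(reading_times, baseline_features, word_probs):
    """Assumed PPP definition: log-likelihood gain per token from
    adding surprisal to the baseline regression. Higher is better."""
    s = surprisal(word_probs)
    base = loglik_per_token(baseline_features, reading_times)
    full = loglik_per_token(np.column_stack([baseline_features, s]),
                            reading_times)
    return full - base
```

With this definition, a model whose probabilities track human reading times more closely yields a larger PPP, which is the sense in which base LLMs are reported to outperform instruction-tuned ones. A real analysis would typically use richer baseline features and held-out data.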

The researchers also explored using different prompting methodologies to improve the LLMs' ability to model human reading. They found that prompts designed to reflect specific linguistic hypotheses could improve PPP, but still fell short of the performance of the smaller base LLMs.
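One way to picture prompt-based estimation is that the prompt is prepended to the context, so each word's probability is conditioned on both the prompt and the preceding words. The sketch below shows only that mechanic. The `next_word_prob` interface and `toy_model` are hypothetical stand-ins (a real study would query an LLM), and the prompts used in the paper are not reproduced in this summary.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns P(word | context tokens).
NextWordProb = Callable[[Sequence[str], str], float]

def word_surprisals(words: Sequence[str],
                    next_word_prob: NextWordProb,
                    prompt: Sequence[str] = ()) -> List[float]:
    """Surprisal (bits) of each word, conditioned on an optional
    prompt followed by the words read so far."""
    context = list(prompt)
    result = []
    for word in words:
        p = next_word_prob(context, word)
        result.append(-math.log2(p))
        context.append(word)
    return result

def toy_model(context: Sequence[str], word: str) -> float:
    """Placeholder, not a real language model: words already seen
    in the context are treated as more predictable."""
    return 0.5 if word in context else 0.25

sentence = ["the", "dog"]
plain = word_surprisals(sentence, toy_model)
prompted = word_surprisals(sentence, toy_model, prompt=["dog"])
```

Here the prompt lowers the surprisal of "dog" because the toy model sees it in the context. Either list of surprisals could then be fed into the same PPP-style evaluation against reading times.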

These results highlight that recent advancements in LLMs, like instruction tuning and sophisticated prompting, do not necessarily translate to better cognitive modeling capabilities. The fundamental next-word probability estimates from the base LLMs appear to be a stronger predictor of human reading behavior than the more complex, tuned models.

Critical Analysis

The paper provides a nuanced and thought-provoking look at the relationship between instruction tuning of LLMs and their ability to model human cognitive processes. The finding that instruction tuning can actually degrade performance on cognitive modeling tasks is particularly intriguing and runs counter to the general narrative around the benefits of such tuning.

One potential limitation is that the study relies on next-word probability, evaluated through PPP, as its proxy for human reading behavior. While this approach is widely used in the field, it may not capture the full complexity of human language processing. Expanding the analysis to include other cognitive modeling tasks could provide a more comprehensive understanding of the models' capabilities.

Additionally, the paper does not delve deeply into the reasons why instruction tuning might degrade cognitive modeling performance. Further investigation into the underlying mechanisms and the training differences between base and tuned LLMs could shed light on this phenomenon and guide future model development.

Overall, this research challenges the assumption that making LLMs more "human-like" through instruction tuning necessarily leads to better cognitive modeling. It raises important questions about the relationship between language model capabilities and human-like behavior, and encourages a critical examination of the strengths and limitations of current LLM approaches.

Conclusion

This paper presents a thought-provoking exploration of the relationship between instruction tuning of large language models (LLMs) and their ability to model human cognitive processes, specifically human reading behavior. The key finding - that instruction tuning does not always improve, and can even degrade, an LLM's performance on cognitive modeling tasks - challenges the common assumption that making these models more "human-like" is always beneficial.

The research highlights the importance of looking beyond just improving language generation and considering how model advancements translate to better simulations of human cognition. It suggests that the fundamental next-word probability estimates in base LLMs may be a stronger predictor of human reading behavior than the more sophisticated, tuned models.

These insights have important implications for the development of language models and their application in cognitive science and psychology. They encourage researchers to critically examine the connection between language model capabilities and human-like behavior, and to explore alternative paths forward in aligning AI systems with human cognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimizing Psychological Counseling with Instruction-Tuned Large Language Models

Wenjie Li, Tianyu Sun, Kun Qian, Wenhong Wang

The advent of large language models (LLMs) has significantly advanced various fields, including natural language processing and automated dialogue systems. This paper explores the application of LLMs in psychological counseling, addressing the increasing demand for mental health services. We present a method for instruction tuning LLMs with specialized prompts to enhance their performance in providing empathetic, relevant, and supportive responses. Our approach involves developing a comprehensive dataset of counseling-specific prompts, refining them through feedback from professional counselors, and conducting rigorous evaluations using both automatic metrics and human assessments. The results demonstrate that our instruction-tuned model outperforms several baseline LLMs, highlighting its potential as a scalable and accessible tool for mental health support.

6/21/2024

cs.CL cs.AI

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, Dong Yu

Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.

4/5/2024

cs.CL cs.AI cs.LG

Bayesian Statistical Modeling with Predictors from LLMs

Michael Franke, Polina Tsvilodub, Fausto Carcassi

State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item-level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition-level, for which they are, however, not designed to, or usually used to, make predictions in the first place.

6/14/2024

cs.CL

I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Xuan Ren, Biao Wu, Lingqiao Liu

This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by an LLM often yields better results than using responses generated by humans. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that this is simply due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more familiar with LLM-generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of the familiarity and our conclusion reveals that this familiarity significantly impacts learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model's capabilities in other tasks after fine-tuning on a specific task.

6/4/2024

cs.CL cs.AI