Development of Cognitive Intelligence in Pre-trained Language Models

Read original: arXiv:2407.01047 - Published 7/15/2024 by Raj Sanjay Shah, Khushi Bhardwaj, Sashank Varma

Overview

This paper introduces "ThinkTank," a new benchmark for evaluating the cognitive intelligence of pre-trained language models.
The benchmark is designed to assess a model's ability to reason, plan, and make inferences, going beyond traditional language understanding tasks.
The researchers evaluate several popular language models on the ThinkTank benchmark and provide insights into their strengths, limitations, and potential for cognitive intelligence.

Plain English Explanation

The paper presents a new way to test what large language models (LLMs) such as GPT-3 or BERT can do beyond processing language. Instead of measuring only how well they understand and generate text, the "ThinkTank" benchmark tests their ability to reason, plan, and draw conclusions, the kinds of cognitive skills humans use.

The researchers ran several popular LLMs through this new benchmark. The models performed well on traditional language tasks but struggled with the more complex cognitive challenges in ThinkTank. This suggests that even the most advanced LLMs today may not match the general intelligence of the human mind.

By developing this new evaluation tool, the researchers hope to push the field of language AI towards models that can not only understand language, but also engage in deeper, more human-like reasoning and problem-solving. This could lead to breakthroughs in areas like artificial general intelligence and help us better understand the nature of cognitive intelligence in large language models.

Technical Explanation

The paper introduces "ThinkTank," a new benchmark for evaluating the cognitive intelligence of pre-trained language models. Unlike traditional language understanding tasks, ThinkTank is designed to assess a model's ability to reason, plan, and make inferences.

The benchmark consists of a diverse set of cognitive tasks, including logical reasoning, abstract problem-solving, and task planning. These tasks are structured to require a combination of language understanding, reasoning, and decision-making skills.

The researchers evaluated several popular language models, including GPT-3, BERT, and T5, on the ThinkTank benchmark. They found that while the models performed well on standard language tasks, they struggled with the more complex cognitive challenges in ThinkTank. This suggests that even the most advanced LLMs today may not truly exhibit the same level of general intelligence as the human mind.

The paper also discusses the implications of these findings for the field of artificial general intelligence and the need to develop language models that can engage in deeper, more human-like reasoning and problem-solving. The researchers argue that the ThinkTank benchmark provides a valuable tool for assessing the nature of cognitive intelligence in large language models and guiding future research in this direction.

Critical Analysis

The paper makes a compelling case for the need to move beyond traditional language understanding tasks and develop more comprehensive benchmarks for evaluating the cognitive intelligence of pre-trained language models. The ThinkTank benchmark represents an important step in this direction, providing a diverse set of cognitive challenges that go beyond simple language processing.

However, the paper also acknowledges several limitations of the current study. The researchers note that the benchmark may not capture the full range of cognitive abilities exhibited by humans, and that further refinement and expansion of the tasks may be necessary. Additionally, the evaluation was limited to a relatively small set of language models, and it would be valuable to see how a wider range of models perform on the benchmark.

Another potential concern is the degree to which the ThinkTank tasks truly reflect the kind of cognitive intelligence exhibited by humans. While the tasks are designed to be challenging and require reasoning and planning, it's possible that they still fall short of capturing the full complexity and flexibility of human cognition. Exploring the connections between language models and human cognitive processes may be a fruitful area for further research.

Despite these limitations, the ThinkTank benchmark represents an important step forward in the ongoing efforts to develop more comprehensive and meaningful evaluations of language model capabilities. As the field of artificial intelligence continues to make progress, it will be crucial to have tools like ThinkTank that can help us better understand the strengths, limitations, and potential of these models.

Conclusion

The paper introduces a new benchmark, "ThinkTank," for evaluating the cognitive intelligence of pre-trained language models. Unlike traditional language understanding tasks, ThinkTank is designed to assess a model's ability to reason, plan, and make inferences.

The researchers found that while popular language models like GPT-3 and BERT performed well on standard language tasks, they struggled with the more complex cognitive challenges in ThinkTank. This suggests that even the most advanced LLMs today may not truly exhibit the same level of general intelligence as the human mind.

By developing the ThinkTank benchmark, the researchers hope to push the field of language AI towards models that can engage in deeper, more human-like reasoning and problem-solving. This could lead to breakthroughs in areas like artificial general intelligence and help us better understand the nature of cognitive intelligence in large language models.

This summary was produced with help from an AI and may contain inaccuracies. Check details against the original paper (arXiv:2407.01047).

Related Papers

Development of Cognitive Intelligence in Pre-trained Language Models

Raj Sanjay Shah, Khushi Bhardwaj, Sashank Varma

Recent studies show evidence for emergent cognitive abilities in Large Pre-trained Language Models (PLMs). The increasing cognitive alignment of these models has made them candidates for cognitive science theories. Prior research into the emergent cognitive abilities of PLMs has largely been path-independent to model training, i.e., has focused on the final model weights and not the intermediate steps. However, building plausible models of human cognition using PLMs would benefit from considering the developmental alignment of their performance during training to the trajectories of children's thinking. Guided by psychometric tests of human intelligence, we choose four sets of tasks to investigate the alignment of ten popular families of PLMs and evaluate their available intermediate and final training steps. These tasks are Numerical ability, Linguistic abilities, Conceptual understanding, and Fluid reasoning. We find a striking regularity: regardless of model size, the developmental trajectories of PLMs consistently exhibit a window of maximal alignment to human cognitive development. Before that window, training appears to endow blank slate models with the requisite structure to be poised to rapidly learn from experience. After that window, training appears to serve the engineering goal of reducing loss but not the scientific goal of increasing alignment with human cognition.

7/15/2024

CogLM: Tracking Cognitive Development of Large Language Models

Xinglin Wang, Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Boyuan Pan, Heda Wang, Yao Hu, Kan Li

Piaget's Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) Human-like cognitive abilities have emerged in advanced LLMs (GPT-4), comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.

8/20/2024

Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths

The observed similarities in the behavior of humans and Large Language Models (LLMs) have prompted researchers to consider the potential of using LLMs as models of human cognition. However, several significant challenges must be addressed before LLMs can be legitimately regarded as cognitive models. For instance, LLMs are trained on far more data than humans typically encounter, and may have been directly trained on human data in specific cognitive tasks or aligned with human preferences. Consequently, the origins of these behavioral similarities are not well understood. In this paper, we propose a novel way to enhance the utility of LLMs as cognitive models. This approach involves (i) leveraging computationally equivalent tasks that both an LLM and a rational agent need to master for solving a cognitive problem and (ii) examining the specific task distributions required for an LLM to exhibit human-like behaviors. We apply this approach to decision-making -- specifically risky and intertemporal choice -- where the key computationally equivalent task is the arithmetic of expected value calculations. We show that an LLM pretrained on an ecologically valid arithmetic dataset, which we call Arithmetic-GPT, predicts human behavior better than many traditional cognitive models. Pretraining LLMs on ecologically valid arithmetic datasets is sufficient to produce a strong correspondence between these models and human decision-making. Our results also suggest that LLMs used as cognitive models should be carefully investigated via ablation studies of the pretraining data.

5/30/2024

Do Large Language Models Mirror Cognitive Language Processing?

Yuqi Ren, Renren Jin, Tongxuan Zhang, Deyi Xiong

Large Language Models (LLMs) have demonstrated remarkable abilities in text comprehension and logical reasoning, indicating that the text representations learned by LLMs can facilitate their language processing capabilities. In cognitive science, brain cognitive processing signals are typically utilized to study human language processing. Therefore, it is natural to ask how well the text embeddings from LLMs align with the brain cognitive processing signals, and how training strategies affect the LLM-brain alignment? In this paper, we employ Representational Similarity Analysis (RSA) to measure the alignment between 23 mainstream LLMs and fMRI signals of the brain to evaluate how effectively LLMs simulate cognitive language processing. We empirically investigate the impact of various factors (e.g., pre-training data size, model scaling, alignment training, and prompts) on such LLM-brain alignment. Experimental results indicate that pre-training data size and model scaling are positively correlated with LLM-brain similarity, and alignment training can significantly improve LLM-brain similarity. Explicit prompts contribute to the consistency of LLMs with brain cognitive language processing, while nonsensical noisy prompts may attenuate such alignment. Additionally, the performance of a wide range of LLM evaluations (e.g., MMLU, Chatbot Arena) is highly correlated with the LLM-brain similarity.

5/29/2024
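
The last abstract above names Representational Similarity Analysis (RSA) as its alignment measure without describing the computation. The sketch below shows one common form of RSA: build a dissimilarity value for every pair of stimuli in each representation, then rank-correlate the two sets of values. It assumes 1 minus Pearson correlation as the dissimilarity and a Spearman comparison, which are frequent choices but are not confirmed as that paper's settings. The function names and the toy numbers are hypothetical and stand in for real LLM embeddings and fMRI responses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def dissimilarities(features):
    """Upper triangle of the dissimilarity matrix: 1 - Pearson per stimulus pair."""
    pairs = combinations(range(len(features)), 2)
    return [1.0 - pearson(features[i], features[j]) for i, j in pairs]


def rsa_score(model_features, brain_features):
    """Spearman correlation (Pearson on ranks) between the two dissimilarity sets."""
    model_d = dissimilarities(model_features)
    brain_d = dissimilarities(brain_features)
    return pearson(ranks(model_d), ranks(brain_d))


# Toy data (hypothetical): 4 stimuli; one row per stimulus.
# The two representations may have different widths (3 dims vs. 4 voxels).
model = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7], [0.2, 0.8, 0.9]]
brain = [[1.0, 0.2, 0.1, 0.3], [0.9, 0.3, 0.2, 0.2],
         [0.1, 0.8, 0.9, 0.7], [0.2, 0.9, 0.8, 0.6]]

print(rsa_score(model, brain))
```

Because the comparison happens between pairwise dissimilarities rather than raw features, the two representations only need to cover the same stimuli in the same order; their dimensionalities can differ.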