Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?

Read original: arXiv:2310.11616 - Published 9/12/2024 by David Ilić, Gilles E. Gignac


Overview

Examines the general intelligence factor, known as the "g factor," in large language models
Employs psychometric methods to analyze the underlying cognitive abilities of language models
Investigates whether language models exhibit a general intelligence factor similar to that observed in humans

Plain English Explanation

The paper explores the concept of general intelligence, or the "g factor," as it applies to large language models. In humans, people who score well on one kind of cognitive test tend to score well on others too, a pattern known as the positive manifold; the g factor is the single, overarching ability that statistically accounts for this shared performance across various mental skills. The researchers used psychometric techniques, which are tools commonly used to measure and analyze intelligence in people, to investigate whether language models also exhibit a general intelligence factor.

This is an important question because language models have become increasingly sophisticated and capable of performing a wide range of tasks, leading some to wonder if they possess general intelligence akin to humans. By applying psychometric methods, the researchers aimed to provide insight into the nature of the cognitive abilities underpinning language model performance.

Technical Explanation

The study used factor analysis, a statistical technique that explains the correlations among many observed test scores in terms of a smaller number of underlying (latent) factors, to examine the interrelated cognitive-like capabilities of large language models. The researchers hypothesized that if LLM test scores were positively intercorrelated, forming a positive manifold as in human samples, a single dominant general factor would emerge and explain a substantial portion of the variance in performance across tests, possibly alongside one or more narrower group-level factors.
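To make the hypothesis concrete, here is a minimal sketch in Python (NumPy) of two first checks such an analysis involves: whether every pairwise correlation between tests is positive (a positive manifold), and how much of the total variance the largest eigenvalue of the test correlation matrix captures. The eigenvalue share is a principal-components proxy for a dominant factor, not a full factor analysis. The scores are simulated from a hypothetical one-factor model (591 models, 12 tests, every loading 0.7) and are not the paper's data; the function name `positive_manifold_summary` is illustrative.

```python
import numpy as np

def positive_manifold_summary(scores: np.ndarray) -> dict:
    """scores: rows = models, columns = tests."""
    corr = np.corrcoef(scores, rowvar=False)             # test-by-test correlations
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]  # drop the 1.0 diagonal
    eigvals = np.linalg.eigvalsh(corr)[::-1]              # eigenvalues, largest first
    return {
        "positive_manifold": bool(np.all(off_diag > 0)),
        "mean_r": float(off_diag.mean()),
        "first_eigenvalue": float(eigvals[0]),
        # Sum of eigenvalues equals the number of tests (trace of corr)
        "share_first": float(eigvals[0] / eigvals.sum()),
    }

# Simulated data (hypothetical, NOT from the paper): each test score is
# 0.7 * general ability plus test-specific noise, scaled to unit variance.
rng = np.random.default_rng(0)
n_models, n_tests = 591, 12
general = rng.standard_normal(n_models)
loadings = np.full(n_tests, 0.7)
noise = rng.standard_normal((n_models, n_tests)) * np.sqrt(1 - loadings**2)
scores = general[:, None] * loadings + noise

summary = positive_manifold_summary(scores)
print(summary)
```

With real benchmark scores, a large first eigenvalue together with uniformly positive correlations is the pattern the hypothesis predicts; a proper factor analysis would then estimate loadings rather than relying on the eigenvalue alone.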

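To test this, the researchers analyzed a sample of 591 LLMs, each scored on 12 tests aligned with four ability domains drawn from human intelligence research: fluid reasoning (Gf), domain-specific knowledge (Gkn), reading/writing (Grw), and quantitative knowledge (Gq). Applying factor analysis to these scores, they found strong evidence for a positive manifold and a general factor of ability, as well as a combined Gkn/Grw group-level factor. The number of model parameters correlated positively with both the general factor and the Gkn/Grw factor scores, although the effects showed diminishing returns.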
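Deciding how many factors to retain is a key step in this kind of analysis. The summary does not say which retention rule the authors used, so the sketch below assumes Horn's parallel analysis, a common choice: keep the leading factors whose observed eigenvalues exceed those obtained from random data of the same size. The simulated scores are hypothetical, not the paper's data; they deliberately combine a general factor on all 12 tests with one group factor on half of them, loosely echoing a general factor plus a narrower group factor. The helper names `sorted_eigenvalues` and `parallel_analysis` are illustrative.

```python
import numpy as np

def sorted_eigenvalues(data: np.ndarray) -> np.ndarray:
    """Eigenvalues of the column correlation matrix, largest first."""
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

def parallel_analysis(scores: np.ndarray, n_iter: int = 100,
                      percentile: float = 95.0, seed: int = 0) -> int:
    """Horn's parallel analysis: count leading factors whose observed
    eigenvalue exceeds the given percentile of random-data eigenvalues."""
    rng = np.random.default_rng(seed)
    n, k = scores.shape
    observed = sorted_eigenvalues(scores)
    # Eigenvalues from uncorrelated normal data with the same shape
    random_eigs = np.array([sorted_eigenvalues(rng.standard_normal((n, k)))
                            for _ in range(n_iter)])
    threshold = np.percentile(random_eigs, percentile, axis=0)
    n_factors = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break  # stop at the first eigenvalue that random data can match
        n_factors += 1
    return n_factors

# Simulated scores (hypothetical): 591 models x 12 tests.
# Every test loads 0.6 on a general factor; tests 0-5 also load 0.75
# on a group factor. The remaining variance is test-specific noise.
rng = np.random.default_rng(42)
n_models, n_tests = 591, 12
general = rng.standard_normal(n_models)
group = rng.standard_normal(n_models)
g_load = np.full(n_tests, 0.6)
grp_load = np.where(np.arange(n_tests) < 6, 0.75, 0.0)
unique_sd = np.sqrt(1 - g_load**2 - grp_load**2)
scores = (general[:, None] * g_load + group[:, None] * grp_load
          + rng.standard_normal((n_models, n_tests)) * unique_sd)

print("factors retained:", parallel_analysis(scores))
```

Parallel analysis only suggests how many factors to extract; interpreting them (for example, as a general factor and a knowledge/reading factor) still requires inspecting which tests load on each one.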

Critical Analysis

The paper provides a rigorous psychometric examination of general ability in language models. However, certain limitations leave room for further exploration. For instance, the analysis relied on scores from 12 tests, and it remains to be seen how the findings would extend to a broader range of tasks and capabilities.

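Additionally, the general factor observed in language models, which the authors term an artificial general ability (AGA) factor, may not be directly analogous to the g factor in humans, as the underlying cognitive mechanisms and the nature of intelligence in artificial systems are not fully understood. In particular, the authors leave open whether LLM test performance primarily reflects achievement/expertise rather than intelligence, and they note that while models with more parameters score higher, other characteristics must also be involved. More research is needed to better characterize the general ability of language models and its implications for the development of artificial general intelligence (AGI).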

Conclusion

This paper provides important insights into the nature of intelligence in large language models. By finding a positive manifold and a general factor of ability across LLM test scores, the study suggests that language models possess cognitive-like capabilities that exhibit some similarities to human intelligence. However, the researchers caution that further research is needed to fully understand the implications of these findings and their significance for the broader field of artificial intelligence.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?

David Ilić, Gilles E. Gignac

Large language models (LLMs) are advanced artificial intelligence (AI) systems that can perform a variety of tasks commonly found in human intelligence tests, such as defining words, performing calculations, and engaging in verbal reasoning. There are also substantial individual differences in LLM capacities. Given the consistent observation of a positive manifold and general intelligence factor in human samples, along with group-level factors (e.g., crystallized intelligence), we hypothesized that LLM test scores may also exhibit positive intercorrelations, which could potentially give rise to an artificial general ability (AGA) factor and one or more group-level factors. Based on a sample of 591 LLMs and scores from 12 tests aligned with fluid reasoning (Gf), domain-specific knowledge (Gkn), reading/writing (Grw), and quantitative knowledge (Gq), we found strong empirical evidence for a positive manifold and a general factor of ability. Additionally, we identified a combined Gkn/Grw group-level factor. Finally, the number of LLM parameters correlated positively with both general factor of ability and Gkn/Grw factor scores, although the effects showed diminishing returns. We interpreted our results to suggest that LLMs, like human cognitive abilities, may share a common underlying efficiency in processing information and solving problems, though whether LLMs manifest primarily achievement/expertise rather than intelligence remains to be determined. Finally, while models with greater numbers of parameters exhibit greater general cognitive-like abilities, akin to the connection between greater neuronal density and human general intelligence, other characteristics must also be involved.

9/12/2024
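The abstract reports that parameter count correlated positively with factor scores, but with diminishing returns. It does not say how the authors modelled that curve, so the sketch below assumes one common approach: regressing factor scores on the logarithm of parameter count. That specification builds in diminishing returns, because each tenfold increase in parameters adds the same amount to the predicted score, so each additional parameter adds less as models grow. The data and coefficients are simulated and arbitrary, not taken from the paper; `predicted_score` is an illustrative helper.

```python
import numpy as np

# Simulated (hypothetical) models: log10 parameter counts spread uniformly
# from 1e8 (100M) to 1e11 (100B), with factor scores linear in log10(params).
rng = np.random.default_rng(1)
n_models = 591
log_params = rng.uniform(8, 11, size=n_models)
scores = -3.0 + 0.3 * log_params + rng.normal(0, 0.2, size=n_models)

# Least-squares fit: score = intercept + slope * log10(params)
slope, intercept = np.polyfit(log_params, scores, deg=1)
r = np.corrcoef(log_params, scores)[0, 1]

def predicted_score(n_params: float) -> float:
    return intercept + slope * np.log10(n_params)

# The same +1B parameters buys far less at 100B than at 1B.
gain_at_1b = predicted_score(2e9) - predicted_score(1e9)
gain_at_100b = predicted_score(101e9) - predicted_score(100e9)
print(f"slope per 10x parameters: {slope:.3f}, r = {r:.2f}")
print(f"gain 1B->2B: {gain_at_1b:.4f}, gain 100B->101B: {gain_at_100b:.4f}")
```

A quadratic term in log parameters, or a piecewise fit, would be alternative ways to test for diminishing returns; which specification fits real model data better is an empirical question.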


A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition

Vladimir Cherkassky, Eng Hock Lee

Large Language Models (LLMs) are known for their remarkable ability to generate synthesized 'knowledge', such as text documents, music, images, etc. However, there is a huge gap between LLM and human capabilities for understanding abstract concepts and reasoning. We discuss these issues in a larger philosophical context of human knowledge acquisition and the Turing test. In addition, we illustrate the limitations of LLMs by analyzing GPT-4 responses to questions ranging from science and math to common sense reasoning. These examples show that GPT-4 can often imitate human reasoning, even though it lacks understanding. However, LLM responses are synthesized from a large model trained on all available data. In contrast, human understanding is based on a small number of abstract concepts. Based on this distinction, we discuss the impact of LLMs on acquisition of human knowledge and education.

8/14/2024

How to Measure the Intelligence of Large Language Models?

Nils Korber, Silvan Wehrli, Christopher Irrgang

With the release of ChatGPT and other large language models (LLMs), the discussion about the intelligence, possibilities, and risks of current and future models has received considerable attention. This discussion included much-debated scenarios about the imminent rise of so-called super-human AI, i.e., AI systems that are orders of magnitude smarter than humans. In the spirit of Alan Turing, there is no doubt that current state-of-the-art language models already pass his famous test. Moreover, current models outperform humans in several benchmark tests, so that publicly available LLMs have already become versatile companions that connect everyday life, industry and science. Despite their impressive capabilities, LLMs sometimes fail completely at tasks that are thought to be trivial for humans. In other cases, the trustworthiness of LLMs becomes much more elusive and difficult to evaluate. Taking the example of academia, language models are capable of writing convincing research articles on a given topic with little input. Yet, the lack of trustworthiness in terms of factual consistency or the existence of persistent hallucinations in AI-generated text bodies has led to a range of restrictions for AI-based content in many scientific journals. In view of these observations, the question of whether the same metrics that apply to human intelligence can also be applied to computational methods arises, and it has been discussed extensively. In fact, the choice of metrics has already been shown to dramatically influence assessments of potential intelligence emergence. Here, we argue that the intelligence of LLMs should not only be assessed by task-specific statistical metrics, but separately in terms of qualitative and quantitative measures.

7/31/2024


Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

Vittoria Dentella, Fritz Guenther, Elliot Murphy, Gary Marcus, Evelina Leivada

Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.

7/10/2024