Scaling Laws for Fact Memorization of Large Language Models

2406.15720

Published 6/26/2024 by Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuanjing Huang, Xipeng Qiu

Scaling Laws for Fact Memorization of Large Language Models

Abstract

Fact knowledge memorization is crucial for Large Language Models (LLM) to generate factual and reliable responses. However, the behaviors of LLM fact memorization remain under-explored. In this paper, we analyze the scaling laws for LLM's fact knowledge and LLMs' behaviors of memorizing different types of facts. We find that LLMs' fact knowledge capacity has a linear and negative exponential law relationship with model size and training epochs, respectively. Estimated by the built scaling law, memorizing the whole Wikidata's facts requires training an LLM with 1000B non-embed parameters for 100 epochs, suggesting that using LLMs to memorize all public facts is almost implausible for a general pre-training setting. Meanwhile, we find that LLMs can generalize on unseen fact knowledge and its scaling law is similar to general pre-training. Additionally, we analyze the compatibility and preference of LLMs' fact memorization. For compatibility, we find LLMs struggle with memorizing redundant facts in a unified way. Only when correlated facts have the same direction and structure, the LLM can compatibly memorize them. This shows the inefficiency of LLM memorization for redundant facts. For preference, the LLM pays more attention to memorizing more frequent and difficult facts, and the subsequent facts can overwrite prior facts' memorization, which significantly hinders low-frequency facts memorization. Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.

Create account to get full access

Overview

This paper investigates how the fact memorization capacity of large language models (LLMs) scales with their size and training epochs.
The researchers aim to answer the research question: "How does an LLM's fact knowledge capacity scale with its size and training epochs?"
They present a series of experiments to understand the relationship between LLM size, training time, and factual knowledge recall.

Plain English Explanation

The researchers wanted to understand how the amount of factual knowledge an LLM can remember changes as the model gets bigger and is trained for longer. They conducted experiments to measure how well different sized LLMs could recall specific facts after being trained for different amounts of time.

The key finding is that an LLM's ability to memorize facts scales as a power law with its size and training time. This means that as an LLM gets larger and is trained for longer, its factual knowledge capacity increases, but the rate of increase slows down over time.

For example, doubling the size of an LLM might lead to a 50% increase in its fact recall ability, while doubling the training time might only lead to a 20% increase. This shows there are diminishing returns as LLMs get larger and train for longer.

Understanding these scaling laws is important because it can help predict the factual knowledge capacity of future, even larger LLMs. It also highlights the limitations of relying on ever-bigger LLMs to achieve perfect fact recall, and suggests the need for complementary approaches to imbuing LLMs with accurate and comprehensive knowledge.

Technical Explanation

The researchers conducted a series of experiments to measure the fact memorization capacity of LLMs as a function of model size and training epochs. They used a dataset of 100,000 general knowledge facts and trained LLMs of varying sizes (from 125M to 175B parameters) for different numbers of training epochs.

To assess fact recall, they queried the trained LLMs with the 100,000 facts and measured the proportion of facts the model could correctly reproduce. Their analysis showed that fact memorization capacity scales as a power law with both model size and training time.

Specifically, they found that doubling an LLM's size leads to a 50% increase in fact recall, while doubling the training time leads to only a 20% increase. This demonstrates a form of diminishing returns, where large increases in model size and training are required to achieve incremental gains in factual knowledge capacity.

The researchers also found that the rate of knowledge acquisition slows over time, with the majority of facts being learned in the early stages of training. This suggests there may be inherent limits to how much factual knowledge can be encoded in the weights of a language model.

Critical Analysis

The researchers acknowledge several limitations to their study. First, the 100,000 fact dataset, while substantial, may not be comprehensive enough to fully capture the breadth of knowledge an LLM can acquire. More holistic knowledge evaluation may be needed.

Additionally, the experiments only tested the LLMs' ability to memorize and reproduce isolated facts, not their ability to reason about or apply that knowledge. The scaling laws observed may not extend to more complex forms of knowledge utilization.

Finally, the researchers note that the power law scaling they observed could break down for even larger LLMs or with alternative architectural approaches. Continued research will be needed to understand the limits and potential of LLMs' factual knowledge capacity.

Conclusion

This paper provides important insights into the scaling properties of large language models' factual knowledge capacity. The key finding is that an LLM's fact memorization ability grows as a power law with its size and training time, but with diminishing returns.

This suggests there may be inherent limits to how much factual knowledge can be encoded in the weights of a language model. As LLMs continue to grow larger, alternative approaches may be needed to imbue them with comprehensive and accurate knowledge.

Understanding these scaling laws can help researchers and developers better predict and plan for the factual knowledge capabilities of future LLMs. It also highlights the need for more holistic evaluations of LLM knowledge and the exploration of complementary knowledge acquisition techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Zeyuan Allen-Zhu, Yuanzhi Li

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include: * The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train. * Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.

4/9/2024

cs.CL cs.AI cs.LG

Temporal Scaling Law for Large Language Models

Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jianwei Niu, Guiguang Ding

Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pre-training process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters textit{directly} on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pre-training dynamics from the token position granularity provides some insights to enhance the understanding of LLM pre-training.

6/18/2024

cs.CL

🧠

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

4/26/2024

cs.CL cs.AI cs.LG

How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo

Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.

6/18/2024

cs.CL