Forgetting Order of Continual Learning: Examples That are Learned First are Forgotten Last

2406.09935

Published 6/17/2024 by Guy Hacohen, Tinne Tuytelaars

Abstract

Catastrophic forgetting poses a significant challenge in continual learning, where models often forget previous tasks when trained on new data. Our empirical analysis reveals a strong correlation between catastrophic forgetting and the learning speed of examples: examples learned early are rarely forgotten, while those learned later are more susceptible to forgetting. We demonstrate that replay-based continual learning methods can leverage this phenomenon by focusing on mid-learned examples for rehearsal. We introduce Goldilocks, a novel replay buffer sampling method that filters out examples learned too quickly or too slowly, keeping those learned at an intermediate speed. Goldilocks improves existing continual learning algorithms, leading to state-of-the-art performance across several image classification tasks.

Overview

This paper investigates which examples are forgotten first in continual learning, where a model is trained on a sequence of tasks.
The key finding is that forgetting correlates strongly with learning speed: examples learned early in training are rarely forgotten, while examples learned later are more susceptible to forgetting.
This "forgetting order" motivates Goldilocks, a replay buffer sampling method that filters out examples learned too quickly or too slowly.

Plain English Explanation

When machine learning models are trained on a sequence of tasks, they can struggle to remember information from earlier tasks as they learn new ones. This is known as the "catastrophic forgetting" problem in continual learning.

The authors of this paper looked closely at the order in which a model forgets the examples it has learned. They found that the examples a model picks up early in training are rarely forgotten, while the examples it only masters later in training are the most likely to be lost when new tasks arrive.

This "forgetting order" phenomenon is important because it suggests that the way a continual learning system is designed can significantly impact what information is retained over time. By understanding this effect, researchers may be able to develop better continual learning methods that are more robust to forgetting.

Technical Explanation

The paper presents a series of experiments that investigate the forgetting order in continual learning. According to the abstract, the evaluation covers several image classification tasks, with a neural network trained on a sequence of such tasks.

They find that, contrary to intuition, the examples the model learns first are forgotten last, while the examples learned late in training are forgotten first. The authors call this the "forgetting order" and provide theoretical analysis to explain this phenomenon.
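The forgetting order depends on having a per-example measure of learning speed. The sketch below shows one plausible way to compute it: score each example by the fraction of training epochs in which the model classified it correctly, so that examples learned early score close to 1. This definition, along with the names `learning_speed` and `forgotten_mask`, is an assumption made for illustration and is not taken from the paper.

```python
from typing import List, Sequence


def learning_speed(correct_per_epoch: Sequence[Sequence[bool]]) -> List[float]:
    """Score each example by the fraction of epochs in which it was correct.

    correct_per_epoch[e][i] is True if example i was classified correctly
    after epoch e. A score near 1.0 means the example was learned early;
    a score near 0.0 means it was learned late or never.
    (Hypothetical definition, used here only for illustration.)
    """
    num_epochs = len(correct_per_epoch)
    if num_epochs == 0:
        return []
    num_examples = len(correct_per_epoch[0])
    return [
        sum(1 for epoch in correct_per_epoch if epoch[i]) / num_epochs
        for i in range(num_examples)
    ]


def forgotten_mask(
    correct_before: Sequence[bool], correct_after: Sequence[bool]
) -> List[bool]:
    """Mark examples that were correct before a new task but wrong after it."""
    return [before and not after for before, after in zip(correct_before, correct_after)]


# Toy history: 4 epochs, 3 examples (learned at epoch 1, 2, and 3).
history = [
    [True, False, False],
    [True, True, False],
    [True, True, True],
    [True, True, True],
]
speeds = learning_speed(history)
forgotten = forgotten_mask(history[-1], [True, True, False])
```

Comparing `speeds` against `forgotten` across a dataset is one way to check whether slowly learned examples are forgotten more often, which is the relationship the paper reports.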

Specifically, the authors show that this forgetting order arises from the interaction between the model's representational capacity and the difficulty of the tasks. Early tasks, being simpler, can be learned using a small number of model parameters. As more complex tasks are introduced later in the sequence, the model needs to use more of its capacity to learn these new tasks, causing it to gradually forget the simpler earlier tasks.

The authors validate this finding across different neural network architectures, task sequences, and continual learning algorithms like experience replay and gradient episodic memory.
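The abstract describes Goldilocks as a replay buffer sampling method that filters out examples learned too quickly or too slowly and keeps those learned at an intermediate speed. The following is a minimal sketch of that idea, not the authors' implementation: the function name `goldilocks_sample`, the quantile-based cutoffs, and the default fractions of 0.2 are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence


def goldilocks_sample(
    speeds: Sequence[float],
    buffer_size: int,
    drop_fastest: float = 0.2,  # assumed fraction, not from the paper
    drop_slowest: float = 0.2,  # assumed fraction, not from the paper
    seed: int = 0,
) -> List[int]:
    """Pick replay-buffer indices from the mid-speed band of examples.

    speeds[i] is a learning-speed score for example i (higher = learned
    earlier). The slowest and fastest fractions are discarded, and the
    buffer is sampled uniformly from what remains.
    """
    if drop_fastest < 0 or drop_slowest < 0 or drop_fastest + drop_slowest >= 1:
        raise ValueError("drop fractions must be non-negative and sum to < 1")

    # Sort example indices from slowest-learned to fastest-learned.
    order = sorted(range(len(speeds)), key=lambda i: speeds[i])
    n = len(order)
    n_slow = int(n * drop_slowest)
    n_fast = int(n * drop_fastest)

    # Keep only the intermediate band.
    candidates = order[n_slow : n - n_fast]

    rng = random.Random(seed)
    k = min(buffer_size, len(candidates))
    return rng.sample(candidates, k)


# Ten examples with scores 0.0, 0.1, ..., 0.9; keep a buffer of four.
example_speeds = [i / 10 for i in range(10)]
buffer_indices = goldilocks_sample(example_speeds, buffer_size=4)
```

Because the sketch only selects indices, it could in principle sit in front of any replay-based method: the selected examples populate the buffer and the rest of the algorithm is unchanged.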

Critical Analysis

The paper provides a valuable contribution by uncovering an unexpected and counterintuitive phenomenon in continual learning. Understanding this forgetting order can help guide the development of more effective continual learning systems.

However, the abstract reports results only on image classification tasks, and it remains to be seen whether the findings generalize to other domains and to more complex real-world problems. The authors acknowledge that further research is needed to study the forgetting order in more realistic settings.

Additionally, the theoretical analysis provided in the paper makes simplifying assumptions, such as treating the model's representational capacity as a fixed quantity. In practice, the capacity of neural networks can be more dynamic and difficult to characterize.

Despite these limitations, the paper's insights open up interesting directions for future research on mitigating catastrophic forgetting in continual learning systems.

Conclusion

This paper uncovers a surprising finding in continual learning: examples that are learned first are forgotten last, contrary to intuition. The authors provide a theoretical explanation for this "forgetting order" phenomenon and validate it across different experimental settings.

Understanding this order of forgetting can inform the design of more robust continual learning algorithms that are better able to retain important information from earlier tasks as new tasks are learned. By building on these insights, researchers may be able to develop continual learning systems that are more practical and applicable to real-world problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang

Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge. As large language models (LLMs) have demonstrated remarkable performance, it is intriguing to investigate whether CF exists during the continual instruction tuning of LLMs. This study empirically evaluates the forgetting phenomenon in LLMs' knowledge during continual instruction tuning from the perspectives of domain knowledge, reasoning, and reading comprehension. The experiments reveal that catastrophic forgetting is generally observed in LLMs ranging from 1b to 7b parameters. Moreover, as the model scale increases, the severity of forgetting intensifies. Comparing the decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ exhibits less forgetting and retains more knowledge. Interestingly, we also observe that LLMs can mitigate language biases, such as gender bias, during continual fine-tuning. Furthermore, our findings indicate that ALPACA maintains more knowledge and capacity compared to LLAMA during continual fine-tuning, suggesting that general instruction tuning can help alleviate the forgetting phenomenon in LLMs during subsequent fine-tuning processes.

4/3/2024

cs.CL

Understanding Forgetting in Continual Learning with Linear Regression

Meng Ding, Kaiyi Ji, Di Wang, Jinhui Xu

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD) applicable to both underparameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence, where tasks with larger eigenvalues in their population data covariance matrices are trained later, tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size will help mitigate forgetting in both underparameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.

5/29/2024

cs.LG

CORE: Mitigating Catastrophic Forgetting in Continual Learning through Cognitive Replay

Jianshu Zhang, Yankai Fu, Ziheng Peng, Dongyu Yao, Kun He

This paper introduces a novel perspective to significantly mitigate catastrophic forgetting in continuous learning (CL), which emphasizes models' capacity to preserve existing knowledge and assimilate new information. Current replay-based methods treat every task and data sample equally and thus can not fully exploit the potential of the replay buffer. In response, we propose COgnitive REplay (CORE), which draws inspiration from human cognitive review processes. CORE includes two key strategies: Adaptive Quantity Allocation and Quality-Focused Data Selection. The former adaptively modulates the replay buffer allocation for each task based on its forgetting rate, while the latter guarantees the inclusion of representative data that best encapsulates the characteristics of each task within the buffer. Our approach achieves an average accuracy of 37.95% on split-CIFAR10, surpassing the best baseline method by 6.52%. Additionally, it significantly enhances the accuracy of the poorest-performing task by 6.30% compared to the top baseline. Code is available at https://github.com/sterzhang/CORE.

4/10/2024

cs.LG cs.AI

Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting

Timm Hess, Eli Verwimp, Gido M. van de Ven, Tinne Tuytelaars

Continual learning research has shown that neural networks suffer from catastrophic forgetting at the output level, but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting -- that they only forget minimally in comparison with forgetting at the output level. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Taking both aspects into account, we show that, even though forgetting in the representation (i.e. feature forgetting) can be small in absolute terms, when measuring relative to how much was learned during a task, forgetting in the representation tends to be just as catastrophic as forgetting at the output level. Next we show that this feature forgetting is problematic as it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.

6/26/2024

cs.LG cs.CV