Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How

2311.04898

Published 6/24/2024 by Timm Hess, Tinne Tuytelaars, Gido M. van de Ven

Abstract

Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. In search of empirical support for our proposition, we perform a series of pre-registered experiments combining replay-approximated joint objectives with gradient projection-based optimization routines. However, this first experimental attempt fails to show clear and consistent benefits. Nevertheless, our conceptual arguments, as well as some of our empirical results, demonstrate the distinctive importance of the optimization trajectory in continual learning, thereby opening up a new direction for continual learning research.

Overview

Recent research has made significant progress in continual training of deep neural networks, mostly through approaches that add replay or regularization terms to the loss function so that it approximates the joint loss over all tasks seen so far.
However, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task, even with a perfect approximation of the joint loss.
The paper argues that continual learning strategies should focus not only on the optimization objective, but also on the optimization trajectory.

Plain English Explanation

The paper discusses the challenges of continual learning, where deep neural networks are trained on a series of tasks over time. Existing approaches have made progress by modifying the loss function, for example by adding replay or regularization terms that help the model remember previous tasks. However, the paper shows that even with a perfect approximation of the joint loss across all tasks, these models still experience temporary but substantial forgetting of earlier tasks when they start to learn a new one.

The key insight is that continual learning strategies should consider not just the optimization objective, but also how that objective is optimized. Some previous work has explored altering the optimization trajectory, for example using gradient projection techniques, but this has been positioned as an alternative to improving the objective. The paper argues that the two should be treated as complementary: a better objective and a better way of optimizing it.

To support this idea, the researchers conducted experiments combining replay-based objectives with gradient projection-based optimization routines. While the initial results did not show clear benefits, the paper still makes a conceptual argument for the importance of the optimization trajectory in continual learning. This opens up a new direction for future research in this area.

Technical Explanation

The paper explores the challenge of continual learning, where deep neural networks are trained on a sequence of tasks over time. Existing approaches have made progress using techniques that add replay or regularization terms to the loss function to approximate the joint loss over all tasks. However, the authors show that even with a perfect approximation of the joint loss, these methods still suffer from temporary but substantial forgetting when starting to train on a new task.
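The notion of a joint loss can be written out as follows. This is a hedged sketch in notation chosen for this summary (the symbols $\theta$, $\mathcal{L}_k$, $\hat{\mathcal{L}}_k$ and the unweighted sum are assumptions, not taken from the paper): after $t$ tasks, the joint loss sums the per-task losses, and a replay-based objective replaces the losses of earlier tasks with estimates computed on stored or generated samples.

```latex
% Joint loss over all tasks seen so far (assumed unweighted sum).
\mathcal{L}_{\text{joint}}^{(t)}(\theta) = \sum_{k=1}^{t} \mathcal{L}_k(\theta)

% Replay-approximated objective while training on task t:
% \hat{\mathcal{L}}_k is the loss of task k estimated on replayed samples.
\tilde{\mathcal{L}}^{(t)}(\theta) = \mathcal{L}_t(\theta) + \sum_{k=1}^{t-1} \hat{\mathcal{L}}_k(\theta)

% A "perfect approximation" corresponds to
% \hat{\mathcal{L}}_k = \mathcal{L}_k for all k < t,
% so that \tilde{\mathcal{L}}^{(t)} = \mathcal{L}_{\text{joint}}^{(t)}.
```

In this notation, the paper's observation is that forgetting at the start of task $t$ can occur even when $\tilde{\mathcal{L}}^{(t)}$ equals $\mathcal{L}_{\text{joint}}^{(t)}$, which is why the objective alone does not settle the matter.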

Motivated by this "stability gap," the authors argue that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some prior work that alters the optimization trajectory, such as using gradient projection techniques, this has been positioned as an alternative to improving the optimization objective. The authors contend that the two lines of work should instead be complementary.
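Because the forgetting described here is temporary, it is only visible if accuracy on earlier tasks is tracked during training on the new task, not just at the end of it. The following is a minimal sketch of how such a drop could be quantified from a recorded accuracy curve. The function name `stability_gap`, its arguments, and the two returned quantities are hypothetical choices for illustration, not a metric defined in the paper.

```python
from typing import Sequence, Tuple


def stability_gap(acc_curve: Sequence[float], switch_idx: int) -> Tuple[float, float]:
    """Summarize forgetting of an old task around a task switch.

    acc_curve:  accuracy on the OLD task, evaluated at regular intervals
                (hypothetical input; e.g. once every few training iterations).
    switch_idx: index of the first evaluation made after training on the
                new task has started.

    Returns (transient_drop, final_drop):
      transient_drop = accuracy just before the switch minus the lowest
                       accuracy observed after it;
      final_drop     = accuracy just before the switch minus the last
                       recorded accuracy.
    """
    if not 0 < switch_idx < len(acc_curve):
        raise ValueError("switch_idx must leave evaluations on both sides of the switch")
    before = acc_curve[switch_idx - 1]
    after = list(acc_curve[switch_idx:])
    transient_drop = before - min(after)
    final_drop = before - after[-1]
    return transient_drop, final_drop


# Made-up curve for illustration: accuracy dips right after the switch, then recovers.
curve = [0.90, 0.95, 0.60, 0.80, 0.94]
transient, final = stability_gap(curve, switch_idx=2)
print(f"transient drop: {transient:.2f}, final drop: {final:.2f}")
```

A large transient drop combined with a small final drop is the pattern the text describes: evaluating only at the end of the task would report little forgetting and miss the dip.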

To provide empirical support for this proposition, the researchers conducted a series of pre-registered experiments that combined replay-approximated joint objectives with gradient projection-based optimization routines. However, this initial attempt did not clearly demonstrate the benefits of this approach.
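To make the idea of combining the two perspectives concrete, here is a minimal sketch in plain Python. It assumes an A-GEM-style projection rule (remove the component of the update that conflicts with the gradient on replayed data) applied to the gradient of a replay-approximated joint objective. The function names, the unweighted sum of gradients, and the learning rate are illustrative assumptions; the paper's actual routines may differ.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_gradient(g: Vector, g_ref: Vector) -> Vector:
    """A-GEM-style projection (assumed rule, for illustration).

    If g has a negative inner product with g_ref, a step along -g would
    increase the reference loss to first order. In that case, subtract the
    component of g along g_ref so the result is orthogonal to g_ref.
    Otherwise return g unchanged.
    """
    d = dot(g, g_ref)
    ref_sq = dot(g_ref, g_ref)
    if d >= 0.0 or ref_sq == 0.0:
        return list(g)
    scale = d / ref_sq
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]


def combined_step(theta: Vector, grad_new: Vector, grad_replay: Vector,
                  lr: float = 0.1) -> Vector:
    """One update that changes both WHAT is optimized and HOW.

    What: the gradient of the replay-approximated joint objective
          (here simply new-task gradient plus replay gradient).
    How:  that gradient is projected against the replay gradient
          before the parameters are updated.
    """
    g_joint = [a + b for a, b in zip(grad_new, grad_replay)]
    g_proj = project_gradient(g_joint, grad_replay)
    return [t - lr * g for t, g in zip(theta, g_proj)]


# Toy example with made-up gradients: the new-task gradient dominates and
# pulls the joint gradient into conflict with the replay gradient.
theta = combined_step([0.0, 0.0], grad_new=[1.0, -3.0], grad_replay=[0.0, 1.0])
print(theta)
```

In the toy example the joint gradient alone would move the parameters in a direction that increases the replayed loss to first order; the projection removes that component while leaving the rest of the update intact.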

Nevertheless, the conceptual arguments in the paper, as well as some of the empirical results, highlight the distinctive importance of the optimization trajectory in continual learning. This opens up a new direction for continual learning research.

Critical Analysis

The paper makes a compelling case for the importance of considering the optimization trajectory in addition to the optimization objective for continual learning. While the initial experiments did not clearly demonstrate the benefits of this approach, the conceptual arguments are well-reasoned and open up an interesting new direction for future research.

One potential limitation of the work is the scope of the experiments, which may not have covered the conditions under which the proposed combination helps. The authors themselves describe this as a first experimental attempt, and further investigation is needed to determine when changing the optimization trajectory provides clear advantages over improving the objective alone.

Additionally, the paper does not delve into the potential computational and implementation challenges of combining a modified objective with a modified optimization routine, which could be an important practical consideration for deploying these techniques in real-world applications.

Overall, the paper makes a valuable contribution by highlighting the distinctive importance of the optimization trajectory in continual learning and encouraging researchers to explore this direction further.

Conclusion

This paper argues that continual learning strategies should focus not only on the optimization objective, but also on the optimization trajectory. While existing approaches have made progress by modifying the loss function, the authors show that these methods still suffer from temporary but substantial forgetting when starting to train on a new task, even when the joint loss is approximated perfectly.

The key insight is that how the objective is optimized, and not only what is optimized, matters for forgetting. Although the first experiments combining the two did not show clear and consistent benefits, this perspective opens up a new direction for future research, potentially leading to deep neural networks that can keep adapting to new information with less forgetting.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On the Convergence of Continual Learning with Adaptive Methods

Seungyub Han, Yeongmo Kim, Taehyun Cho, Jungwoo Lee

One of the objectives of continual learning is to prevent catastrophic forgetting in learning multiple tasks sequentially, and the existing solutions have been driven by the conceptualization of the plasticity-stability dilemma. However, the convergence of continual learning for each sequential task is less studied so far. In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descent and empirical evidence that training current tasks causes the cumulative degradation of previous tasks. We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts step sizes of both previous and current tasks with the gradients. The proposed method can achieve the same convergence rate as the SGD method when the catastrophic forgetting term which we define in the paper is suppressed at each iteration. Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods for several image classification tasks.

4/16/2024

cs.LG cs.AI stat.ML

Primal Dual Continual Learning: Balancing Stability and Plasticity through Adaptive Memory Allocation

Juan Elenter, Navid NaderiAlizadeh, Tara Javidi, Alejandro Ribeiro

Continual learning is inherently a constrained learning problem. The goal is to learn a predictor under a no-forgetting requirement. Although several prior studies formulate it as such, they do not solve the constrained problem explicitly. In this work, we show that it is both possible and beneficial to undertake the constrained optimization problem directly. To do this, we leverage recent results in constrained learning through Lagrangian duality. We focus on memory-based methods, where a small subset of samples from previous tasks can be stored in a replay buffer. In this setting, we analyze two versions of the continual learning problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level. We show that dual variables indicate the sensitivity of the optimal value of the continual learning problem with respect to constraint perturbations. We then leverage this result to partition the buffer in the coarse approach, allocating more resources to harder tasks, and to populate the buffer in the fine approach, including only impactful samples. We derive a deviation bound on dual variables as sensitivity indicators, and empirically corroborate this result in diverse continual learning benchmarks. We also discuss the limitations of these methods with respect to the amount of memory available and the expressiveness of the parametrization.

6/3/2024

cs.LG cs.AI eess.SP

Continual Learning of Numerous Tasks from Long-tail Distributions

Liwei Kang, Wee Sun Lee

Continual learning, an important aspect of artificial intelligence and machine learning research, focuses on developing models that learn and adapt to new tasks while retaining previously acquired knowledge. Existing continual learning algorithms usually involve a small number of tasks with uniform sizes and may not accurately represent real-world learning scenarios. In this paper, we investigate the performance of continual learning algorithms with a large number of tasks drawn from a task distribution that is long-tail in terms of task sizes. We design one synthetic dataset and two real-world continual learning datasets to evaluate the performance of existing algorithms in such a setting. Moreover, we study an overlooked factor in continual learning, the optimizer states, e.g. first and second moments in the Adam optimizer, and investigate how it can be used to improve continual learning performance. We propose a method that reuses the optimizer states in Adam by maintaining a weighted average of the second moments from previous tasks. We demonstrate that our method, compatible with most existing continual learning algorithms, effectively reduces forgetting with only a small amount of additional computational or memory costs, and provides further improvements on existing continual learning algorithms, particularly in a long-tail task sequence.

4/4/2024

cs.LG

Provable Contrastive Continual Learning

Yichen Wen, Zhiquan Tan, Kaipeng Zheng, Chuanlong Xie, Weiran Huang

Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill this gap by establishing theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks in the contrastive continual learning framework. Our theoretical explanations further support the idea that pre-training can benefit continual learning. Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA, which uses adaptive distillation coefficients for different tasks. These distillation coefficients are easily computed by the ratio between average distillation losses and average contrastive losses from previous tasks. Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance.

5/30/2024

cs.LG cs.AI cs.CV stat.ML