Gradient Projection For Parameter-Efficient Continual Learning

2405.13383

Published 5/24/2024 by Jingyang Qiao, Zhizhong Zhang, Xin Tan, Yanyun Qu, Wensheng Zhang, Yuan Xie

Abstract

Catastrophic forgetting poses the primary challenge in continual learning. Nowadays, methods based on parameter-efficient tuning (PET) have demonstrated impressive performance in continual learning. However, these methods are still confronted with a common problem: fine-tuning on consecutive distinct tasks can disrupt the existing parameter distribution and lead to forgetting. Recent progress has mainly focused on empirically designing efficient tuning engineering, lacking investigation of the forgetting generation mechanism and anti-forgetting criteria, and lacking theoretical support. Additionally, the unresolved trade-off between learning new content and protecting old knowledge further complicates these challenges. The gradient projection methodology restricts gradient updates to the direction orthogonal to the old feature space, preventing the parameter distribution from being damaged during updating and significantly suppressing forgetting. Building on it, in this paper, we reformulate Adapter, LoRA, Prefix, and Prompt for the continual learning setting from the perspective of gradient projection, and propose a unified framework called Parameter Efficient Gradient Projection (PEGP). Based on the hypothesis that old tasks should have the same results after the model is updated, we introduce orthogonal gradient projection into different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting in PET-based continual methods. Notably, PEGP is the first unified method to provide an anti-forgetting mechanism with mathematical demonstration for different tuning paradigms. We extensively evaluate our method with different backbones on diverse datasets, and experiments demonstrate its efficiency in reducing forgetting in various incremental settings.

Overview

Continual learning is a major challenge in machine learning, where models need to learn new tasks without forgetting previous knowledge.
Methods based on parameter-efficient tuning (PET) have shown promising results, but fine-tuning on consecutive tasks can still disrupt the existing parameter distribution and cause forgetting.
The paper proposes a unified framework called Parameter Efficient Gradient Projection (PEGP) that introduces orthogonal gradient projection to different PET paradigms to effectively resist forgetting.

Plain English Explanation

Machine learning models often need to learn new tasks over time, but this can cause them to forget what they previously learned, a problem known as "catastrophic forgetting." Recent approaches based on parameter-efficient tuning (PET) have shown good results, but fine-tuning on a sequence of different tasks can still disturb the existing parameters and cause forgetting.

The key idea in this paper is to use "gradient projection" - a mathematical technique that restricts the updates to the model's parameters in a way that preserves the knowledge from previous tasks. The authors take this concept and apply it to different PET methods, creating a unified framework called PEGP. The underlying hypothesis is that if the model's outputs for old tasks remain the same after updating, it will be able to avoid forgetting.

By incorporating this orthogonal gradient projection into the PET approaches Adapter, LoRA, Prefix, and Prompt, the authors demonstrate that PEGP can effectively resist forgetting in a wide range of continual learning scenarios.

Technical Explanation

The paper proposes a unified framework called Parameter Efficient Gradient Projection (PEGP) that builds on the gradient projection methodology. Gradient projection restricts gradient updates to directions orthogonal to the old feature space, which prevents the parameter distribution from being disrupted during updates and significantly suppresses forgetting.
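The projection idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single linear map `y = x @ W`, builds the old feature space from an SVD of stored old-task features, and uses a hypothetical relative tolerance `tol` to decide how many singular directions to keep. All function and variable names are illustrative.

```python
import numpy as np


def old_feature_basis(features, tol=1e-8):
    """Orthonormal basis (as columns) for the space spanned by old-task features.

    features: array of shape (n_samples, d), one old-task input feature per row.
    tol: hypothetical relative threshold; singular directions whose singular
         value is below tol * (largest singular value) are discarded.
    """
    # The right singular vectors span the row space of `features`,
    # i.e. the directions in R^d occupied by old-task inputs.
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))
    return vt[:k].T  # shape (d, k)


def project_gradient(grad, basis):
    """Remove the component of `grad` that lies in the old feature space.

    grad: array of shape (d, m), gradient of a weight used as y = x @ W.
    basis: array of shape (d, k) with orthonormal columns.
    """
    return grad - basis @ (basis.T @ grad)


# Toy setup (assumed for illustration): old features lie in a 3-dim subspace of R^8.
rng = np.random.default_rng(0)
old_features = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 8))
w = rng.normal(size=(8, 4))     # current weight
grad = rng.normal(size=(8, 4))  # stand-in for a new-task gradient

basis = old_feature_basis(old_features)
projected = project_gradient(grad, basis)

lr = 0.1
w_new = w - lr * projected  # update along the projected gradient only

# Largest change in old-task outputs caused by the update (numerically near zero).
print(np.abs(old_features @ w_new - old_features @ w).max())
```

In this toy setting the old-task outputs `old_features @ W` are unchanged up to floating-point error, while the weight still moves in the remaining directions. A real network is nonlinear and the stored features only approximate the old input distribution, so the guarantee there is weaker than in this linear example.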

The authors reformulate several PET methods - Adapter, LoRA, Prefix, and Prompt - into the continual learning setting from the perspective of gradient projection. They introduce orthogonal gradient projection into these different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting in PET-based continual learning methods.
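The link between the "same results on old tasks" hypothesis and the orthogonality condition can be seen in a simplified case. The derivation below is a sketch for a single linear map with tunable parameter $W$ and a gradient-descent step $\Delta W = -\eta G$; the symbols $x_{\text{old}}$, $W$, $G$, and $\eta$ are assumptions introduced here for illustration, and this is not the paper's derivation for each individual PET paradigm.

```latex
\begin{align}
f(x_{\text{old}}; W) &= x_{\text{old}} W, \\
f(x_{\text{old}}; W + \Delta W) &= x_{\text{old}} W + x_{\text{old}} \Delta W, \\
f(x_{\text{old}}; W + \Delta W) = f(x_{\text{old}}; W)
  &\iff x_{\text{old}} \Delta W = 0
  \iff x_{\text{old}} G = 0 \quad (\Delta W = -\eta G,\ \eta > 0).
\end{align}
```

In words: in this linear case the old-task output is preserved exactly when every column of the gradient is orthogonal to the old-task features, which is the condition that gradient projection enforces.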

PEGP is the first unified method to provide an anti-forgetting mechanism with mathematical proof for these diverse tuning paradigms. The authors extensively evaluate their method using different model backbones and datasets, and the experiments show PEGP's efficiency in reducing forgetting across various incremental learning settings.

Critical Analysis

The paper makes a significant contribution by providing a unified framework that can be applied to multiple PET methods to address the challenge of catastrophic forgetting in continual learning. The theoretical analysis and mathematical demonstration of the anti-forgetting mechanism are particular strengths of the work.

However, the paper does not fully explore the limitations of the PEGP approach. For example, it would be valuable to understand how PEGP behaves on much longer task sequences, or how it scales to large-scale models and datasets. Additionally, the authors could have discussed the trade-off between the computational overhead of gradient projection and the benefit of reduced forgetting.

Further research could also investigate the interplay between PEGP and other continual learning techniques, such as adaptive methods or mixture-of-experts approaches. Combining PEGP with complementary methods may lead to even more robust and effective continual learning solutions.

Conclusion

The Parameter Efficient Gradient Projection (PEGP) framework proposed in this paper represents a significant step forward in addressing the challenge of catastrophic forgetting in continual learning. By incorporating orthogonal gradient projection into various PET methods, the authors have developed a unified approach that can effectively resist forgetting while maintaining performance on new tasks.

The theoretical analysis and experimental results demonstrate the effectiveness of PEGP, which has the potential to be a valuable tool for researchers and practitioners working on continual learning problems. While the paper does not fully explore all the limitations of the approach, it provides a strong foundation for future work in this important area of machine learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

Mohamed Elsayed, A. Rupam Mahmood

Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues.

5/2/2024

cs.LG cs.AI

ColA: Collaborative Adaptation with Gradient Learning

Enmao Diao, Qi Le, Suya Wu, Xinran Wang, Ali Anwar, Jie Ding, Vahid Tarokh

A primary function of back-propagation is to compute both the gradient of hidden representations and parameters for optimization with gradient descent. Training large models requires high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods aim to train smaller auxiliary models to save computational space, they still present computational overheads, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradient of hidden representations and parameters. In comparison to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of the gradient to low-cost devices. We also provide a theoretical analysis of ColA and experimentally demonstrate that ColA can perform on par or better than existing PEFT methods on various benchmarks.

4/23/2024

cs.LG cs.AI

Understanding Forgetting in Continual Learning with Linear Regression

Meng Ding, Kaiyi Ji, Di Wang, Jinhui Xu

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD) applicable to both underparameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence, where tasks with larger eigenvalues in their population data covariance matrices are trained later, tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size will help mitigate forgetting in both underparameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.

5/29/2024

cs.LG

Visual Prompt Tuning in Null Space for Continual Learning

Yue Lu, Shizhou Zhang, De Cheng, Yinghui Xing, Nannan Wang, Peng Wang, Yanning Zhang

Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.

6/12/2024

cs.CV cs.AI