Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

Read original: arXiv:2404.16897 - Published 4/29/2024 by Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng

🧪

Overview

The paper introduces a framework called Learngene, which learns a compact "learngene" part from a large, well-trained model and then expands it to initialize variable-sized models.
The paper proposes a specific Learngene approach called SWS (Stage-wise Weight Sharing) that helps provide guidance for initializing models at varying scales.
SWS achieves better performance compared to training models from scratch, while reducing total training costs by around 6.6x.
When initializing variable-sized models, SWS achieves better results while reducing the parameters stored to initialize these models by around 20x and the pre-training costs by around 10x, compared to the pre-training and fine-tuning approach.

Plain English Explanation

The paper discusses a way to build variable-sized models that can adapt to different resource constraints in various application scenarios. The key step is weight initialization before training the models.

The researchers introduce a framework called Learngene, which first learns a compact "learngene" part from a large, well-trained model. This learngene is then expanded to initialize variable-sized models of different sizes.

The paper proposes a specific Learngene approach called SWS (Stage-wise Weight Sharing). In SWS, the researchers build an auxiliary model with multiple stages, where the layer weights are shared across the stages. This model is then trained through distillation. The learngene layers containing the stage information are then expanded to initialize variable-sized models of different depths.

The key idea is that the stage information in the learngene layers provides guidance for initializing the models at varying scales, leading to better performance compared to training from scratch. This approach also significantly reduces the total training costs, as well as the parameters and pre-training costs needed for initializing variable-sized models adapted for different resource constraints.

Technical Explanation

The paper introduces the Learngene framework, which learns a compact "learngene" part from a large, well-trained model, and then expands this learngene to initialize variable-sized models.

The researchers propose a specific Learngene approach called SWS (Stage-wise Weight Sharing). In SWS, they build an auxiliary model comprising multiple stages, where the layer weights in each stage are shared. They then train this auxiliary model through distillation, using a large, well-trained model as the teacher.

The key innovation in SWS is that the learngene layers contain stage information, which provides guidance for initializing variable-sized models of different depths. The researchers expand these learngene layers with stage information to initialize the models at varying scales.

Extensive experiments on the ImageNet-1K dataset show that SWS achieves consistent better performance compared to many models trained from scratch, while reducing the total training costs by around 6.6x. In some cases, SWS performs better after just 1 epoch of fine-tuning.

When initializing variable-sized models adapted for different resource constraints, SWS achieves better results while reducing the parameters stored to initialize these models by around 20x and the pre-training costs by around 10x, in contrast to the pre-training and fine-tuning approach.

Critical Analysis

The paper presents a novel and effective approach, SWS, for initializing variable-sized models adapted for different resource constraints. The key strengths of the approach are the stage information in the learngene layers and the use of distillation to train the auxiliary model.

One potential limitation is that the paper only evaluates SWS on the ImageNet-1K dataset. It would be interesting to see how the approach performs on other diverse datasets and application domains, such as medical image segmentation or online transfer learning.

Additionally, the paper does not provide much insight into the analysis of the classifier-free guidance or the weight schedulers used in the experiments. Further investigation into these aspects could yield additional insights and potentially lead to further improvements.

Overall, the SWS approach presented in the paper is a promising contribution to the field of variable-sized model initialization, and the researchers have demonstrated its effectiveness through extensive experiments.

Conclusion

The paper introduces the Learngene framework and proposes a specific approach called SWS (Stage-wise Weight Sharing) for initializing variable-sized models adapted for different resource constraints.

The key innovation in SWS is the use of stage information in the learngene layers, which provides guidance for initializing models at varying scales. This approach achieves better performance compared to training from scratch, while significantly reducing the total training costs, the parameters stored, and the pre-training costs needed for initializing variable-sized models.

The paper demonstrates the effectiveness of the SWS approach through extensive experiments on the ImageNet-1K dataset and highlights its potential for diverse application scenarios where variable-sized models are required.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🧪

Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng

In practice, we usually need to build variable-sized models adapting for diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The Learngene framework, introduced recently, firstly learns one compact part termed as learngene from a large well-trained model, after which learngene is expanded to initialize variable-sized models. In this paper, we start from analysing the importance of guidance for the expansion of well-trained learngene layers, inspiring the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers containing stage information at their corresponding stage to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistent better performance compared to many models trained from scratch, while reducing around 6.6x total training costs. In some cases, SWS performs better only after 1 epoch tuning. When initializing variable-sized models adapting for different resource constraints, SWS achieves better results while reducing around 20x parameters stored to initialize these models and around 10x pre-training costs, in contrast to the pre-training and fine-tuning approach.

4/29/2024

WAVE: Weight Template for Adaptive Initialization of Variable-sized Models

Fu Feng, Yucheng Xie, Jing Wang, Xin Geng

The expansion of model parameters underscores the significance of pre-trained models; however, the constraints encountered during model deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address the initialization problem when target models are incompatible with pre-trained models. We tackle this issue from a multitasking perspective and introduce textbf{WAVE}, which incorporates a set of shared textbf{W}eight templates for textbf{A}daptive initialization of textbf{V}ariable-siztextbf{E}d Models. During initialization, target models will initialize the corresponding weight scalers tailored to their model size, which are sufficient to learn the connection rules of weight templates based on the Kronecker product from a limited amount of data. For the construction of the weight templates, WAVE utilizes the textit{Learngene} framework, which structurally condenses common knowledge from ancestry models into weight templates as the learngenes through knowledge distillation. This process allows the integration of pre-trained models' knowledge into structured knowledge according to the rules of weight templates. We provide a comprehensive benchmark for the learngenes, and extensive experiments demonstrate the efficacy of WAVE. The results show that WAVE achieves state-of-the-art performance when initializing models with various depth and width, and even outperforms the direct pre-training of $n$ entire models, particularly for smaller models, saving approximately $ntimes$ and $5times$ in computational and storage resources, respectively. WAVE simultaneously achieves the most efficient knowledge transfer across a series of datasets, specifically achieving an average improvement of 1.8% and 1.2% on 7 downstream datasets.

7/16/2024

Weight Scope Alignment: A Frustratingly Easy Method for Model Merging

Yichu Xu, Xin-Chun Li, Le Gan, De-Chuan Zhan

Merging models becomes a fundamental procedure in some applications that consider model efficiency and robustness. The training randomness or Non-I.I.D. data poses a huge challenge for averaging-based model fusion. Previous research efforts focus on element-wise regularization or neural permutations to enhance model averaging while overlooking weight scope variations among models, which can significantly affect merging effectiveness. In this paper, we reveal variations in weight scope under different training conditions, shedding light on its influence on model merging. Fortunately, the parameters in each layer basically follow the Gaussian distribution, which inspires a novel and simple regularization approach named Weight Scope Alignment (WSA). It contains two key components: 1) leveraging a target weight scope to guide the model training process for ensuring weight scope matching in the subsequent model merging. 2) fusing the weight scope of two or more models into a unified one for multi-stage model fusion. We extend the WSA regularization to two different scenarios, including Mode Connectivity and Federated Learning. Abundant experimental studies validate the effectiveness of our approach.

8/23/2024

Weights Augmentation: it has never ever ever ever let her model down

Junbin Zhuang, Guiguang Din, Yunyi Yan

Weight play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of Weight Augmentation Strategy (WAS) is to adopt random transformed weight coefficients training and transformed coefficients, named Shadow Weight(SW), for networks that can be used to calculate loss function to affect parameter updates. However, stochastic gradient descent is applied to Plain Weight(PW), which is referred to as the original weight of the network before the random transformation. During training, numerous SW collectively form high-dimensional space, while PW is directly learned from the distribution of SW instead of the data. The weight of the accuracy-oriented mode(AOM) relies on PW, which guarantees the network is highly robust and accurate. The desire-oriented mode(DOM) weight uses SW, which is determined by the network model's unique functions based on WAT's performance desires, such as lower computational complexity, lower sensitivity to particular data, etc. The dual mode be switched at anytime if needed. WAT extends the augmentation technique from data augmentation to weight, and it is easy to understand and implement, but it can improve almost all networks amazingly. Our experimental results show that convolutional neural networks, such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobilementV2, and Efficientment-Lite, can benefit much at little or no cost. The accuracy of models is on the CIFAR100 and CIFAR10 datasets, which can be evaluated to increase by 7.32% and 9.28%, respectively, with the highest values being 13.42% and 18.93%, respectively. In addition, DOM can reduce floating point operations (FLOPs) by up to 36.33%. The code is available at https://github.com/zlearh/Weight-Augmentation-Technology.

5/31/2024