Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

2405.15319

Published 5/27/2024 by Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu


Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io/.


Overview

Large language models (LLMs) are computationally expensive to pre-train due to their large scale.
Model growth is a promising approach to accelerate the training of larger LLMs by leveraging smaller models.
However, the viability of these model growth methods in efficient LLM pre-training remains underexplored.

Plain English Explanation

Building large, powerful language models is a computationally intensive process that requires a lot of computing power and time. Researchers have explored a technique called "model growth" as a way to speed up this process. The idea is to start with smaller, simpler models and then gradually expand and refine them to create larger, more capable models.

This paper examines three key challenges that need to be addressed to make model growth a viable approach for efficiently pre-training large language models:

Lack of comprehensive evaluation: There hasn't been a thorough, standardized evaluation of the different model growth techniques that have been proposed.
Untested viability for scaling: It's unclear whether these model growth methods can actually work when scaling up to very large language models.
Lack of empirical guidelines: There's a need for clear guidance on how to best apply these model growth techniques in practice.

The researchers tackle these challenges by systematically evaluating different model growth operators, conducting extensive experiments to understand the scalability of the most promising approach, and developing empirical guidelines to make it practical for general large language model pre-training.

Technical Explanation

The researchers summarize existing model growth approaches into four atomic growth operators and evaluate them in a standardized LLM pre-training setting. They find that a depthwise stacking operator, called $G_{\text{stack}}$, accelerates training markedly, reaching lower loss and better overall performance on eight standard NLP benchmarks than strong baselines.
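To make the idea of depthwise stacking concrete, here is a minimal sketch in plain Python. It assumes that stacking means repeating the trained small model's whole sequence of layers a fixed number of times to form a deeper model; the function name `stack_depthwise`, the dictionary layer representation, and the repetition order are illustrative assumptions, not the authors' implementation.

```python
from copy import deepcopy


def stack_depthwise(layers, growth_factor):
    """Return a deeper layer list made of `growth_factor` copies of `layers`.

    Hypothetical helper: `layers` stands in for the trained blocks of a
    small model, in order from input to output.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    grown = []
    for _ in range(growth_factor):
        # Deep-copy each layer so every copy owns its parameters and can
        # diverge from its siblings during continued training.
        grown.extend(deepcopy(layer) for layer in layers)
    return grown


# Toy "layers": each dict stands in for one transformer block's weights.
small = [
    {"name": "block0", "w": [0.1, 0.2]},
    {"name": "block1", "w": [0.3, 0.4]},
]
large = stack_depthwise(small, growth_factor=2)
print([layer["name"] for layer in large])
# → ['block0', 'block1', 'block0', 'block1']
```

In a real framework the layer list would hold module objects rather than dictionaries, but the copy-and-concatenate step would look much the same.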

Motivated by these results, the researchers run extensive experiments on the scalability of $G_{\text{stack}}$. Their study shows that the operator scales and performs consistently well, with experiments covering models of up to 7B parameters after growth and pre-training runs of up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, their $G_{\text{stack}}$ model converges to the same loss with only 194B tokens, a 54.6% speedup.
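The 54.6% figure is consistent with measuring speedup as the ratio of baseline tokens to stacked-model tokens, minus one. The short check below assumes that definition (the summary does not state it explicitly); `token_speedup` and `tokens_saved` are hypothetical helper names.

```python
def token_speedup(baseline_tokens, grown_tokens):
    """Relative speedup, assumed here to be baseline / grown - 1."""
    return baseline_tokens / grown_tokens - 1.0


def tokens_saved(baseline_tokens, grown_tokens):
    """Fraction of the baseline token budget that is no longer needed."""
    return 1.0 - grown_tokens / baseline_tokens


baseline = 300e9  # tokens used by the conventionally trained 7B model
stacked = 194e9   # tokens at which the stacked model reaches the same loss

print(f"speedup: {token_speedup(baseline, stacked):.1%}")
# → speedup: 54.6%
print(f"tokens saved: {tokens_saved(baseline, stacked):.1%}")
# → tokens saved: 35.3%
```

The two numbers describe the same result from different angles: the stacked model needs about 35% fewer tokens, which is the same as saying the baseline needs about 55% more.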

The researchers also provide empirical guidelines for determining the optimal growth timing and growth factor for the $G_{\text{stack}}$ operator, making it more practical for general large language model pre-training. They include comprehensive ablation studies and discussions of the $G_{\text{stack}}$ operator to further understand its inner workings.
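The two knobs can be read as follows: the growth factor is how many times deeper the model becomes after stacking, and the growth timing is how many tokens the small model sees before it is grown. The sketch below turns that reading into a simple two-phase plan. It is an assumption-laden illustration: `GrowthPlan`, `plan_growth`, the choice to count the small model's tokens inside the total budget, and all numeric values are hypothetical and are not the paper's recommended settings.

```python
from dataclasses import dataclass


@dataclass
class GrowthPlan:
    small_layers: int    # depth of the model trained before growth
    small_tokens: float  # tokens seen before growth (growth timing)
    grown_layers: int    # depth after stacking
    grown_tokens: float  # tokens left for training the grown model


def plan_growth(target_layers, growth_factor, growth_timing_tokens, total_tokens):
    """Split a token budget into a small-model phase and a grown-model phase."""
    if growth_factor < 1 or target_layers % growth_factor != 0:
        raise ValueError("target_layers must be a positive multiple of growth_factor")
    if not 0 <= growth_timing_tokens <= total_tokens:
        raise ValueError("growth timing must fall within the total token budget")
    return GrowthPlan(
        small_layers=target_layers // growth_factor,
        small_tokens=growth_timing_tokens,
        grown_layers=target_layers,
        grown_tokens=total_tokens - growth_timing_tokens,
    )


# Placeholder numbers chosen only to exercise the function.
plan = plan_growth(
    target_layers=32,
    growth_factor=4,
    growth_timing_tokens=10e9,
    total_tokens=300e9,
)
print(plan)
```

Framing the schedule this way makes the trade-off visible: growing later gives the small model more cheap training, but leaves fewer tokens for the full-depth model.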

Critical Analysis

The paper provides a thorough and rigorous evaluation of model growth techniques for accelerating large language model pre-training. The researchers have identified key challenges in this area and designed experiments to systematically address them. The results demonstrating the scalability and effectiveness of the $G_{\text{stack}}$ operator are quite promising.

However, the paper does not delve into the potential limitations or caveats of the proposed approach. For example, it would be helpful to understand how the $G_{\text{stack}}$ operator performs on specific tasks or datasets, and how it compares to other efficiency techniques from related work, such as Masked Structural Growth, inference-efficiency methods, sparse LLMs, or sheared LLMs. Additionally, the paper does not discuss the computational and memory requirements of the $G_{\text{stack}}$ operator, which could be an important consideration for practical deployments.

Conclusion

This paper presents a comprehensive study of model growth techniques for accelerating the pre-training of large language models. The researchers have identified key challenges in this area and systematically evaluated different growth operators, with the $G_{\text{stack}}$ operator showing remarkable performance improvements. The extensive experiments on scalability and the development of empirical guidelines make this work a valuable contribution to the field of efficient large language model development.

The findings in this paper could have significant implications for the research and development of next-generation language models, potentially leading to more cost-effective and accessible approaches for building powerful AI systems. However, further research is needed to fully understand the limitations and practical considerations of the proposed techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Landscape-Aware Growing: The Power of a Little LAG

Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar

Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call landscape-aware growing (LAG). We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small lag after initialization). This perspective also motivates an adaptive strategy for gradual stacking.

6/5/2024

cs.LG cs.CL

A Multi-Level Framework for Accelerating Training Transformer Models

Longwei Zou, Han Zhang, Yangdong Deng

The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained to fast convergence, and its trained parameters provide high-quality intermediate solutions for the next-level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.

4/15/2024

cs.LG cs.CL


Masked Structural Growth for 2x Faster Language Model Pre-training

Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang

Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performances. Code is publicly available at https://github.com/cofe-ai/MSG.

4/9/2024

cs.CL


Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Georgy Tyukin

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the performance of larger models, but with a reduced cost of running them. In this thesis we explore the methods of model compression, and we empirically demonstrate that the simple method of skipping latter attention sublayers in Transformer LLMs is an effective method of model compression, as these layers prove to be redundant, whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst surprisingly and unexpectedly improving performance over several common benchmarks.

4/10/2024

cs.LG cs.AI cs.CL cs.PF