Masked Structural Growth for 2x Faster Language Model Pre-training

2305.02869

Published 4/9/2024 by Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang

Abstract

Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performances. Code is publicly available at https://github.com/cofe-ai/MSG.

Overview

  • This paper focuses on speeding up the pre-training of large language models (LLMs) by progressively growing the model from a small Transformer structure to a larger one.
  • The key challenges addressed are determining the optimal growth schedule and designing efficient growth operators.
  • The authors propose Masked Structural Growth (MSG), which includes growth schedules involving all possible dimensions and strictly function-preserving growth operators.
  • Experiments show that MSG can achieve up to 2.2x speedup in pre-training different types of LLMs while maintaining comparable or better downstream performance.

Plain English Explanation

Large language models such as SOLAR-10.7B are powerful tools for generative software engineering and other applications, but pre-training them is very time-consuming. This research aims to speed up pre-training by starting with a small Transformer model and gradually growing it into a larger one.

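The key challenges are figuring out when and along which dimensions to grow the model (the "growth schedule") and designing the growth step so that the new, larger model still behaves like the old one (the "growth operators"). Previous methods have not systematically studied how each growth dimension affects a schedule's efficiency, and their operators preserve the model's function only approximately.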

The researchers propose a new approach called Masked Structural Growth (MSG) that addresses these issues. MSG has two main components:

  1. Growth schedules that cover every dimension along which the model can grow, rather than just a few.
  2. Growth operators that strictly preserve the function of the model, so the larger model produces the same outputs as the smaller one at the moment of growth, regardless of how the new weights are initialized.

Experiments show that MSG can pre-train different types of language models up to 2.2 times faster, while maintaining comparable or better performance on downstream tasks.

Technical Explanation

The key research problems addressed in this paper are:

  1. Determining the optimal growth schedule: Existing work has not fully explored the impact of changing different dimensions of the model (e.g., depth, width, attention heads) on the overall efficiency of the growth schedule.

  2. Designing efficient growth operators: Previous methods rely on initializing new weights to inherit knowledge, which leads to only non-strict function preservation. This limits further improvements to the training dynamics.
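One way to see why initialization alone may fall short of strict preservation is to look at layer normalization, whose statistics depend on the width of its input. The sketch below is an illustration only, not a reproduction of any specific prior method: it assumes that new hidden features start at zero (as they would under zero-initialized new weights) and shows that the normalized values of the original features still change. The vector values and the helper `layer_norm` are made up for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    # Plain layer normalization without learned scale or bias.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hidden state of the small model (hypothetical values).
h = [1.0, 2.0, 3.0, 4.0]

# After widening, assume the new features are exactly zero.
h_grown = h + [0.0, 0.0, 0.0, 0.0]

old = layer_norm(h)
new = layer_norm(h_grown)[:len(h)]  # compare only the original features

# The mean and variance are now computed over eight entries instead of four,
# so the outputs for the original features differ from before.
max_diff = max(abs(a - b) for a, b in zip(old, new))
print(max_diff)
```

The difference is far from negligible here, which suggests that an operator needs to control more than the values of the new weights to keep the function unchanged.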

To address these issues, the authors propose Masked Structural Growth (MSG), which includes:

  1. Growth schedules involving all possible dimensions: MSG explores growth schedules that progressively increase the depth, width, and attention heads of the Transformer model.

  2. Strictly function-preserving growth operators: MSG introduces growth operators that strictly preserve the function of the model, independent of the initialization of new weights. This allows for more efficient training dynamics.
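The masking idea behind the second point can be sketched on a toy two-layer network. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mlp`, `grow_hidden`, `open_mask`), the sizes, and the linear way the mask is opened are all hypothetical. New hidden units receive arbitrary random weights but are multiplied by a mask of 0, so the output is unchanged at the moment of growth; the mask can then be raised toward 1 during training.

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mlp(x, W1, W2, mask):
    # Hidden activations are multiplied element-wise by the mask
    # before the second layer consumes them.
    h = relu(matvec(W1, x))
    h = [m * v for m, v in zip(mask, h)]
    return matvec(W2, h)

def grow_hidden(W1, W2, mask, extra, rng):
    # Append `extra` hidden units with arbitrary random weights.
    d_in = len(W1[0])
    new_W1 = W1 + [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(extra)]
    new_W2 = [row + [rng.gauss(0, 1) for _ in range(extra)] for row in W2]
    new_mask = mask + [0.0] * extra  # new units start fully masked
    return new_W1, new_W2, new_mask

def open_mask(mask, n_old, t):
    # Keep old entries as they are and set the new entries to t in [0, 1].
    return mask[:n_old] + [t] * (len(mask) - n_old)

rng = random.Random(0)
d_in, d_hid, d_out = 4, 3, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hid)] for _ in range(d_out)]
mask = [1.0] * d_hid
x = [rng.gauss(0, 1) for _ in range(d_in)]

before = mlp(x, W1, W2, mask)
W1g, W2g, mask_g = grow_hidden(W1, W2, mask, extra=5, rng=rng)
after = mlp(x, W1g, W2g, mask_g)

# The masked units contribute exactly zero, whatever their weights are.
assert all(abs(a - b) < 1e-12 for a, b in zip(before, after))

# Later in training the mask would be opened gradually, e.g. t = 0.1, 0.2, ...
half_open = mlp(x, W1g, W2g, open_mask(mask_g, d_hid, 0.5))
```

Because preservation comes from the mask rather than from the new weights, the new weights can be initialized in whatever way suits optimization.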

The authors evaluate MSG by pre-training different types of language models. The results show up to a 2.2x speedup in pre-training with comparable or better downstream performance.
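To get a feel for where savings from progressive growth can come from, the sketch below estimates the compute of a multi-stage schedule relative to training the full-size model for every step. Everything in it is an assumption made for illustration: the `Stage` structure, the three-stage schedule, the step counts, and the common approximation of about 12 · layers · hidden² parameters for the Transformer blocks (embeddings ignored), with per-step cost taken as proportional to parameter count. It does not model how fast each stage learns, so it is not a way to reproduce the reported speedup.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    start_step: int  # training step at which this structure takes effect
    layers: int
    hidden: int
    heads: int       # listed for completeness; does not change the estimate

def approx_params(stage):
    # Roughly 4*d^2 for attention plus 8*d^2 for a 4x feed-forward block.
    return 12 * stage.layers * stage.hidden ** 2

def relative_compute(schedule, total_steps):
    # Cost of the grown run divided by the cost of training the final
    # structure for all steps, assuming cost per step scales with parameters.
    final = approx_params(schedule[-1])
    cost = 0
    for i, stage in enumerate(schedule):
        end = schedule[i + 1].start_step if i + 1 < len(schedule) else total_steps
        cost += approx_params(stage) * (end - stage.start_step)
    return cost / (final * total_steps)

# Hypothetical schedule that grows depth, width, and heads together.
schedule = [
    Stage(start_step=0, layers=3, hidden=256, heads=4),
    Stage(start_step=1000, layers=6, hidden=512, heads=8),
    Stage(start_step=3000, layers=12, hidden=768, heads=12),
]

ratio = relative_compute(schedule, total_steps=10000)
print(f"relative compute: {ratio:.3f}")
```

A ratio below 1 only indicates cheaper steps early on; whether the grown model reaches the same quality in the same number of steps is exactly what the schedule and the growth operators have to ensure.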

Critical Analysis

The paper provides a thorough analysis of the limitations of existing progressive growth methods and proposes a novel approach to address them. The authors have carefully designed the growth schedules and operators to improve the efficiency of the pre-training process.

However, the paper does not discuss the potential limitations or drawbacks of the MSG approach. For example, it would be interesting to understand the impact of the growth schedule on the final model's performance, or whether there are any trade-offs between the speed-up and other model characteristics.

Additionally, the authors could have explored the generalizability of MSG beyond language models, as the progressive growth approach could potentially be applicable to other types of large neural networks as well.

Overall, the research represents a significant contribution to the field of accelerating large language model pre-training, and the open-sourcing of the code is a valuable resource for the community.

Conclusion

This paper presents a novel approach, Masked Structural Growth (MSG), to speed up the pre-training of large language models. By designing efficient growth schedules and operators, MSG can achieve up to 2.2x speedup in pre-training while maintaining comparable or better downstream performance.

The key innovations of MSG, namely growth schedules that cover all possible growth dimensions and strictly function-preserving growth operators, address important limitations in existing progressive growth methods. This work represents a significant advance in accelerating the development of large and powerful language models, which are crucial for a wide range of applications in generative software engineering and beyond.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Multi-Level Framework for Accelerating Training Transformer Models

Longwei Zou, Han Zhang, Yangdong Deng

The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained for fast convergence, and its trained parameters provide high-quality intermediate solutions for the next-level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.

4/15/2024

Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

Abhishek Aich, Yumin Suh, Samuel Schulter, Manmohan Chandraker

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses ~50% of its compute only on the transformer encoder. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer. With this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged-in to the Mask2Former-style segmentation architectures to significantly reduce the computational cost. The underlying principle of PRO-SCALE is: progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance (~52% GFLOPs reduction with no drop in performance on COCO dataset). We validate our framework on multiple public benchmarks.

4/24/2024

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

4/12/2024

Accelerating Transformer Pre-Training with 2:4 Sparsity

Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

Training large Transformers is slow, but recent innovations in GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In light of this property, we comprehensively investigate the feasibility of accelerating feed-forward networks (FFNs) of Transformers in pre-training. First, we define a flip rate to monitor the stability of a 2:4 training process. Utilizing this metric, we suggest two techniques to preserve accuracy: to modify the sparse-refined straight-through estimator by applying the mask decay term on gradients, and to enhance the model's quality by a simple yet effective dense fine-tuning procedure near the end of pre-training. Besides, we devise two effective techniques to practically accelerate training: to calculate transposable 2:4 mask by convolution, and to accelerate gated activation functions by reducing GPU L2 cache miss. Experiments show that a combination of our methods reaches the best performance on multiple Transformers among different 2:4 training methods, while actual acceleration can be observed on different shapes of Transformer block.

4/3/2024