CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization

2405.14377

Published 5/24/2024 by Zi Yang, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang

🏋️

Abstract

Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPUs and computing time. The high training cost has become only affordable to big tech companies, meanwhile also causing increasing concerns about the environmental impact. This paper presents CoMERA, a Computing- and Memory-Efficient training method via Rank-Adaptive tensor optimization. CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation, and improves the training to provide both a high compression ratio and excellent accuracy in the training process. Our optimized numerical computation (e.g., optimized tensorized embedding and tensor-vector contractions) and GPU implementation eliminate part of the run-time overhead in the tensorized training on GPU. This leads to, for the first time, $2-3times$ speedup per training epoch compared with standard training. CoMERA also outperforms the recent GaLore in terms of both memory and computing efficiency. Specifically, CoMERA is $2times$ faster per training epoch and $9times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training. With further HPC optimization, CoMERA may significantly reduce the training cost of large language models.

Create account to get full access

Overview

The paper presents CoMERA, a new method for efficiently training large AI models like recommendation systems and language models.
Training these models is extremely computationally expensive, requiring massive GPU resources, which has made it affordable only for large tech companies.
CoMERA aims to significantly reduce the training cost and environmental impact through a novel rank-adaptive tensor optimization approach.

Plain English Explanation

The development of powerful AI models, such as deep learning recommendation systems and foundation language models, has been driving impressive advancements in various fields. However, training these large models requires massive amounts of computing power and GPU resources, which can be prohibitively expensive and have a significant environmental impact.

CoMERA is a new method that addresses this challenge by optimizing the training process to be more computationally and memory-efficient. It uses a rank-adaptive tensor optimization approach, which means it can compress the tensors (multi-dimensional arrays) used in the model during training without sacrificing accuracy. This allows for faster training and reduced resource requirements.

The paper also describes optimized numerical computations and GPU implementations that further reduce the runtime overhead of the training process. As a result, CoMERA can achieve a 2-3x speedup per training epoch compared to standard training methods, as well as significant memory savings, outperforming recent approaches like GaLore.

With further optimization, the authors believe that CoMERA could significantly reduce the training cost of large language models, making them more accessible to a wider range of researchers and organizations, and potentially reducing the environmental impact of AI development.

Technical Explanation

The key innovation in the CoMERA method is the end-to-end rank-adaptive tensor-compressed training approach, which is achieved through a multi-objective optimization formulation. This allows the model to adapt the rank (complexity) of the tensors used in the training process, balancing the need for high compression ratios with maintaining excellent training accuracy.

The paper also describes several technical optimizations that contribute to the performance improvements:

Optimized tensorized embedding: The way the model represents and processes textual inputs is optimized to take advantage of the tensor-based computations.
Tensor-vector contractions: The authors have developed efficient ways of performing the key mathematical operations involved in the training process.
GPU implementation: The CoMERA method is carefully engineered to take full advantage of GPU hardware, reducing the runtime overhead.

These optimizations, combined with the rank-adaptive tensor compression, lead to the reported 2-3x speedup per training epoch and significant memory savings compared to standard training approaches, as well as outperforming recent methods like GaLore and CompactifAI.

Critical Analysis

The paper presents a strong and comprehensive approach to improving the efficiency of training large AI models. The authors have addressed a critical challenge in the field and provided a novel solution that demonstrates significant performance improvements.

However, the paper also acknowledges some limitations and areas for further research:

The experiments were conducted on a relatively small-scale transformer model, and the authors note that further work is needed to scale the approach to the largest language models.
The paper does not provide a detailed analysis of the environmental impact of the CoMERA method, which would be an important consideration given the focus on reducing the training cost and resource requirements.
The authors mention that additional HPC optimization could further improve the performance of CoMERA, but they do not provide specifics on what these optimizations might entail.

It would be valuable for future research to explore these areas in more depth and provide a more comprehensive evaluation of the real-world implications and potential benefits of the CoMERA approach.

Conclusion

The CoMERA method presented in this paper represents a significant advancement in the quest to make the training of large AI models more computationally and memory-efficient. By employing a novel rank-adaptive tensor optimization approach, along with a range of technical optimizations, the authors have demonstrated the ability to achieve substantial speedups and memory savings compared to standard training methods.

This work has the potential to make the development of powerful AI models, such as recommendation systems and language models, more accessible to a wider range of researchers and organizations, while also reducing the environmental impact of AI development. As the field of AI continues to advance, innovative approaches like CoMERA will be crucial in ensuring that these advancements are sustainable and accessible to all.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

6/4/2024

cs.LG

AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang

Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.

6/27/2024

cs.CL cs.AI cs.LG

ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu

It is extremely memory-hungry to train Large Language Models (LLM). To solve this problem, existing work exploits the combination of CPU and GPU for the training process, such as ZeRO-Offload. Such a technique largely democratizes billion-scale model training, making it possible to train with few consumer graphics cards. However, based on our observation, existing frameworks often provide coarse-grained memory management and require experienced experts in configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$times$ to 2.71$times$ compared to the SOTA training systems.

6/13/2024

cs.DC cs.AI cs.LG cs.PF

🤖

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun

Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators have created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.

5/14/2024

cs.AR cs.AI