Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer

Read original: arXiv:2402.15173 - Published 9/4/2024 by Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang

Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer

Overview

The paper proposes a novel optimizer called Hessian-Informed Zeroth-Order Optimizer (HIZO) for fine-tuning large language models (LLMs) efficiently.
HIZO leverages Hessian information to guide the zeroth-order updates, enabling faster convergence compared to traditional zeroth-order methods.
The approach aims to overcome the challenges of second-order fine-tuning, such as high memory and computation costs, while maintaining the benefits of second-order optimization.

Plain English Explanation

The paper introduces a new optimization technique called the Hessian-Informed Zeroth-Order Optimizer (HIZO) for fine-tuning large language models (LLMs). Fine-tuning is the process of adapting a pre-trained model to a specific task or dataset, which is essential for getting good performance on new applications.

Traditional fine-tuning methods often use first-order optimizers, like gradient descent, which only consider the slope of the objective function. In contrast, second-order optimizers, like Newton's method, also consider the curvature of the function, which can lead to faster convergence. However, second-order methods can be computationally expensive and memory-intensive, especially for large models like LLMs.

The HIZO approach aims to combine the benefits of second-order optimization with the efficiency of zeroth-order methods, which only require function evaluations, not gradients. By using an approximation of the Hessian matrix (the matrix of second-order derivatives) to guide the zeroth-order updates, HIZO can achieve faster convergence than traditional zeroth-order methods without the high computational cost of full second-order optimization.

The authors demonstrate the effectiveness of HIZO on several LLM fine-tuning tasks, showing that it can achieve better performance than first-order and zeroth-order baselines while requiring less computation and memory.

Technical Explanation

The paper proposes the Hessian-Informed Zeroth-Order Optimizer (HIZO), a novel optimization method for efficiently fine-tuning large language models (LLMs). HIZO combines the benefits of second-order optimization, which leverages curvature information, with the computational efficiency of zeroth-order methods, which only require function evaluations.

The key idea behind HIZO is to use an approximation of the Hessian matrix to guide the zeroth-order updates. Specifically, the authors derive an estimate of the Hessian-vector product, which is then used to update the model parameters in a direction that considers both the gradient and the curvature information. This Hessian-informed update direction allows HIZO to converge faster than traditional zeroth-order methods, such as random search or evolutionary strategies, while avoiding the high memory and computation costs of full second-order optimization.

The experiments evaluate HIZO on various LLM fine-tuning tasks, including language modeling, text classification, and question answering. The results show that HIZO outperforms first-order optimizers like SGD and Adam, as well as zeroth-order baselines, in terms of both final performance and convergence speed. Importantly, HIZO achieves these improvements while requiring less memory and computation than full second-order methods.

Critical Analysis

The paper presents a promising approach to fine-tuning LLMs efficiently by leveraging Hessian information in a zeroth-order optimization framework. The key strengths of the HIZO method are its ability to capture curvature information without the high computational cost of full second-order optimization, as well as its demonstrated superior performance compared to first-order and zeroth-order baselines.

However, the paper also acknowledges several limitations of the proposed approach. First, the Hessian approximation used in HIZO may not always be accurate, particularly in high-dimensional settings, which could impact the optimization performance. Additionally, the method requires computing Hessian-vector products, which can still be computationally expensive for large models.

The authors also suggest future research directions, such as exploring more efficient Hessian approximation techniques and investigating the tradeoffs between the quality of the Hessian estimate and the overall optimization performance. It would also be valuable to study the behavior of HIZO on a wider range of LLM fine-tuning tasks and datasets to better understand its generalization capabilities.

Conclusion

The Hessian-Informed Zeroth-Order Optimizer (HIZO) presented in this paper offers a promising approach to efficiently fine-tuning large language models. By leveraging an approximation of the Hessian matrix to guide the zeroth-order updates, HIZO can achieve faster convergence than traditional zeroth-order methods while avoiding the high memory and computation costs of full second-order optimization.

The experimental results demonstrate the effectiveness of HIZO on various LLM fine-tuning tasks, suggesting that it could be a valuable tool for researchers and practitioners working with large language models. While the method has some limitations, the paper outlines several promising directions for future research that could further improve the performance and applicability of Hessian-informed zeroth-order optimization for LLMs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer

Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang

Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizer for fine-tuning LLMs. What's more, HiZOO avoids the expensive memory cost and only increases one forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.

9/4/2024

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu

Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes. However, the application of ZO fine-tuning in memory-constrained settings such as mobile phones and laptops is still challenging since full precision forward passes are infeasible. In this study, we address this limitation by integrating sparsity and quantization into ZO fine-tuning of LLMs. Specifically, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO. This approach allows the majority of un-tuned parameters to be quantized to accommodate the constraint of limited device memory. Our findings reveal that the pre-training process can identify a set of sensitive parameters that can guide the ZO fine-tuning of LLMs on downstream tasks. Our results demonstrate that fine-tuning 0.1% sensitive parameters in the LLM with ZO can outperform the full ZO fine-tuning performance, while offering wall-clock time speedup. Additionally, we show that ZO fine-tuning targeting these 0.1% sensitive parameters, combined with 4 bit quantization, enables efficient ZO fine-tuning of an Llama2-7B model on a GPU device with less than 8 GiB of memory and notably reduced latency.

6/6/2024

🛠️

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow {in size}, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques, through a comprehensive, first-of-its-kind benchmarking study across five LLM families (Roberta, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving further memory-efficient LLM fine-tuning. Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM .

5/29/2024

AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang

Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.

6/27/2024