LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning






Published 5/28/2024 by Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang


The machine learning community has witnessed impressive advancements since large language models (LLMs) first appeared. Yet, their massive memory consumption has become a significant roadblock to large-scale training. For instance, a 7B model typically requires at least 60 GB of GPU memory with full parameter training, which presents challenges for researchers without access to high-resource environments. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem. However, in most large-scale fine-tuning settings, their performance does not reach the level of full parameter training because they confine the parameter search to a low-rank subspace. Attempting to complement this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freezes most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by over 10%-35% in terms of MT-Bench score while achieving on-par or better performance in MMLU, AGIEval and WinoGrande. On large models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.

Plain English Explanation

Large language models have become powerful tools for a wide range of natural language processing tasks. However, fine-tuning them is memory-intensive and often requires hardware that is not accessible to all researchers and developers. Parameter-efficient methods such as LoRA lower the memory cost, but they restrict updates to a low-rank subspace and often fall short of full-parameter training.

LISA addresses this challenge with a technique called "layerwise importance sampling". At any point during fine-tuning, only a small, randomly sampled subset of the model's layers is updated, while most middle layers stay frozen. Because gradients and optimizer state are needed only for the layers currently being trained, the memory footprint drops to roughly that of LoRA.

The key idea behind LISA comes from an observation about LoRA. When the authors examined LoRA fine-tuning layer by layer, they found that weight norms were consistently skewed across layers. LISA turns this observation into a simple training strategy that randomly freezes most middle layers during optimization. According to the paper, this strategy matches or exceeds both LoRA and full-parameter fine-tuning in a wide range of settings, which makes it possible to fine-tune larger models on constrained hardware.

Technical Explanation

LISA, short for Layerwise Importance Sampled AdamW, applies the idea of importance sampling to the layers of a large language model. Rather than updating every parameter at each optimization step, it updates only a sampled subset of layers and keeps the rest frozen.

The method is motivated by an analysis of LoRA's layerwise behavior: across fine-tuning tasks, the authors observe an unexpected but consistent skewness of weight norms across layers. LISA uses this observation to decide how often each layer is trained, and in practice it randomly freezes most middle layers during optimization. Layer selection is therefore stochastic rather than a fixed ranking of layers.
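The layer-selection step can be sketched in a few lines of Python. This is a minimal sketch that follows the abstract's description of randomly freezing most middle layers, not the authors' implementation. It assumes uniform sampling without replacement and a fixed resampling period; the names `sample_trainable_layers`, `layer_schedule`, `num_active`, and `period` are hypothetical. How the non-middle layers (for example the embedding and output head) are treated is not stated in the text above, so the sketch leaves them out.

```python
import random


def sample_trainable_layers(num_middle_layers, num_active, rng):
    """Pick the middle layers that are trainable for the next period.

    Assumption: uniform sampling without replacement. All other middle
    layers are treated as frozen until the next resampling.
    """
    if not 0 <= num_active <= num_middle_layers:
        raise ValueError("num_active must be between 0 and num_middle_layers")
    return sorted(rng.sample(range(num_middle_layers), num_active))


def layer_schedule(total_steps, period, num_middle_layers, num_active, seed=0):
    """Yield (step, active_layers), resampling the active set every `period` steps."""
    if period <= 0:
        raise ValueError("period must be positive")
    rng = random.Random(seed)
    active = []
    for step in range(total_steps):
        if step % period == 0:
            # Start of a new period: draw a fresh set of trainable layers.
            active = sample_trainable_layers(num_middle_layers, num_active, rng)
        yield step, active


if __name__ == "__main__":
    # Hypothetical settings: 32 middle layers, 2 trainable at a time.
    for step, active in layer_schedule(6, 3, 32, 2):
        print(step, active)
```

In a real training loop, the indices returned for each step would be used to enable gradients on the chosen layers and disable them on the rest before the optimizer step.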

During fine-tuning, only the currently selected layers receive gradient updates, so gradients and AdamW optimizer state are kept for a small fraction of the parameters. The authors report that LISA's memory cost is as low as LoRA's, whereas full-parameter training of a 7B model typically requires at least 60 GB of GPU memory.
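The memory effect of updating only some layers can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope model, not a figure from the paper. It assumes 2-byte weights and gradients, two 4-byte AdamW moment buffers per trainable parameter, and it ignores activations. The function name `training_memory_gb` and the 10% trainable fraction in the usage lines are hypothetical.

```python
def training_memory_gb(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Rough memory for weights, gradients, and AdamW state, in GB (1e9 bytes).

    Assumptions (illustrative, not from the paper): every parameter stores
    its weight; only trainable parameters store a gradient and two AdamW
    moment buffers; activation memory is ignored.
    """
    if not 0 <= trainable_params <= total_params:
        raise ValueError("trainable_params must be between 0 and total_params")
    weights = total_params * weight_bytes
    train_state = trainable_params * (grad_bytes + optimizer_bytes)
    return (weights + train_state) / 1e9


if __name__ == "__main__":
    full = training_memory_gb(7e9, 7e9)       # every parameter trainable
    partial = training_memory_gb(7e9, 0.7e9)  # hypothetical: 10% trainable
    print(f"full: {full:.1f} GB, partial: {partial:.1f} GB")
```

Under these assumptions the weights are a fixed cost, and the gradient and optimizer terms shrink in proportion to the number of trainable parameters, which is why freezing most layers reduces memory.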

The authors evaluate LISA on downstream fine-tuning benchmarks, including MT-Bench, MMLU, AGIEval, and WinoGrande. With similar or lower GPU memory consumption, LISA outperforms LoRA by 10%-35% in MT-Bench score and performs on par or better on MMLU, AGIEval, and WinoGrande. On the larger LLaMA-2-70B model, LISA surpasses LoRA on MT-Bench, GSM8K, and PubMedQA, which suggests that the method holds up across different domains.

Critical Analysis

The LISA approach presented in this paper is a promising step towards more memory-efficient fine-tuning of large language models. By updating only a sampled subset of layers at a time, LISA can significantly reduce the memory footprint of the fine-tuning process, making it possible to work with larger models on constrained hardware.

One potential limitation of LISA is that its layerwise sampling scheme may not reflect the true importance of each layer for a given task. More sophisticated schemes, for example ones that incorporate task-specific information or gradient-based signals, are a natural direction for further research.

Additionally, the paper does not address the potential for the LISA approach to introduce unwanted biases or performance degradation in certain scenarios. It would be valuable to explore the robustness of LISA-based fine-tuning, particularly in sensitive domains or when dealing with underrepresented data.

Overall, the LISA technique represents an important contribution to the field of large language model optimization, and the authors' efforts to reduce the memory footprint of fine-tuning are commendable. As the size and complexity of these models continue to grow, techniques like LISA will become increasingly important for enabling their widespread adoption and deployment.

Conclusion


The LISA paper presents a novel approach for fine-tuning large language models in a more memory-efficient manner. By applying layerwise importance sampling and randomly freezing most middle layers during optimization, LISA brings memory costs down to the level of LoRA while matching or exceeding the performance of both LoRA and full-parameter fine-tuning in the reported experiments.

The authors' work on LISA demonstrates the potential for fine-tuning large language models on constrained hardware, opening up new opportunities for applying these systems in a wider range of real-world applications. As the field of natural language processing continues to evolve, techniques like LISA will likely play an increasingly important role in enabling the scalable and efficient use of large language models across a diverse set of domains.

