DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

Read original: arXiv:2406.06567 - Published 6/12/2024 by Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun

Overview

This paper introduces Decoupled-Head Attention (DHA), an attention mechanism that is derived from a pre-trained multi-head attention (MHA) Transformer checkpoint rather than trained from scratch.
DHA uses "Adaptive Heads Fusion" to merge similar key heads and value heads through linear fusion of their parameters, so the knowledge in the original heads is retained rather than discarded.
The authors report that DHA reaches 97.6% of the original model's performance with 0.25% of its pre-training budget while saving 75% of the KV cache.

Plain English Explanation

Transformer models, such as those used in large language models (LLMs), are powerful but can be computationally expensive to run. The key component of Transformer models is the attention mechanism, which allows the model to focus on the most relevant parts of the input when generating output.

The attention mechanism in a Transformer is split into several "attention heads", each of which computes attention independently. The Improving Transformers with Dynamically Composable Multi-Head Attention paper points out that these independent heads are often redundant, and the CHAI: Clustered Head Attention for Efficient LLM Inference and Effectively Compress KV-Heads in LLM papers explore methods for compressing and managing attention heads efficiently.

The DHA technique proposed in this paper builds on these ideas. Instead of giving every query head its own key head and value head, DHA lets groups of heads share key heads and value heads, and it "decouples" the two: the number of shared key heads and the number of shared value heads can differ from each other and from layer to layer. This lets the model keep most of its performance while significantly reducing the memory and compute cost of the attention mechanism.

The key innovation of DHA is the "Adaptive Heads Fusion" process, which analyzes the pre-trained model to find attention heads that are similar enough to be redundant, and then progressively merges their parameters through linear fusion to create a more efficient attention mechanism that retains what the original heads learned.

By applying DHA, the authors report saving 75% of the KV cache while retaining 97.6% of the original model's performance, at a small fraction of the original pre-training cost. This could have important implications for deploying LLMs on resource-constrained devices or in real-time applications.

Technical Explanation

The DHA technique proposed in this paper aims to learn an efficient attention mechanism from pre-trained Transformer checkpoints. The authors leverage the observation that not all attention heads in a Transformer model are equally important, and that redundant heads can be removed or compressed without significantly impacting the model's performance.

The DHA approach consists of three key steps:

1. Head Redundancy Analysis: The authors analyze the attention heads of the pre-trained MHA checkpoint and observe that similar heads cluster together. This analysis indicates which heads can share parameters with little loss.
2. Adaptive Heads Fusion: Given a target head budget, the authors progressively merge similar heads through linear fusion of their parameters. The sharing of key heads and value heads is configured separately for each layer, striking a balance between model efficiency and performance.
3. Continued Pre-training: The fused model is trained further on a small fraction of the original pre-training budget (0.25% in the reported experiments) so that the compressed attention mechanism recovers the behavior of the original pre-trained Transformer.
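The fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each head's projection weights have been flattened into a plain list of non-zero floats, groups heads greedily by cosine similarity until a target head budget is reached, and fuses each group with a weighted sum (uniform weights here, whereas a real system would learn or tune them). The names `group_similar_heads` and `fuse_heads` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened, non-zero weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fuse_heads(heads, group, weights=None):
    """Linearly fuse the heads indexed by `group` into one shared head.

    Assumption: uniform fusion weights when none are given.
    """
    if weights is None:
        weights = [1.0 / len(group)] * len(group)
    dim = len(heads[group[0]])
    return [sum(w * heads[i][d] for w, i in zip(weights, group)) for d in range(dim)]


def group_similar_heads(heads, budget):
    """Greedily merge the two most similar groups until `budget` groups remain."""
    groups = [[i] for i in range(len(heads))]
    while len(groups) > budget:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Compare groups through their (uniformly) fused representatives.
                sim = cosine(fuse_heads(heads, groups[a]), fuse_heads(heads, groups[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]  # b > a, so index a is unaffected
    return groups


# Toy example: four key heads, target budget of two shared key heads.
key_heads = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
groups = group_similar_heads(key_heads, budget=2)
shared_key_heads = [fuse_heads(key_heads, g) for g in groups]
print(groups)
print(shared_key_heads)
```

Running the same procedure separately on key heads and value heads, with a different budget per layer, gives the kind of decoupled configuration the paper describes; the greedy clustering used here is only one simple way to pick the groups.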

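The authors build DHA models by transforming MHA checkpoints at several scales, given target head budgets. They report that DHA needs only 0.25% of the original model's pre-training budget to reach 97.6% of its performance while saving 75% of the KV cache. Compared with Group-Query Attention (GQA), they report a 5x training acceleration, up to a 13.93% performance improvement under a 0.01% pre-training budget, and a 4% relative improvement under a 0.05% pre-training budget.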
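To make the 75% KV cache saving reported in the paper's abstract concrete, the sketch below estimates cache size from per-layer key-head and value-head counts. The model dimensions (32 layers, 32 heads, head dimension 128, 4096-token context, 2 bytes per value) and the per-layer head budgets are hypothetical numbers chosen for illustration, not the paper's configuration.

```python
def kv_cache_bytes(key_heads_per_layer, value_heads_per_layer, head_dim,
                   seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values across all layers."""
    if len(key_heads_per_layer) != len(value_heads_per_layer):
        raise ValueError("need one key-head and one value-head count per layer")
    # Each cached head stores head_dim values per token.
    head_slots = sum(key_heads_per_layer) + sum(value_heads_per_layer)
    return head_slots * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model dimensions (assumptions, not taken from the paper).
NUM_LAYERS, NUM_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# Standard MHA: every layer caches one key head and one value head per query head.
mha = kv_cache_bytes([NUM_HEADS] * NUM_LAYERS, [NUM_HEADS] * NUM_LAYERS,
                     HEAD_DIM, SEQ_LEN)

# Hypothetical DHA-style budget: key and value head counts differ from each
# other and from layer to layer, but average 8 per layer.
dha_key_heads = [4, 12] * (NUM_LAYERS // 2)
dha_value_heads = [12, 4] * (NUM_LAYERS // 2)
dha = kv_cache_bytes(dha_key_heads, dha_value_heads, HEAD_DIM, SEQ_LEN)

saving = 1 - dha / mha
print(f"MHA: {mha / 2**30:.2f} GiB, DHA: {dha / 2**30:.2f} GiB, saving: {saving:.0%}")
```

Because the cache grows linearly with the number of cached key and value heads, cutting the average from 32 to 8 heads per layer removes three quarters of the cache regardless of how the budget is distributed across layers.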

The Reducing Transformer Key-Value Cache Size via Cross-Attention Decomposition and Efficient and Economic Large Language Model Inference via Attention Decomposition papers have also explored techniques for improving the efficiency of Transformer-based models, such as through attention decomposition and cache size reduction.

Critical Analysis

The DHA technique presented in this paper is a promising approach for improving the efficiency of large language models without compromising their performance. The authors have demonstrated the effectiveness of their method across a range of Transformer-based models, which is a significant contribution.

One potential limitation of the DHA approach is that it relies on the pre-trained Transformer checkpoints, which may not be available or easily accessible for all models. Additionally, the authors do not provide extensive details on the computational complexity of the Adaptive Heads Fusion process, which could be an important consideration for real-world deployment.

Furthermore, the authors do not explore the impact of DHA on the model's generalization capabilities or its ability to handle diverse input distributions. It would be valuable to understand how the compressed attention mechanism affects the model's robustness and ability to adapt to new tasks or domains.

Despite these potential drawbacks, the DHA technique represents an important step forward in improving the efficiency of large language models. The authors' work highlights the potential for further research in this area, particularly in developing more advanced techniques for attention head management and compression.

Conclusion

The DHA paper presents a novel approach for deriving efficient attention mechanisms from pre-trained Transformer models. By leveraging the Adaptive Heads Fusion technique, the authors show that it is possible to cut the KV cache of large language models substantially while retaining most of their performance.

This research has important implications for the deployment of LLMs in resource-constrained environments, such as edge devices or real-time applications. The ability to maintain model accuracy while reducing computational requirements could enable a wider range of applications and facilitate the adoption of advanced language models in practical settings.

As the field of large language models continues to evolve, techniques like DHA will likely play an increasingly important role in making these powerful systems more accessible and efficient. The authors' work serves as a valuable contribution to the ongoing efforts to optimize the performance and scalability of Transformer-based models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun

Large language models (LLMs) with billions of parameters demonstrate impressive performance. However, the widely used Multi-Head Attention (MHA) in LLMs incurs substantial computational and memory costs during inference. While some efforts have optimized attention mechanisms by pruning heads or sharing parameters among heads, these methods often lead to performance degradation or necessitate substantial continued pre-training costs to restore performance. Based on the analysis of attention redundancy, we design a Decoupled-Head Attention (DHA) mechanism. DHA adaptively configures group sharing for key heads and value heads across various layers, achieving a better balance between performance and efficiency. Inspired by the observation of clustering similar heads, we propose to progressively transform the MHA checkpoint into the DHA model through linear fusion of similar head parameters step by step, retaining the parametric knowledge of the MHA checkpoint. We construct DHA models by transforming various scales of MHA checkpoints given target head budgets. Our experiments show that DHA remarkably requires a mere 0.25% of the original model's pre-training budgets to achieve 97.6% of performance while saving 75% of KV cache. Compared to Group-Query Attention (GQA), DHA achieves a 5× training acceleration, a maximum of 13.93% performance improvement under 0.01% pre-training budget, and 4% relative improvement under 0.05% pre-training budget.

6/12/2024

Improving Transformers with Dynamically Composable Multi-Head Attention

Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a Compose function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.

6/5/2024

Optimised Grouped-Query Attention Mechanism for Transformers

Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.

6/24/2024

Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza

The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.

8/29/2024