Improving Transformers with Dynamically Composable Multi-Head Attention

Read original: arXiv:2405.08553 - Published 6/5/2024 by Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

Overview

Multi-Head Attention (MHA) is a key component of Transformer models, but its heads work independently, which causes problems such as a low-rank bottleneck in the attention score matrices and redundancy among heads.
The authors propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles these issues and increases the expressive power of the model by dynamically composing attention heads.
DCMHA uses a Compose function to dynamically transform the attention score and weight matrices in an input-dependent way.
DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture, resulting in a corresponding DCFormer model.
DCFormer significantly outperforms Transformer in language modeling across different architectures and model scales, matching the performance of models with roughly 1.7x to 2.0x the compute.

Plain English Explanation

Transformer models are a type of neural network architecture that has revolutionized natural language processing. A key component of Transformers is Multi-Head Attention (MHA), which allows the model to focus on different parts of the input when generating an output.

However, MHA has some issues. Because the attention heads work independently, each head's attention score matrix is limited by a low-rank bottleneck, and different heads often end up doing redundant work. As a result, the model does not use its full capacity.

To address these problems, the researchers propose a new attention architecture called Dynamically Composable Multi-Head Attention (DCMHA). At the core of DCMHA is a Compose function that dynamically transforms the attention score and weight matrices in a way that's specific to the input. This allows DCMHA to be more efficient and expressive than standard MHA.

The researchers show that replacing the MHA component in a Transformer with DCMHA (creating a "DCFormer" model) results in significant performance improvements on language modeling tasks. In fact, a DCFormer model can match the performance of a Transformer model trained with roughly 1.7 to 2 times the compute.

This is an important advance, as it means we can build more capable language models without having to dramatically increase the amount of computing power required. This could lead to more efficient and accessible AI systems in the future.

Technical Explanation

The core innovation of this work is the Dynamically Composable Multi-Head Attention (DCMHA) architecture, which aims to address the limitations of standard Multi-Head Attention (MHA) used in Transformer models.

In MHA, the attention heads operate independently, which can lead to a low-rank bottleneck in the attention score matrices and redundancy between the heads. DCMHA tackles these issues by dynamically composing the attention heads in an input-dependent way.
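
The low-rank bottleneck mentioned above can be illustrated with a small NumPy sketch. It is not taken from the paper: the sizes and random weights are arbitrary assumptions, chosen only to show that each head's pre-softmax score matrix is a product of two thin matrices, so its rank cannot exceed the per-head dimension even though the matrix itself is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
seq_len, d_model, n_heads = 32, 64, 8
d_head = d_model // n_heads  # 8 dimensions per head

x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((n_heads, d_model, d_head))
w_k = rng.standard_normal((n_heads, d_model, d_head))

# Each head projects the input independently of the other heads.
q = np.einsum("td,hde->hte", x, w_q)  # (heads, seq, d_head)
k = np.einsum("td,hde->hte", x, w_k)  # (heads, seq, d_head)

# Pre-softmax scores: one (seq_len x seq_len) matrix per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

# Every 32 x 32 score matrix has rank at most d_head = 8.
ranks = [int(np.linalg.matrix_rank(s)) for s in scores]
print(ranks)
```

Adding more heads at a fixed model width shrinks `d_head` and tightens this bound, which is one way to see why letting heads share information could help.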

The key component of DCMHA is the Compose function, which transforms the attention score and weight matrices. This composition is performed dynamically based on the input, allowing DCMHA to be more expressive and efficient than standard MHA.
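
To make the idea of composing heads more concrete, here is a deliberately simplified NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`compose`, `dc_attention`), the static-plus-low-rank parameterization, the choice to condition only on the query vector, and the omission of causal masking are all hypothetical simplifications. It shows one way a head-mixing step could be applied to both the attention scores (before softmax) and the attention weights (after softmax).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compose(a, cond, w_static, w_down, w_up):
    """Mix attention matrices across the head axis (hypothetical form).

    a:        (H, T, S) attention scores or weights
    cond:     (T, C) per-query vector that makes the mixing input-dependent
    w_static: (H, H) input-independent head mixing
    w_down:   (C, H * R), w_up: (C, R * H) generate a low-rank mixing per query
    """
    n_heads, seq_len, _ = a.shape
    rank = w_down.shape[1] // n_heads
    mixed = np.einsum("hts,hg->gts", a, w_static)
    down = (cond @ w_down).reshape(seq_len, n_heads, rank)
    up = (cond @ w_up).reshape(seq_len, rank, n_heads)
    low = np.einsum("hts,thr->rts", a, down)   # heads -> R channels
    return mixed + np.einsum("rts,trg->gts", low, up)  # R channels -> heads

def dc_attention(x, w_q, w_k, w_v, pre, post):
    # No causal mask here; a decoder would need to apply one after composing.
    n_heads, _, d_head = w_q.shape
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    cond = q.transpose(1, 0, 2).reshape(x.shape[0], -1)  # concat heads per token
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = compose(scores, cond, *pre)     # compose attention scores
    weights = softmax(scores, axis=-1)
    weights = compose(weights, cond, *post)  # compose attention weights
    out = np.einsum("hts,hse->the", weights, v)
    return out.reshape(x.shape[0], -1)

rng = np.random.default_rng(0)
T, d_model, H, R = 6, 16, 4, 2
d_head = d_model // H
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v = (rng.standard_normal((H, d_model, d_head)) * 0.1 for _ in range(3))

def init_compose_params(scale):
    # Identity static mixing; the dynamic part vanishes when scale == 0.
    return (np.eye(H),
            rng.standard_normal((H * d_head, H * R)) * scale,
            rng.standard_normal((H * d_head, R * H)) * scale)

y = dc_attention(x, w_q, w_k, w_v, init_compose_params(0.1), init_compose_params(0.1))
y_plain = dc_attention(x, w_q, w_k, w_v, init_compose_params(0.0), init_compose_params(0.0))
print(y.shape)
```

With identity static mixing and zero dynamic weights, the sketch reduces to ordinary multi-head attention (`y_plain`), which is the sense in which such a layer can act as a drop-in replacement. The low-rank dynamic term keeps the added parameters small relative to the query, key, and value projections.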

The researchers evaluate DCMHA by using it to replace the MHA component in Transformer architectures, creating corresponding "DCFormer" models. They find that DCFormer significantly outperforms Transformer in language modeling across different architectures and model scales, matching the performance of models with roughly 1.7x to 2.0x the compute.

For example, DCPythia-6.9B outperforms the open-source Pythia-12B on both pretraining perplexity and downstream task evaluation, despite being a substantially smaller model. This illustrates how DCMHA can improve model efficiency and performance.
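
The size gap between these two models can be related to the ~1.7x-2.0x compute figure with a rough back-of-the-envelope calculation. The sketch below rests on assumptions that are not stated in this summary: that both models are trained on the same number of tokens, that training compute follows the common approximation of about 6 FLOPs per parameter per token, and that any extra cost of DCMHA itself is ignored. The token count is a placeholder that cancels in the ratio.

```python
# Parameter counts taken from the model names, in billions.
dcpythia_params = 6.9
pythia_params = 12.0

# Hypothetical token budget; it cancels out in the ratio below.
tokens = 300e9

def train_flops(params_billion, n_tokens):
    # Common rule of thumb (an assumption here): C ~ 6 * N * D.
    return 6 * params_billion * 1e9 * n_tokens

ratio = train_flops(pythia_params, tokens) / train_flops(dcpythia_params, tokens)
print(f"{ratio:.2f}")  # 1.74
```

Under these assumptions the larger model uses about 1.74 times the compute of the smaller one, which sits inside the reported 1.7x to 2.0x range.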

The code and pre-trained models for DCFormer are available on GitHub at https://github.com/Caiyun-AI/DCFormer.

Critical Analysis

The DCMHA approach presented in this paper is a novel and promising solution to the limitations of standard MHA in Transformer models. By dynamically composing the attention heads, the model can better capture the complex relationships in the input data, leading to significant performance improvements.

One potential limitation is that the evaluation summarized here centers on language modeling, measured through pretraining perplexity and downstream task evaluation. It would be valuable to see targeted results on specific natural language processing tasks, such as question answering. Additionally, exploring the application of DCMHA to other domains, such as computer vision, could further demonstrate the versatility and generalizability of the approach.

Another area for potential investigation is the interpretability of the DCMHA mechanism. Understanding how the Compose function dynamically transforms the attention matrices and what insights this provides about the model's decision-making process could be valuable for building more transparent and explainable AI systems.

Overall, the DCMHA approach presented in this paper is a significant contribution to the field of Transformer architectures, and the impressive performance gains demonstrated suggest that it is a promising direction for future research and development in efficient and powerful AI models.

Conclusion

The Dynamically Composable Multi-Head Attention (DCMHA) architecture proposed in this paper addresses key limitations of standard Multi-Head Attention (MHA) used in Transformer models. By dynamically composing the attention heads in an input-dependent way, DCMHA can achieve greater expressive power and efficiency than MHA.

The researchers show that replacing the MHA component in Transformer architectures with DCMHA (creating "DCFormer" models) results in significant performance improvements on language modeling tasks. In fact, a DCFormer model can match the performance of a Transformer model with roughly 1.7x to 2.0x the compute, demonstrating the potential of DCMHA to enable more efficient and capable AI systems.

This work represents an important advancement in Transformer architecture design and highlights the value of exploring novel attention mechanisms to enhance the effectiveness of large language models. As the demand for powerful and efficient AI continues to grow, innovations like DCMHA will play a crucial role in shaping the future of natural language processing and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Transformers with Dynamically Composable Multi-Head Attention

Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a Compose function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.

6/5/2024

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun

Large language models (LLMs) with billions of parameters demonstrate impressive performance. However, the widely used Multi-Head Attention (MHA) in LLMs incurs substantial computational and memory costs during inference. While some efforts have optimized attention mechanisms by pruning heads or sharing parameters among heads, these methods often lead to performance degradation or necessitate substantial continued pre-training costs to restore performance. Based on the analysis of attention redundancy, we design a Decoupled-Head Attention (DHA) mechanism. DHA adaptively configures group sharing for key heads and value heads across various layers, achieving a better balance between performance and efficiency. Inspired by the observation of clustering similar heads, we propose to progressively transform the MHA checkpoint into the DHA model through linear fusion of similar head parameters step by step, retaining the parametric knowledge of the MHA checkpoint. We construct DHA models by transforming various scales of MHA checkpoints given target head budgets. Our experiments show that DHA remarkably requires a mere 0.25% of the original model's pre-training budgets to achieve 97.6% of performance while saving 75% of KV cache. Compared to Group-Query Attention (GQA), DHA achieves a 5$\times$ training acceleration, a maximum of 13.93% performance improvement under 0.01% pre-training budget, and 4% relative improvement under 0.05% pre-training budget.

6/12/2024

Attention as a Hypernetwork

Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is highly structured, capturing information about the subtasks performed by the network. Using the framework of attention as a hypernetwork we further propose a simple modification of multi-head linear attention that strengthens the ability for compositional generalization on a range of abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven Progressive Matrices human intelligence test on which we demonstrate how scaling model size and data enables compositional generalization and gives rise to a functionally structured latent code in the transformer.

6/24/2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can be even combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.

7/24/2024