You Only Cache Once: Decoder-Decoder Architectures for Language Models

2405.05254

Published 5/10/2024 by Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei
Abstract

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.

Overview

  • This paper introduces "You Only Cache Once" (YOCO), a decoder-decoder architecture for large language models that consists of a cross-decoder stacked upon a self-decoder.
  • The key idea is that the self-decoder encodes global key-value (KV) caches once, and the cross-decoder reuses them via cross-attention. The overall model still behaves like a decoder-only Transformer.
  • According to the abstract, this design substantially reduces GPU memory demands while retaining global attention, and it lets the prefill stage exit early without changing the final output, which speeds up prefilling.
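To make the memory claim concrete, here is a rough back-of-the-envelope sketch. It compares a decoder-only Transformer in which every layer keeps a full-length KV cache with a cache-once layout in which a single global KV cache is shared. The function name `kv_cache_bytes` and the configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values, 1M tokens) are hypothetical and not taken from the paper. The sketch also ignores any additional cache the self-decoder keeps for itself, since its size depends on design details the abstract does not give.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, cached_layers=None):
    """Bytes needed to hold keys and values for `seq_len` tokens.

    `cached_layers` is how many layers each store a full-length cache;
    it defaults to every layer, as in a standard decoder-only Transformer.
    """
    if cached_layers is None:
        cached_layers = n_layers
    # Factor 2 accounts for storing both keys and values.
    return 2 * cached_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only.
config = dict(seq_len=1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)

standard = kv_cache_bytes(**config)                     # every layer caches
cache_once = kv_cache_bytes(**config, cached_layers=1)  # one shared global cache

print(f"per-layer caching: {standard / 2**30:.1f} GiB")    # 122.1 GiB
print(f"cache once:        {cache_once / 2**30:.1f} GiB")  # 3.8 GiB
print(f"ratio:             {standard // cache_once}x")     # 32x
```

Under these assumptions the saving scales with the number of layers, because the sequence-length-dependent cache is stored once rather than once per layer.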

Plain English Explanation

The paper presents a new way to design language models, which are artificial intelligence systems that can understand and generate human language. Most current large language models are decoder-only Transformers. As they generate text, every layer keeps a cache of keys and values for all previous tokens so that it does not have to recompute them.

The researchers behind this paper propose a different layout called "You Only Cache Once" (YOCO). The model is split into two stacked parts. The lower part, the self-decoder, reads the text and produces a single set of key-value pairs. The upper part, the cross-decoder, reuses that one cache through cross-attention instead of building its own. This reduces the GPU memory the model needs at inference time.

Because the cache is built only once, YOCO avoids storing a separate full-length cache in every layer, which becomes a significant bottleneck as inputs get longer. The abstract reports that the approach extends to a context length of 1M tokens with near-perfect needle retrieval accuracy.

Technical Explanation

The YOCO architecture is a decoder-decoder model: a cross-decoder stacked upon a self-decoder. Although it has two components, the overall model behaves like a decoder-only Transformer, which relates it to other work on decoder-only models such as Towards Smaller, Faster Decoder-Only Transformers and Decoder-Only Foundation Model for Time Series Forecasting. The key innovation of YOCO is that key-value (KV) pairs are cached only once and shared, rather than cached separately at every layer.

The self-decoder efficiently encodes the input into global KV caches. The cross-decoder then attends to those caches through cross-attention, so it does not need to build KV caches of its own. Because the shared caches depend only on the self-decoder, the prefill stage can exit early without changing the final output, which significantly speeds up prefilling.
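The data flow described in the abstract (a self-decoder that produces one global KV cache, reused by every cross-decoder layer) can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the authors' implementation. The class `ToyYOCO`, the sizes `D` and `WINDOW`, and the choice of sliding-window attention for the self-decoder are all hypothetical. Multi-head attention, feed-forward blocks, normalization, and position encodings are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16       # model width (illustrative)
WINDOW = 4   # sliding-window size for the self-decoder (an assumption)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] is True where query i may see key j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v


def causal_mask(n_q, n_k, window=None):
    """Causal mask for the last n_q positions of an n_k-token sequence."""
    q_pos = np.arange(n_k - n_q, n_k)[:, None]
    k_pos = np.arange(n_k)[None, :]
    mask = k_pos <= q_pos
    if window is not None:
        mask &= k_pos > q_pos - window   # only the most recent `window` tokens
    return mask


class ToyYOCO:
    def __init__(self, n_self=2, n_cross=2):
        def w():
            return rng.normal(scale=D ** -0.5, size=(D, D))
        self.self_layers = [(w(), w(), w()) for _ in range(n_self)]
        self.w_k, self.w_v = w(), w()               # project to the single global KV cache
        self.cross_q = [w() for _ in range(n_cross)]  # cross layers only need queries

    def self_decoder(self, x):
        n = len(x)
        for wq, wk, wv in self.self_layers:
            x = x + attend(x @ wq, x @ wk, x @ wv, causal_mask(n, n, WINDOW))
        return x

    def cross_decoder(self, h, k, v):
        # h holds the last len(h) positions; k and v are the shared global cache.
        for wq in self.cross_q:
            h = h + attend(h @ wq, k, v, causal_mask(len(h), len(k)))
        return h

    def forward(self, x):
        h = self.self_decoder(x)
        k, v = h @ self.w_k, h @ self.w_v   # computed once, reused by every cross layer
        return self.cross_decoder(h, k, v), (k, v)

    def prefill_early_exit(self, x):
        # Prefill: run the self-decoder on the whole prompt, but run the
        # cross-decoder only for the last position, which is all that is
        # needed to predict the next token.
        h = self.self_decoder(x)
        k, v = h @ self.w_k, h @ self.w_v
        return self.cross_decoder(h[-1:], k, v), (k, v)


model = ToyYOCO()
x = rng.normal(size=(10, D))
full, (k, v) = model.forward(x)
last, _ = model.prefill_early_exit(x)
print(np.allclose(full[-1], last[0]))   # early exit matches the full pass at the last token
```

In this sketch each cross-decoder position depends only on its own hidden state and the shared cache, which is why skipping the cross-decoder for earlier prompt positions leaves the last position's output unchanged.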

The researchers compare YOCO with Transformer across various settings of scaling up model size and number of training tokens, and report favorable performance. They also extend YOCO to a 1M context length with near-perfect needle retrieval accuracy. Their profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes.

Critical Analysis

The YOCO architecture represents an interesting approach to improving the efficiency of language models, and the researchers' results suggest that it can be a viable alternative to standard decoder-only Transformers. However, there are a few potential limitations and areas for further research that could be explored:

  • The caching mechanism introduced in YOCO may not be effective for all types of language modeling tasks, particularly those that require more complex interactions between the input and output sequences. LLOCO: Learning Long Contexts Offline and SNAP-KV: LLM Knows What You Are Looking have explored alternative approaches to handling long-range dependencies in language models.

  • The YOCO architecture may also be less effective for tasks that require more dynamic or flexible processing of the input, such as MO-YOLO: End-to-End Multiple Object detection or other multi-task learning scenarios.

  • It would be interesting to see how the YOCO architecture compares to other decoder-only models, such as the ones explored in Towards Smaller, Faster Decoder-Only Transformers, in terms of performance, efficiency, and scalability.

Overall, the YOCO architecture represents a promising approach to improving the efficiency of language models, and the researchers' work provides valuable insights into the design and optimization of these important AI systems.

Conclusion

The YOCO architecture introduced in this paper is a decoder-decoder design aimed at making large language models more efficient at inference time. By having the self-decoder produce global KV caches once and letting the cross-decoder reuse them through cross-attention, YOCO reduces GPU memory demands and speeds up the prefill stage, making long-context models more scalable and practical for real-world applications.

While the YOCO architecture may not be a perfect solution for all language modeling tasks, the researchers' work highlights the importance of continued innovation in this field. As AI systems become more powerful and ubiquitous, it is crucial that we find ways to make them more efficient and sustainable, without sacrificing their performance or capabilities. The YOCO architecture is a promising step in this direction, and it will be interesting to see how it evolves and is applied in future language modeling research and applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Haoyi Wu, Kewei Tu

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26$\times$ higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.

6/5/2024

Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster

Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen

In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding without any further training of an auxiliary model or modification to the LLM itself. FastCoT uses a size-varying context window whose size changes with position to conduct parallel decoding and auto-regressive decoding simultaneously, thus fully utilizing GPU computation resources. In FastCoT, the parallel decoding part provides the LLM with a quick glance of the future composed of approximate tokens, which could lead to faster answers compared to regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within LLM, which supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size exhibits considerable robustness for different tasks.

6/5/2024

LoCoCo: Dropping In Convolutions for Long Context Compression

Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

This paper tackles the memory hurdle of processing long context sequences in Large Language Models (LLMs), by presenting a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages. Diverging from prior methods that selectively drop KV pairs based on heuristics, LoCoCo leverages a data-driven adaptive fusion technique, blending previous KV pairs with incoming tokens to minimize the loss of contextual information and ensure accurate attention modeling. This token integration is achieved through injecting one-dimensional convolutional kernels that dynamically calculate mixing weights for each KV cache slot. Designed for broad compatibility with existing LLM frameworks, LoCoCo allows for straightforward drop-in integration without needing architectural modifications, while incurring minimal tuning overhead. Experiments demonstrate that LoCoCo maintains consistently outstanding performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning phases. During inference, we successfully compressed up to 3482 tokens into a 128-size KV cache, while retaining comparable performance to the full sequence - an accuracy improvement of up to 0.2791 compared to baselines at the same cache size. During post-training tuning, we also effectively extended the context length from 4K to 32K using a KV cache of fixed size 512, achieving performance similar to fine-tuning with entire sequences.

6/11/2024

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.

5/22/2024