You Only Cache Once: Decoder-Decoder Architectures for Language Models

Published 5/10/2024 by Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang and 2 more...

Overview

This paper introduces "You Only Cache Once" (YOCO), a decoder-decoder architecture for large language models that caches key/value (KV) pairs only once.
The key idea is to stack two modules: a self-decoder that processes the sequence and produces a single global KV cache, and a cross-decoder that reuses that cache through cross-attention instead of keeping a separate KV cache at every layer.
This design is intended to reduce KV cache memory and prefilling time, making long-context inference cheaper while remaining scalable in training tokens, model size, and context length.

Decoder-decoder architecture reduces cache memory and prefilling time.

Original caption: Figure 1: We propose a decoder-decoder architecture, YOCO, for large language model, which only caches key/value once. YOCO markedly reduces the KV cache memory and the prefilling time, while being scalable in terms of training tokens, model size, and context length. The inference cost is reported to be 512K as the context length, and Figures 6, 7, 8 and 9 present more results for different lengths.

Plain English Explanation

The paper presents a new way to design language models, which are artificial intelligence systems that can understand and generate human language. Most current large language models are decoder-only Transformers. To generate text efficiently, they store ("cache") intermediate key/value results for every token at every layer, and this cache grows quickly as the input gets longer.

The researchers behind this paper have come up with a different approach called "You Only Cache Once" (YOCO). The key idea is to split the model into two halves. The first half (the self-decoder) reads the text and produces one shared cache. The second half (the cross-decoder) reuses that shared cache instead of building a new one at each of its layers. This reduces the memory the model needs at inference time and shortens the time spent processing the prompt before generation starts.

Because the cache is built once and shared, YOCO avoids storing many per-layer copies of it, which is a significant memory bottleneck for standard Transformers at long context lengths. The approach is particularly useful when the model has to read very long inputs: the paper reports inference costs at a context length of 512K tokens and extends a 3B model to a 1M-token context.

Technical Explanation

YOCO is a decoder-decoder model: from the outside it behaves like a decoder-only Transformer, and it relates to other decoder-only work such as Towards Smaller, Faster Decoder-Only Transformers and Decoder-Only Foundation Model for Time Series Forecasting. The key innovation of YOCO is that the key/value cache is computed once and shared, rather than being kept separately by every layer.

The YOCO model consists of two stacked modules: a self-decoder and a cross-decoder. The self-decoder processes the input tokens with efficient self-attention, and its output is projected into a single set of global key/value caches. The cross-decoder layers then attend to those shared caches through cross-attention, so they do not keep KV caches of their own. At inference time, the prompt is prefilled in parallel and output tokens are generated one by one. Because the cross-decoder only reads the shared cache, prefilling can exit early without changing the final output, which speeds up the prefill stage.
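The cache-once data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ToyYOCO`, the layer counts, and the weight names are hypothetical, attention is single-head, and normalization and feed-forward blocks are omitted. For brevity the toy self-decoder uses plain causal attention rather than an efficient attention variant.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention; q is (Tq, d), k and v are (Tk, d).
    The queries are aligned with the last Tq positions of the keys."""
    tq, tk = q.shape[0], k.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask future positions: query i may see keys 0 .. (tk - tq + i).
    future = np.triu(np.ones((tq, tk), dtype=bool), k=tk - tq + 1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

class ToyYOCO:
    """Hypothetical toy model: self-decoder layers, one shared KV cache,
    then cross-decoder layers that only hold query projections."""

    def __init__(self, d=16, n_self=2, n_cross=2, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda: rng.normal(scale=d ** -0.5, size=(d, d))
        self.self_layers = [
            {"q": init(), "k": init(), "v": init()} for _ in range(n_self)
        ]
        # One pair of projections produces the global KV cache.
        self.w_k, self.w_v = init(), init()
        # Cross-decoder layers have no key/value projections of their own.
        self.cross_q = [init() for _ in range(n_cross)]

    def forward(self, x):
        h = x
        # Self-decoder: ordinary causal self-attention with residuals.
        for w in self.self_layers:
            h = h + causal_attention(h @ w["q"], h @ w["k"], h @ w["v"])
        # Cache once: keys and values are computed a single time.
        k_cache, v_cache = h @ self.w_k, h @ self.w_v
        # Cross-decoder: every layer attends to the same shared cache.
        for w_q in self.cross_q:
            h = h + causal_attention(h @ w_q, k_cache, v_cache)
        return h, (k_cache, v_cache)

model = ToyYOCO()
x = np.random.default_rng(1).normal(size=(5, 16))
out, (k_cache, v_cache) = model.forward(x)
print(out.shape, k_cache.shape, v_cache.shape)
```

However many cross-decoder layers are stacked in this sketch, only one `(k_cache, v_cache)` pair exists, which is the property the name "You Only Cache Once" refers to.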

The researchers evaluate YOCO on standard language modeling benchmarks through the Eval Harness (ARC-C, ARC-E, BoolQ, Hellaswag, OBQA, PIQA, Winogrande, and SciQ) and study how it scales with training tokens, model size, and context length. Their reported results show YOCO-3B reaching accuracy comparable to or better than Transformer language models of similar size, such as OpenLLaMA-3B-v2 and StableLM-3B-4E1T, while reducing KV cache memory and prefilling time.

Critical Analysis

The YOCO architecture represents an interesting approach to improving the efficiency of language models, and the researchers' results suggest that it can be a viable alternative to standard decoder-only Transformers. However, there are a few potential limitations and areas for further research that could be explored:

The caching mechanism introduced in YOCO may not be effective for all types of language modeling tasks, particularly those that require more complex interactions between the input and output sequences. LLOCO: Learning Long Contexts Offline and SNAP-KV: LLM Knows What You Are Looking have explored alternative approaches to handling long-range dependencies in language models.
The YOCO architecture may also be less effective for tasks that require more dynamic or flexible processing of the input, such as MO-YOLO: End-to-End Multiple Object detection or other multi-task learning scenarios.
It would be interesting to see how the YOCO architecture compares to other decoder-only models, such as the ones explored in Towards Smaller, Faster Decoder-Only Transformers, in terms of performance, efficiency, and scalability.

Overall, the YOCO architecture represents a promising approach to improving the efficiency of language models, and the researchers' work provides valuable insights into the design and optimization of these important AI systems.

Conclusion

The YOCO architecture introduced in this paper represents a novel approach to improving the efficiency of language models. By computing the key/value cache once in the self-decoder and reusing it in every cross-decoder layer, YOCO can reduce KV cache memory and prefilling time, making language models more scalable and practical for long-context, real-world applications.

While the YOCO architecture may not be a perfect solution for all language modeling tasks, the researchers' work highlights the importance of continued innovation in this field. As AI systems become more powerful and ubiquitous, it is crucial that we find ways to make them more efficient and sustainable, without sacrificing their performance or capabilities. The YOCO architecture is a promising step in this direction, and it will be interesting to see how it evolves and is applied in future language modeling research and applications.

Inference method using parallel prefill and sequential generation. Early exit possible without output change.

Model	KV Cache Memory
Transformer	O(LND)
YOCO	O((N+L)D)

N, L, and D denote the sequence length, number of layers, and hidden dimension.
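To get a feel for what the YOCO entry O((N+L)D) means in practice, the short calculation below compares it with a standard decoder-only Transformer, assuming the latter stores keys and values at each of its L layers, i.e. O(LND). The function names and the model shape (32 layers, hidden size 4096) are hypothetical values chosen for illustration; only the 512K context length comes from the figure caption. The constant factors are illustrative, so only the ratio's order of magnitude is meaningful.

```python
def transformer_kv_elements(n_tokens, n_layers, d_model):
    """Keys and values kept at every layer: 2 * L * N * D elements."""
    return 2 * n_layers * n_tokens * d_model

def yoco_kv_elements(n_tokens, n_layers, d_model):
    """One shared key/value cache (2 * N * D) plus a small per-layer
    term (2 * L * D). Constants are illustrative assumptions; only the
    O((N + L) * D) growth is taken from the complexity table."""
    return 2 * (n_tokens + n_layers) * d_model

N = 512 * 1024  # context length of 512K tokens, as in the Figure 1 caption
L = 32          # hypothetical number of layers
D = 4096        # hypothetical hidden dimension

ratio = transformer_kv_elements(N, L, D) / yoco_kv_elements(N, L, D)
print(f"Transformer cache is about {ratio:.1f}x larger in this toy setting")
```

When N is much larger than L, the ratio approaches L: the saving grows with model depth, because the shared cache no longer scales with the number of layers.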

Original caption: Table 1: YOCO Inference. Prefill: encode input tokens in parallel. Generation: decode output tokens one by one. The computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage.
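The early-exit property in this caption can be checked on a toy model. The sketch below is an illustration under assumptions, not the paper's code: the weight names in `W`, the single self-decoder layer, and the two cross-decoder layers are hypothetical, and normalization and feed-forward blocks are left out. It compares a prefill that runs the cross-decoder on every prompt token with one that runs it on the last token only.

```python
import numpy as np

def attend(q, k, v):
    """Causal single-head attention; queries align with the last rows of k."""
    tq, tk = q.shape[0], k.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((tq, tk), dtype=bool), k=tk - tq + 1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
D = 8
names = ["sq", "sk", "sv", "ck", "cv", "cq1", "cq2"]  # hypothetical weights
W = {n: rng.normal(scale=D ** -0.5, size=(D, D)) for n in names}

def self_decoder(x):
    # One causal self-attention layer over the whole prompt.
    return x + attend(x @ W["sq"], x @ W["sk"], x @ W["sv"])

def cross_decoder(h, k_cache, v_cache):
    # Each layer reads the shared cache; positions never read each other's
    # cross-decoder states, only the cache.
    for name in ("cq1", "cq2"):
        h = h + attend(h @ W[name], k_cache, v_cache)
    return h

def prefill_full(x):
    h = self_decoder(x)
    k, v = h @ W["ck"], h @ W["cv"]
    return cross_decoder(h, k, v)[-1], (k, v)

def prefill_early_exit(x):
    h = self_decoder(x)
    k, v = h @ W["ck"], h @ W["cv"]
    # Exit early: only the last prompt token goes through the cross-decoder.
    return cross_decoder(h[-1:], k, v)[0], (k, v)

prompt = rng.normal(size=(6, D))
full_out, full_cache = prefill_full(prompt)
fast_out, fast_cache = prefill_early_exit(prompt)
print(np.allclose(full_out, fast_out))
```

The two prefill paths agree on the last-token output and on the cache because a cross-decoder position depends only on its own hidden state and the shared cache, so skipping the other prompt positions loses nothing that generation needs.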

Model	ARC-C	ARC-E	BoolQ	Hellaswag	OBQA	PIQA	Winogrande	SciQ	Avg
Training with 1T tokens
OpenLLaMA-3B-v2	0.339	0.676	0.657	0.700	0.260	0.767	0.629	0.924	0.619
StableLM-base-alpha-3B-v2	0.324	0.673	0.646	0.686	0.264	0.760	0.621	0.921	0.612
StableLM-3B-4E1T	—	0.666	—	—	—	0.768	0.632	0.914	—
YOCO-3B	0.379	0.731	0.645	0.689	0.298	0.763	0.639	0.924	0.634
Training with 1.6T tokens
StableLM-3B-4E1T	—	0.688	—	—	—	0.762	0.627	0.913	—
YOCO-3B	0.396	0.733	0.644	0.698	0.300	0.764	0.631	0.921	0.636
Extending context length to 1M tokens
YOCO-3B-1M	0.413	0.747	0.638	0.705	0.300	0.773	0.651	0.932	0.645

Original caption: Table 4: Eval Harness [13] results compared with previous well-trained Transformer language models [32, 35, 12]. We scale the 3B model to 1.6 trillion training tokens. The 1T and 1.6T results of StableLM-3B-4E1T are taken from its technical report [32]. YOCO-3B-1M is extended to the context length of 1M tokens.

Full paper

Read original: arXiv:2405.05254
