Accelerating Speculative Decoding using Dynamic Speculation Length

Read original: arXiv:2405.04304 - Published 6/26/2024 by Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz


Overview

This paper introduces DISCO (DynamIc SpeCulation lookahead Optimization), a method that speeds up speculative decoding for large language models.
Speculative decoding reduces the inference latency of a large language model by letting a smaller, faster draft model propose several tokens, which the target model then verifies in parallel.
The key insight of this paper is that using a fixed speculation lookahead (SL), the number of tokens generated by the draft model at each iteration, is suboptimal, and that selecting the SL dynamically performs better.

Plain English Explanation

Speculative decoding is a technique used to speed up text generation with large language models. These models produce output one token at a time, which is slow, so speculative decoding lets a smaller, faster draft model guess the next few tokens, and the large target model then checks all of those guesses in a single pass.

The effectiveness of this technique depends on the "speculation lookahead" (SL): how many tokens (word pieces) the draft model generates before the target model checks them. The common practice is to use the same SL at every step, but the authors of this paper found that this is not optimal. They developed a new method called DISCO that uses a classifier to adjust the SL dynamically at each step, based on the current context.

This dynamic approach produces exactly the same text as before while running faster: the paper reports an average speedup gain of 10.3% over the best static-SL baseline.
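
To make the draft-then-verify idea concrete, here is a minimal sketch of speculative decoding with a fixed SL. Everything in it is an assumption made for illustration: `target_next` and `draft_next` are toy stand-ins for real models, and verification uses simple greedy exact-match checking rather than the sampling-based acceptance used in practice. It is not the paper's implementation.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)

def target_next(ctx: List[int]) -> int:
    # Toy "target model": a deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_next(ctx: List[int]) -> int:
    # Toy "draft model": agrees with the target except when the
    # context length is a multiple of 4, where it guesses wrong.
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % VOCAB

def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    # Plain decoding with the target model only: one call per token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt: List[int], n_new: int, sl: int) -> Tuple[List[int], int]:
    # Returns the decoded sequence and the number of verification passes.
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1) Draft `sl` tokens one after another with the cheap model.
        draft: List[int] = []
        for _ in range(sl):
            draft.append(draft_next(out + draft))
        # 2) Verify: conceptually one batched pass of the target model.
        target_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one more token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], target_passes

prompt = [1, 2, 3]
reference = greedy_decode(prompt, 20)
for sl in (1, 3, 8):
    text, passes = speculative_decode(prompt, 20, sl)
    print(sl, text == reference, passes)
```

In this toy setup the output is identical for every SL; only the number of target passes changes, which is why the SL is purely a speed knob.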

Technical Explanation

The key innovation in this paper is the DISCO (DynamIc SpeCulation lookahead Optimization) method, which dynamically selects the speculation lookahead (SL) at each iteration of the speculative decoding process.

In standard speculative decoding, the SL is kept constant throughout the decoding process (static SL). The authors show that this is suboptimal, and that adjusting the SL based on the current context leads to better performance.
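
One way to see why a single static SL is unlikely to be best everywhere is a simple cost model. The sketch below is not taken from this paper; it uses the standard simplifying assumption from the speculative decoding literature that each draft token is accepted independently with probability `alpha`, and that one draft step costs a fraction `c` of one target pass. The names `alpha`, `c`, and `max_sl` are illustrative.

```python
def expected_tokens(alpha: float, sl: int) -> float:
    # Expected tokens produced per iteration when each of the `sl`
    # draft tokens is accepted independently with probability alpha
    # (geometric series 1 + alpha + ... + alpha**sl).
    if alpha == 1.0:
        return float(sl + 1)
    return (1 - alpha ** (sl + 1)) / (1 - alpha)

def speedup(alpha: float, sl: int, c: float) -> float:
    # Tokens per unit of cost, relative to plain decoding
    # (plain decoding yields 1 token per target pass).
    cost_per_iteration = sl * c + 1  # sl draft steps + 1 target pass
    return expected_tokens(alpha, sl) / cost_per_iteration

def best_static_sl(alpha: float, c: float, max_sl: int = 20) -> int:
    # The SL in [0, max_sl] with the highest modeled speedup.
    return max(range(max_sl + 1), key=lambda k: speedup(alpha, k, c))

for alpha in (0.5, 0.7, 0.9):
    k = best_static_sl(alpha, c=0.1)
    print(f"alpha={alpha}: best SL={k}, speedup={speedup(alpha, k, 0.1):.2f}x")
```

Under this model the best SL grows with the acceptance rate. Since how easy the next tokens are to predict varies within a single generation, an SL tuned for the average case will be too long in hard stretches and too short in easy ones.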

DISCO uses a classifier to predict the optimal SL for each iteration. This classifier is trained on data from the target language model and learns to predict the SL that decodes fastest. Output quality is not traded away: the target model still verifies every drafted token, so the SL choice changes only the speed, not the generated text.
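
As a rough illustration of what "choosing the SL per iteration" can look like, the sketch below drafts tokens until a stopping rule says to hand over to the target model. The stopping rule here is a plain confidence threshold standing in for a trained classifier; the feature it uses, the threshold value, and all function names are assumptions for illustration, not details confirmed by this summary.

```python
from typing import Callable, List, Tuple

# A draft step returns (next_token, confidence in that token).
DraftStep = Callable[[List[int]], Tuple[int, float]]
# A stopping rule sees the confidence and how many tokens were drafted.
KeepDrafting = Callable[[float, int], bool]

def draft_dynamic(ctx: List[int], draft_step: DraftStep,
                  keep_drafting: KeepDrafting, max_sl: int) -> List[int]:
    # Draft up to max_sl tokens, asking the rule after each one
    # whether to continue or stop and verify.
    draft: List[int] = []
    while len(draft) < max_sl:
        tok, conf = draft_step(ctx + draft)
        draft.append(tok)
        if not keep_drafting(conf, len(draft)):
            break
    return draft

# Toy draft model with scripted confidences (hypothetical values).
CONF = [0.9, 0.8, 0.3, 0.95, 0.9]

def toy_draft_step(ctx: List[int]) -> Tuple[int, float]:
    i = len(ctx)
    return i, CONF[i % len(CONF)]

def threshold_rule(conf: float, n_drafted: int, threshold: float = 0.5) -> bool:
    # Stand-in for a trained classifier: keep going while confident.
    return conf >= threshold

print(draft_dynamic([], toy_draft_step, threshold_rule, max_sl=5))
```

The drafted tokens would then go through the usual verification step, so a poor stopping rule can cost time but cannot change the output.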

The authors evaluate DISCO on four datasets and report an average speedup gain of 10.3% over the best static-SL baseline, while generating the exact same text.

Critical Analysis

The DISCO method improves on static-SL speculative decoding, but there are a few potential limitations and areas for further research:

The paper does not explore the impact of the classifier architecture or training process on the final performance. It's possible that more advanced classifier designs or training techniques could further improve the results.
The experiments were conducted on a limited set of language model benchmarks. It would be valuable to see how DISCO performs on a wider range of models and tasks, including more specialized or domain-specific applications.
The paper does not provide a detailed analysis of the computational overhead introduced by the classifier. In a real-world deployment, this overhead would need to be carefully managed to ensure the overall performance gains are realized.

Despite these potential areas for improvement, the DISCO method is a useful step forward for speculative decoding, and the authors have made a valuable contribution to ongoing efforts to accelerate inference with large language models.

Conclusion

This paper introduces a speculative decoding method called DISCO, which dynamically selects the speculation lookahead at each iteration of the decoding process. By using a classifier to predict the optimal SL, DISCO achieves an average speedup gain of about 10% over the best static-SL baseline while generating the exact same text.

The DISCO method represents an important advancement in the field of speculative decoding for large language models, and its success suggests that further research into dynamic, context-aware techniques for improving the efficiency of large language models could yield promising results.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Accelerating Speculative Decoding using Dynamic Speculation Length

Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.

6/26/2024

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Kaixuan Huang, Xudong Guo, Mengdi Wang

Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.

6/24/2024

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement -- under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, no single speculation length works best for all workloads under different system loads. Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- hence the associated speculative execution costs -- based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.

6/27/2024

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this limitation, this paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences. This allows parallelization of both the drafting and verification steps, providing significant speed-ups to the inference process. Our proposed approach, Speculative Diffusion Decoding (SpecDiff), is validated on standard language generation benchmarks and empirically demonstrated to provide up to an 8.7x speed-up over standard generation processes and up to a 2.5x speed-up over existing speculative decoding approaches.

8/20/2024