Distributed Speculative Inference of Large Language Models

Read original: arXiv:2405.14105 - Published 9/10/2024 by Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

Overview

This paper introduces Distributed Speculative Inference (DSI), a distributed inference algorithm that accelerates the inference of large language models (LLMs).
DSI is provably faster than both traditional autoregressive inference (non-SI) and Speculative Inference (SI), an earlier acceleration technique.
DSI works on frozen LLMs and preserves the target distribution, without requiring any training or architectural changes.

Plain English Explanation

The paper presents a new way to speed up the process of using large language models (LLMs) to generate text, which is an important challenge in artificial intelligence. The proposed method, called Distributed Speculative Inference (DSI), works by orchestrating multiple instances of the target LLM and some supporting "drafter" LLMs to produce text faster than previous approaches.

Unlike some prior techniques, DSI does not require any changes to the LLM itself or additional training. It can be used with existing, "frozen" LLMs. DSI also ensures that the output text still matches the original target distribution, meaning the quality and characteristics of the generated text are preserved.

Prior research on a related technique called Speculative Inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] has shown it can speed up inference compared to traditional methods. However, SI relies on having a fast and accurate "drafter" LLM to work well. In practice, off-the-shelf LLMs often don't have a matching drafter that is sufficiently fast and accurate.

This paper shows that when the drafter LLM is slower or less accurate, SI can actually become slower than the traditional, non-speculative approach. The key contribution of this work is proving that DSI is faster than both SI and the non-speculative approach, regardless of the quality of the drafter LLM.

Technical Explanation

The paper introduces a novel distributed inference algorithm called Distributed Speculative Inference (DSI). Like other Speculative Inference (SI) algorithms, DSI works on frozen LLMs and preserves the target distribution without requiring any training or architectural modifications.

Prior studies on SI [leviathan2023fast, chen2023accelerating, miao2023specinfer] have shown empirical speedups compared to traditional, non-speculative inference. However, these SI approaches require a fast and accurate "drafter" LLM to work well. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate.

The paper identifies a gap where SI can actually become slower than non-SI when using slower or less accurate drafters. To address this, the authors prove that DSI is faster than both SI and non-SI, given any drafters.
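
The gap described above can be illustrated with the standard expected-speedup formula for speculative inference from Leviathan et al. [leviathan2023fast]. The following is a minimal sketch, assuming each drafted token is accepted independently with a fixed probability and that verifying one batch of drafts costs a single target forward pass. The function names and example numbers are hypothetical and are not taken from the DSI paper.

```python
def expected_tokens_per_iteration(acceptance_rate: float, lookahead: int) -> float:
    """Expected number of tokens produced by one draft-then-verify iteration.

    Assumes (hypothetically) that each drafted token is accepted independently
    with probability `acceptance_rate`; `lookahead` is the number of drafted tokens.
    """
    if acceptance_rate >= 1.0:
        # Every draft is accepted, plus one token from the target itself.
        return lookahead + 1.0
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


def si_speedup(acceptance_rate: float, drafter_cost: float, lookahead: int) -> float:
    """Expected speedup of SI over non-SI (values below 1.0 mean a slowdown).

    `drafter_cost` is the drafter's forward-pass latency as a fraction of the
    target's forward-pass latency.
    """
    # One iteration = `lookahead` drafter passes followed by one target pass.
    iteration_cost = lookahead * drafter_cost + 1.0
    return expected_tokens_per_iteration(acceptance_rate, lookahead) / iteration_cost


if __name__ == "__main__":
    # Illustrative settings only: a fast, accurate drafter vs. a slow, inaccurate one.
    print("fast/accurate drafter:", si_speedup(0.8, 0.05, 5))
    print("slow/inaccurate drafter:", si_speedup(0.3, 0.4, 5))
```

Under these assumptions, the first setting yields a speedup above 1, while the second yields a value below 1, meaning SI would be slower than plain autoregressive decoding. This is the regime that DSI is designed to eliminate.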

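DSI introduces a type of task parallelism called Speculation Parallelism (SP), which orchestrates multiple instances of the target LLM and the drafter LLMs so that their computations overlap in time, trading additional computational resources for lower latency. As a result, DSI not only outperforms SI but also supports LLMs that cannot be accelerated with SI.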

The paper's simulations show that, in realistic single-node settings with off-the-shelf LLMs, DSI is 1.29-1.92x faster than SI.
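
To put these figures in latency terms, a speedup factor s over SI corresponds to a relative latency reduction of 1 - 1/s. The worked arithmetic below is only a restatement of the reported range, not an additional result from the paper; the symbols s, T_SI, and T_DSI are notation introduced here for illustration.

```latex
\[
\frac{T_{\mathrm{SI}} - T_{\mathrm{DSI}}}{T_{\mathrm{SI}}} = 1 - \frac{1}{s},
\qquad
s = 1.29 \;\Rightarrow\; 1 - \tfrac{1}{1.29} \approx 22.5\%,
\qquad
s = 1.92 \;\Rightarrow\; 1 - \tfrac{1}{1.92} \approx 47.9\%.
\]
```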

Critical Analysis

The paper pairs a theoretical analysis with a simulation-based evaluation of the proposed DSI algorithm. The authors clearly identify the limitations of existing Speculative Inference (SI) approaches and address them with DSI.

One potential area for further research could be investigating the trade-offs between the computational overhead of orchestrating multiple LLM instances in DSI versus the overall performance gains. The paper mentions that DSI supports LLMs that cannot be accelerated with SI, but does not provide details on the specific requirements or limitations of each approach.

Additionally, the paper focuses on the acceleration of inference, but does not discuss the implications for model training or the potential impact on model performance, robustness, or safety. Further research could explore these broader considerations when deploying DSI or other speculative inference techniques in real-world applications.

Conclusion

This paper presents a new distributed inference algorithm called Distributed Speculative Inference (DSI) that can significantly accelerate the inference of large language models (LLMs) compared to both traditional autoregressive inference and the previously proposed Speculative Inference (SI) approach.

By orchestrating multiple instances of the target LLM and supporting "drafter" LLMs, DSI is able to outperform SI even when the drafters are slower or less accurate. This addresses a key limitation of prior SI techniques and expands the applicability of speculative inference methods to a wider range of LLMs.

The simulated speedups of 1.29-1.92x over SI in realistic single-node settings suggest that DSI could have a meaningful impact on the practical deployment of large language models, potentially enabling faster inference without compromising the quality of the generated text.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv:2405.14105) to verify details.

Related Papers

Distributed Speculative Inference of Large Language Models

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces Distributed Speculative Inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI given slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI, given any drafters. DSI introduces a novel type of task parallelism called Speculation Parallelism (SP), which orchestrates target and drafter instances to overlap in time, creating a new foundational tradeoff between computational resources and latency. DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic single-node settings where DSI is 1.29-1.92x faster than SI.

9/10/2024

De-DSI: Decentralised Differentiable Search Index

Petru Neague, Marcel Gregoriadis, Johan Pouwelse

This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach not only maintains accuracy by reducing the amount of data each model needs to handle but also facilitates scalability by aggregating outcomes from multiple models. This aggregation uses a beam search to identify top docids and applies a softmax function for score normalization, selecting documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit of the possibility of distributing computational complexity across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.

4/22/2024

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this limitation, this paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences. This allows parallelization of both the drafting and verification steps, providing significant speed-ups to the inference process. Our proposed approach, Speculative Diffusion Decoding (SpecDiff), is validated on standard language generation benchmarks and empirically demonstrated to provide up to an 8.7x speed-up over standard generation processes and up to a 2.5x speed-up over existing speculative decoding approaches.

8/20/2024

Accelerating Speculative Decoding using Dynamic Speculation Length

Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.

6/26/2024