Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

2402.11809

Published 4/17/2024 by Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao

Abstract

This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x-4.0x on HumanEval-X while maintaining output quality.

Overview

This research paper proposes an approach called SPACE (Smart Parallel Auto-Correct Decoding) to accelerate the inference speed of large language models (LLMs) with billions of parameters.
SPACE integrates semi-autoregressive inference and speculative decoding capabilities to enable autoregressive LLMs to parallelize token generation and verification.
Through extensive experiments, SPACE has demonstrated inference speedup ranging from 2.7x to 4.0x on the HumanEval-X benchmark while maintaining output quality.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become incredibly powerful, but they can be slow to generate text. This research presents a new technique called SPACE that can significantly speed up the process.

The key idea is to have the model predict multiple tokens at once, rather than one-by-one. This allows it to generate text in parallel, rather than sequentially. To make this work, the researchers had to train the model in a special way to give it the ability to predict multiple tokens simultaneously.
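To make "predict multiple tokens at once" concrete, here is a minimal sketch of what multi-token training targets could look like. It is an illustration only: the helper name `multi_token_targets` and the choice to pair each prefix with its next `k` tokens are assumptions made for this example, not the paper's actual fine-tuning recipe.

```python
from typing import List, Tuple


def multi_token_targets(tokens: List[int], k: int) -> List[Tuple[List[int], List[int]]]:
    """Pair each prefix with the next k tokens as a joint prediction target.

    Hypothetical helper for illustration; not the paper's data pipeline.
    """
    examples = []
    # Stop early enough that every prefix still has a full k-token target.
    for i in range(1, len(tokens) - k + 1):
        prefix = tokens[:i]
        target = tokens[i:i + k]
        examples.append((prefix, target))
    return examples


# Ordinary next-token training corresponds to k=1; here each prefix gets 2 targets.
for prefix, target in multi_token_targets([10, 11, 12, 13, 14], k=2):
    print(prefix, "->", target)
```

With `k=1` this reduces to ordinary next-token prediction, which is why such training can be seen as an extension of the usual objective rather than a replacement for it.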

They also developed an "auto-correct" algorithm that can check the generated tokens and fix any mistakes. This means the model doesn't have to wait to verify each token before moving on to the next one.

Through extensive testing, the researchers found that SPACE can speed up the inference process by 2.7 to 4 times, while still producing high-quality text. This could make large language models much more practical for real-world applications that require fast text generation.

Technical Explanation

The researchers propose SPACE, an innovative approach to accelerate the inference of large language models (LLMs). SPACE integrates two key capabilities:

Semi-Autoregressive Inference: SPACE enables LLMs to predict multiple tokens simultaneously, breaking free from the traditional one-token-at-a-time autoregressive generation. This is achieved through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to predict multiple tokens in parallel.
Speculative Decoding: SPACE introduces an auto-correct decoding algorithm that facilitates the simultaneous generation and verification of token sequences within a single model invocation. This allows the model to speculatively generate multiple token sequences and rapidly identify the most likely one, avoiding the need for sequential token-by-token verification.
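The interplay of the two capabilities can be sketched with a toy draft-and-verify loop under greedy decoding. Everything here is an assumption made for illustration: `toy_next` stands in for the model's greedy next-token choice, `toy_draft` stands in for its multi-token guesses (deliberately wrong at some positions), and drafting is split into a separate function for readability, whereas the text above describes generation and verification sharing a single model invocation.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]
DraftFn = Callable[[List[int], int], List[int]]


def toy_next(ctx: List[int]) -> int:
    """Toy deterministic 'model': greedy next token for a context."""
    return (ctx[-1] * 3 + len(ctx)) % 7


def toy_draft(ctx: List[int], k: int) -> List[int]:
    """Toy multi-token guess: mostly right, wrong at every 4th position."""
    tmp, guesses = list(ctx), []
    for _ in range(k):
        t = toy_next(tmp)
        if len(tmp) % 4 == 0:
            t = (t + 1) % 7  # inject a wrong guess
        guesses.append(t)
        tmp.append(t)
    return guesses


def greedy_decode(next_token: NextFn, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def draft_and_verify(next_token: NextFn, draft: DraftFn, prompt: List[int],
                     n_new: int, k: int) -> Tuple[List[int], int]:
    """Accept the longest correct prefix of each draft, then correct the miss."""
    out, calls = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        guesses = draft(out, k)
        calls += 1  # counts one simulated parallel verification pass
        ctx = list(out)
        # A real model would score all positions in one pass; we emulate it.
        for g in guesses:
            correct = next_token(ctx)
            ctx.append(correct)  # always keep the model's own token
            if g != correct:
                break  # first mismatch is corrected; later guesses are dropped
        else:
            ctx.append(next_token(ctx))  # all guesses matched: one bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n_new], calls


prompt = [1, 2, 3]
baseline = greedy_decode(toy_next, prompt, 20)
tokens, calls = draft_and_verify(toy_next, toy_draft, prompt, 20, k=4)
print("identical output:", tokens == baseline)
print("model calls:", calls, "vs", len(baseline))
```

Because the loop only ever keeps tokens the model itself would have produced, the output matches plain greedy decoding, and the number of calls drops whenever at least some guesses are accepted. That is the sense in which this family of methods is described as lossless.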

Through extensive experiments on a range of LLMs, the researchers demonstrate that SPACE can achieve inference speedup ranging from 2.7x to 4.0x on the HumanEval-X benchmark, while maintaining output quality comparable to the original autoregressive models.
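As a rough way to read speedup figures like these, the following back-of-the-envelope sketch relates speedup to two quantities: the average number of tokens accepted per model call, and how much more expensive a parallel call is than a plain one. The function name, the simple ratio model, and the example numbers are all assumptions for illustration; they are not measurements from the paper.

```python
def estimated_speedup(tokens_per_call: float, call_cost_ratio: float = 1.0) -> float:
    """Estimate speedup over one-token-per-call decoding.

    tokens_per_call: average tokens accepted per model invocation (>= 1).
    call_cost_ratio: latency of one parallel call relative to a plain call.
    Hypothetical simplified model: ignores fine-tuning cost and batching effects.
    """
    if tokens_per_call < 1.0:
        raise ValueError("each call yields at least one token")
    if call_cost_ratio <= 0.0:
        raise ValueError("call_cost_ratio must be positive")
    return tokens_per_call / call_cost_ratio


# Illustrative numbers only: 3.6 tokens per call, each call 20% slower.
print(round(estimated_speedup(3.6, 1.2), 2))
```

Under this simplified model, any extra per-call latency directly erodes the gain from accepting several tokens at once, which is why the acceptance rate and the per-call overhead both matter.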

Critical Analysis

The researchers acknowledge several limitations and areas for further research:

Generalization to other tasks: The headline speedups are reported on HumanEval-X, a code generation benchmark, so the effectiveness of SPACE may differ on other tasks, such as question answering or open-ended text generation.
Computational Overhead: While SPACE achieves significant inference speedup, the researchers mention that the semi-autoregressive fine-tuning process and the speculative decoding algorithm may introduce additional computational overhead during the training and deployment stages, which could be an important consideration for certain applications.
Potential Quality Degradation: The researchers report that SPACE maintains output quality compared to the original autoregressive models, but it would be valuable to further investigate potential quality degradation, especially for more complex or domain-specific language generation tasks.

Overall, the SPACE approach represents an exciting advancement in accelerating large language models, but there are still areas that merit further research and careful consideration before widespread adoption.

Conclusion

This research presents a novel technique called SPACE that can significantly speed up the inference of large language models while maintaining output quality. By integrating semi-autoregressive inference and speculative decoding, SPACE enables autoregressive LLMs to parallelize token generation and verification, leading to inference speedups of 2.7x to 4.0x on the HumanEval-X benchmark.

The implications of this work are potentially far-reaching, as it could make large language models more practical and accessible for a wide range of real-world applications that require fast text generation, from chatbots and virtual assistants to content creation and summarization tools. As the field of natural language processing continues to advance, innovations like SPACE will be crucial in unlocking the full potential of these powerful AI models.

Related Papers

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident when utilizing autoregressive decoding methods, which generate one token in a single forward process, thereby not fully capitalizing on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the pseudo hidden states of the future tokens to be generated, and then the pseudo hidden states will pass the following transformer layers, thereby assimilating more semantic information and achieving superior predictive accuracy of the future tokens. Besides, we use the novel tree attention mechanism to simultaneously generate and verify multiple candidates of output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method. We conduct a lot of analytic experiments to prove our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.

4/19/2024

cs.CL

On Speculative Decoding for Multimodal Large Language Models

Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott

Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone which suffers from memory bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components from the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37× using a 115M parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.

4/16/2024

cs.CL cs.AI cs.LG

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Jie Ou, Yueming Chen, Wenhong Tian

While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.

4/16/2024

cs.CL cs.LG

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Kevin Slagle

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.

4/23/2024

cs.CL cs.AI cs.LG