How Self-Speculative Decoding Speeds Up Large Language Models
Large Language Models (LLMs) like GPT, PaLM, and LLaMA have found widespread use in real-world applications. These models can perform tasks such as text generation, translation, and natural language understanding. However, their high inference cost is a major concern, especially when low latency matters. The primary cause is the autoregressive decoding these models use: output tokens are generated one at a time, so producing a response requires one Transformer call per token. Each call is limited by memory bandwidth rather than by compute, which leaves the hardware underutilized and lengthens execution time.
The Solution: Self-Speculative Decoding
A recent study has introduced a method called self-speculative decoding, which speeds up LLM inference without requiring an auxiliary model. The approach aims to make inference faster while maintaining output quality. It uses a two-stage procedure consisting of drafting and verification.
The objective of the drafting stage is to generate draft tokens more quickly, even if they are slightly lower in quality compared to tokens generated using traditional autoregressive methods. To achieve this, certain intermediate layers in LLMs are bypassed during drafting. Although these layers refine the output, they consume significant time and resources during inference.
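To make the layer-bypassing idea concrete, here is a minimal sketch in plain Python. It is a toy, not the paper's implementation: the `make_layer` and `forward` helpers, the 8-layer stack, and the particular set of skipped indices are all hypothetical choices made for illustration. The sketch assumes each layer adds an update to the hidden state through a residual connection, so a bypassed layer simply passes the hidden state through unchanged.

```python
from typing import Callable, FrozenSet, List, Sequence

Layer = Callable[[List[float]], List[float]]


def make_layer(scale: float) -> Layer:
    """Hypothetical toy layer: returns a small update for each hidden value."""
    def layer(hidden: List[float]) -> List[float]:
        return [scale * h for h in hidden]
    return layer


def forward(
    layers: Sequence[Layer],
    hidden: List[float],
    skip: FrozenSet[int] = frozenset(),
) -> List[float]:
    """Run the layer stack in order, bypassing every index listed in `skip`."""
    for index, layer in enumerate(layers):
        if index in skip:
            continue  # bypassed: the hidden state passes through unchanged
        update = layer(hidden)
        # Residual connection: add the layer's update to the hidden state.
        hidden = [h + u for h, u in zip(hidden, update)]
    return hidden


# Toy 8-layer model; the skipped indices are an arbitrary illustrative choice.
layers = [make_layer(0.1) for _ in range(8)]
full_output = forward(layers, [1.0, 2.0])                            # all 8 layers
draft_output = forward(layers, [1.0, 2.0], skip=frozenset({2, 3, 5}))  # 5 layers
```

The draft pass runs fewer layers, so it is cheaper, and its output is close to but not identical to the full pass. That gap is why drafted tokens still need to be checked by the unaltered model.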
In the verification stage, the tokens produced during drafting are validated in a single forward pass through the original, unaltered LLM. This step ensures that the final output matches what the LLM would have produced with conventional autoregressive decoding. Thus, even though the drafting stage generates tokens faster, the quality of the final output is preserved.
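The draft-then-verify loop can be sketched with toy stand-ins for the two models. Everything named here is an assumption made for illustration: `full_next` plays the role of the unaltered LLM's greedy next-token choice, `draft_next` plays the faster draft that occasionally disagrees with it, and `k` is a hypothetical number of tokens drafted per round. A real Transformer would score all drafted positions in one batched forward pass; the toy loops over positions but counts the round as a single full pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def full_next(tokens: List[int]) -> int:
    """Toy stand-in for the full model's greedy next token."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def draft_next(tokens: List[int]) -> int:
    """Toy stand-in for the faster draft; wrong whenever len(tokens) % 4 == 0."""
    guess = full_next(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


def autoregressive(prompt: List[int], n_new: int, model: NextToken) -> List[int]:
    """Baseline decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def self_speculative(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft up to k tokens per round, then verify them with the full model."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    full_passes = 0
    while len(tokens) < target_len:
        # Drafting stage: propose tokens with the cheap model.
        k_eff = min(k, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k_eff):
            draft.append(draft_next(tokens + draft))
        # Verification stage: counted as one full-model pass per round.
        full_passes += 1
        for i in range(k_eff):
            expected = full_next(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the matching prefix, then the full model's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens, full_passes


baseline = autoregressive([1, 2], 12, full_next)
speculative, passes = self_speculative([1, 2], 12, k=4)
```

In this toy run, `speculative` is identical to `baseline`, while `passes` is smaller than the 12 full-model calls the baseline needs. The saving depends entirely on how often the draft agrees with the full model, and this sketch covers greedy decoding only.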
One of the main advantages of self-speculative decoding is that it does not require additional neural network training. Existing methods for faster inference often involve training auxiliary models or making significant changes to the LLM’s architecture, which can be challenging and resource-intensive. Self-speculative decoding, on the other hand, is a “plug-and-play” approach that can be added to existing LLMs without any additional training or model alterations.
Empirical results support the effectiveness of self-speculative decoding. Benchmarks with LLaMA-2 and its fine-tuned variants show that the method can decode up to 1.73 times faster than conventional autoregressive decoding. This substantially speeds up inference while maintaining output quality, which is crucial in situations where latency is a concern.
Self-speculative decoding speeds up how Large Language Models generate text. It does so through a two-step process of drafting and verification: certain layers are selectively bypassed during the drafting stage to generate tokens faster, and the full model checks those tokens during the verification stage to preserve output quality. The method accelerates LLM inference without adding extra memory burden or requiring neural network training.
Check out the Paper. All credit for this research goes to the researchers on this project.