Efficient Language Model Design: Tandem Transformers
Efficiency is a key concern for large language models (LLMs) because of their computational cost, and inference is a particular challenge: autoregressive generation produces one token per forward pass, so each step is dominated by memory-bound operations that leave ML accelerators underutilized. Prompt processing, by contrast, handles all input tokens in a single parallel pass that maps well onto accelerator hardware, which is why autoregressive answer generation is markedly less efficient than prompt processing.
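To make the contrast concrete, here is a minimal greedy decoding loop in PyTorch. It is purely illustrative: `model` is a stand-in for any decoder-only LM mapping token ids of shape (1, seq_len) to logits of shape (1, seq_len, vocab), and a real implementation would reuse a KV cache rather than re-running the full sequence each step.

```python
import torch

def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prompt processing: one parallel pass over all prompt tokens
    # (large matrix-matrix products, accelerator-friendly).
    ids = prompt_ids
    logits = model(ids)
    ids = torch.cat([ids, logits[:, -1:].argmax(dim=-1)], dim=-1)

    # Autoregressive generation: one token per forward pass
    # (low arithmetic intensity, memory-bandwidth bound -> high latency).
    for _ in range(max_new_tokens - 1):
        logits = model(ids)  # a real decoder would reuse a KV cache here
        ids = torch.cat([ids, logits[:, -1:].argmax(dim=-1)], dim=-1)
    return ids
```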
New Approach by Google Research and DeepMind
A new study by Google Research and DeepMind introduces Tandem Transformers, an architecture built on the observation that understanding the prompt (natural language understanding, NLU) may demand more model capacity than generating the response token by token (natural language generation, NLG). Tandem therefore splits the work between two models: a large model processes the sequence in blocks, while a small model generates tokens autoregressively and attends to the large model's richer representations. By allocating capacity to the NLU and NLG segments separately, Tandem Transformers offer a more efficient design without compromising accuracy.
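The sketch below shows one way such a small-model block could consume the large model's hidden states. The dimensions, layer layout, and the use of cross-attention over a learned projection are illustrative assumptions, not the paper's exact architecture; what it conveys is the key idea that the small model's layers see the large model's representations in addition to their own.

```python
import torch
import torch.nn as nn

class TandemDecoderBlock(nn.Module):
    """Minimal sketch of one small-model block in a Tandem-style setup.

    Assumptions (not from the paper's code): the small model injects the
    large model's hidden states via a learned projection followed by
    cross-attention, alongside ordinary causal self-attention.
    """
    def __init__(self, d_small: int, d_large: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_small, n_heads, batch_first=True)
        self.proj_large = nn.Linear(d_large, d_small)  # project large-model reps down
        self.cross_attn = nn.MultiheadAttention(d_small, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_small, 4 * d_small), nn.GELU(),
                                nn.Linear(4 * d_small, d_small))
        self.ln1 = nn.LayerNorm(d_small)
        self.ln2 = nn.LayerNorm(d_small)
        self.ln3 = nn.LayerNorm(d_small)

    def forward(self, x, large_reps, causal_mask=None):
        # Standard causal self-attention over the small model's own states.
        q = self.ln1(x)
        h, _ = self.self_attn(q, q, q, attn_mask=causal_mask)
        x = x + h
        # Attend to the (projected) representations of the large model,
        # which has already processed the prompt / preceding block.
        l = self.proj_large(large_reps)
        h, _ = self.cross_attn(self.ln2(x), l, l)
        x = x + h
        return x + self.ff(self.ln3(x))
```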
Benefits of Tandem Transformers
For applications where the output must match the large model's quality, Tandem is combined with the SPEED speculative-decoding framework: the small model drafts blocks of tokens, which the large model then verifies in parallel. Because the drafting model attends to the large model's representations, its drafts are accepted more often, cutting verification overhead and speeding up decoding.
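A minimal greedy variant of this draft-and-verify loop is sketched below, assuming the same hypothetical `small_model` / `large_model` callables as before. The fixed block length `gamma` and the greedy acceptance rule are simplifications of the full SPEED algorithm, but the structure is the same: cheap sequential drafting, then one parallel verification pass.

```python
import torch

def speed_generate(small_model, large_model, ids: torch.Tensor,
                   gamma: int, steps: int) -> torch.Tensor:
    """Illustrative greedy speculative decoding in the spirit of SPEED."""
    for _ in range(steps):
        # 1. Draft: the small model proposes `gamma` tokens autoregressively.
        draft = ids
        for _ in range(gamma):
            nxt = small_model(draft)[:, -1:].argmax(dim=-1)
            draft = torch.cat([draft, nxt], dim=-1)

        # 2. Verify: a single parallel pass of the large model scores
        #    every drafted position at once.
        start = ids.size(1)
        verified = large_model(draft)[:, start - 1:-1].argmax(dim=-1)
        proposed = draft[:, start:]

        # 3. Accept the longest matching prefix; on the first mismatch,
        #    keep the large model's token instead, so the output is
        #    identical to what the large model alone would produce.
        #    (Real implementations also emit a bonus token when all
        #    gamma drafts are accepted.)
        match = (proposed == verified).long().cumprod(dim=-1)
        n_ok = int(match.sum())
        accepted = torch.cat([proposed[:, :n_ok],
                              verified[:, n_ok:n_ok + 1]], dim=-1)
        ids = torch.cat([ids, accepted], dim=-1)
    return ids
```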
In experiments, Tandem + SPEED with distillation decoded substantially faster than the standalone large model across several datasets while preserving output quality, since the large model still verifies every token. Making the block length adaptive, so that the number of draft tokens per verification step tracks how often drafts are accepted, reduces latency further.
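One plausible way to adapt the block length is to track the acceptance rate of recent drafts and grow or shrink `gamma` accordingly; the thresholds and update rule below are an illustrative heuristic, not the paper's exact mechanism.

```python
def adapt_block_length(gamma: int, n_accepted: int,
                       lo: float = 0.5, hi: float = 0.9,
                       g_min: int = 1, g_max: int = 16) -> int:
    """Hypothetical heuristic for an adaptive SPEED block length:
    draft more tokens when most are accepted, fewer when many are
    rejected and drafting effort is wasted."""
    rate = n_accepted / gamma
    if rate > hi:
        return min(gamma + 1, g_max)  # drafts are cheap and mostly correct
    if rate < lo:
        return max(gamma - 1, g_min)  # too many rejected draft tokens
    return gamma
```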