BiTA: A Groundbreaking Approach to Speed Up Language Model Inference

Transformers and Their Importance

In recent years, large language models (LLMs) based on transformer architectures have grown rapidly in size, with parameter counts ranging from several billion to tens of trillions. Although LLMs are excellent generators, they suffer from high inference latency because of the computational load that all those parameters impose. As a result, researchers are working to speed up inference, especially in resource-constrained settings and in real-time applications such as chatbots.

Challenges in the Current LLM Models

Most decoder-only LLMs generate text token by token. Because generation is autoregressive (AR), each new token requires its own inference pass, so producing a long output takes many transformer calls. These calls tend to be limited by memory bandwidth, which reduces computational efficiency and lengthens wall-clock time.
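To make the cost concrete, here is a minimal Python sketch of a plain autoregressive loop. It is an illustration only: `toy_next_token` is a hypothetical stand-in for a full transformer forward pass, and the call counter simply shows that N new tokens cost N model calls.

```python
from typing import Callable, List, Tuple


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for one transformer forward pass (greedy pick)."""
    return (tokens[-1] + 1) % 100


def generate_autoregressive(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time and count how many model calls were made."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        # Each new token needs its own call over the sequence so far.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls


if __name__ == "__main__":
    output, calls = generate_autoregressive(toy_next_token, [1, 2, 3], 5)
    print(output, calls)
```

In a real model, every one of those calls has to read the model weights from memory, which is why the per-token loop becomes the bottleneck.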

Introducing Bi-directional Tuning for Lossless Acceleration (BiTA)

Researchers at Intellifusion Inc. and Harbin Institute of Technology have developed an acceleration approach called Bi-directional Tuning for lossless Acceleration (BiTA), which speeds up generation by learning only a small number of additional trainable parameters. Its two main components are the proposed bi-directional tuning and a streamlined verification of semi-autoregressive (SAR) draft candidates. With this approach, the reported speedups range from 2.1× to 3.3× across a variety of generation tasks with LLMs of different sizes.
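The general draft-then-verify idea behind this kind of lossless acceleration can be sketched in a few lines of Python. This is a generic greedy verification sketch under stated assumptions, not BiTA's actual implementation: `toy_next_token` and `verify_draft` are hypothetical names, and the per-position calls in the loop simulate what a real transformer would compute for all draft positions in a single forward pass.

```python
from typing import Callable, List


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for the model's greedy next-token prediction."""
    return (tokens[-1] + 1) % 100


def verify_draft(
    next_token: Callable[[List[int]], int],
    context: List[int],
    draft: List[int],
) -> List[int]:
    """Keep the longest draft prefix the model agrees with, plus one model token.

    The result matches what plain greedy token-by-token decoding would
    produce for the same positions, which is what "lossless" means here.
    """
    accepted: List[int] = []
    for candidate in draft:
        expected = next_token(context + accepted)
        if candidate != expected:
            # First mismatch: discard the rest of the draft and keep the
            # model's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(candidate)
    # The whole draft was accepted; the model's prediction after the draft
    # comes along as one extra token.
    accepted.append(next_token(context + accepted))
    return accepted


if __name__ == "__main__":
    # The third draft token is wrong, so only the first two are kept.
    print(verify_draft(toy_next_token, [1, 2, 3], [4, 5, 9, 10]))
```

When the draft is mostly right, several tokens are accepted per verification step instead of one, which is where the speedup in such schemes comes from.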
