Large language models (LLMs) have undergone significant improvements, creating new opportunities in various fields and inspiring a wave of interactive AI applications. One such application is ChatGPT, which allows people to have informal conversations with an AI agent to solve problems ranging from software engineering to language translation. ChatGPT has become incredibly popular due to its impressive capabilities. Many organizations, including Microsoft, Google, Meta, Databricks, Stanford, and UC Berkeley, have followed suit and released similar products.
LLM inference differs from inference for other deep neural network (DNN) models, like ResNet, because it has its own unique characteristics. Interactive AI applications built on LLMs require low job completion times (JCTs) to provide seamless user experiences. For example, when users interact with ChatGPT, they expect immediate responses. However, the infrastructure for LLM inference is under strain due to the size and number of LLMs, and businesses have had to invest in expensive clusters of accelerators to handle these workloads.
DNN inference jobs are usually predictable and deterministic: the model and hardware determine the execution time. In contrast, LLM inference follows an autoregressive pattern. Each iteration produces one output token, which is appended to the input for the next iteration, so the execution time depends on the output length, which is unknown in advance. Existing inference serving systems, like Clockwork and Shepherd, are designed for deterministic model inference tasks and struggle with the variable execution times of LLM inference.
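The autoregressive pattern can be illustrated with a toy decoding loop (the `next_token` function here is a stand-in for a real model's forward pass, not any actual LLM API):

```python
import random

EOS = 0  # hypothetical end-of-sequence token id

def next_token(tokens):
    # Placeholder for an LLM forward pass over the full sequence;
    # a real model's cost per step grows with len(tokens).
    return random.randint(0, 49151)

def generate(prompt_tokens, max_new_tokens=64):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one iteration = one output token
        tokens.append(tok)        # output token becomes part of the next input
        if tok == EOS:            # output length -- and hence latency -- is data-dependent
            break
    return tokens[len(prompt_tokens):]
```

Because the loop terminates only when the model emits an end-of-sequence token (or hits a cap), the total number of iterations, and therefore the job's execution time, cannot be known when the job is submitted.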
To address this challenge, researchers from Peking University developed a distributed inference serving solution called FastServe. FastServe uses iteration-level scheduling and exploits the autoregressive pattern of LLM inference to reduce JCT and head-of-line blocking. At its core is a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler, which minimizes average JCT in information-free settings. Rather than always entering the highest-priority queue, each arriving job joins the queue whose time quantum matches the execution time of its first output token, skipping the higher-priority queues it would otherwise have to be demoted through.
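The skip-join idea can be sketched as follows. This is a minimal illustration under assumed quanta, not FastServe's actual implementation; the class and method names are hypothetical:

```python
# Illustrative skip-join MLFQ: queue i has time quantum QUANTA[i].
# A new job "skip-joins" the first queue whose quantum covers its
# first-iteration time instead of always starting at the top level.
QUANTA = [0.01, 0.02, 0.04, 0.08]  # assumed seconds per level, doubling

class SkipJoinMLFQ:
    def __init__(self):
        self.queues = [[] for _ in QUANTA]  # one FIFO per priority level

    def submit(self, job, first_iter_time):
        # Skip-join: start at the highest-priority level whose quantum
        # is at least the first iteration's execution time.
        level = next((i for i, q in enumerate(QUANTA) if first_iter_time <= q),
                     len(QUANTA) - 1)
        self.queues[level].append(job)

    def demote(self, job, level):
        # Job used up its quantum without finishing: move one level down.
        self.queues[min(level + 1, len(QUANTA) - 1)].append(job)

    def next_job(self):
        # Serve the highest-priority non-empty queue first.
        for level, q in enumerate(self.queues):
            if q:
                return q.pop(0), level
        return None, None
```

The design choice being illustrated: a job whose first token already takes longer than the top queue's quantum would gain nothing from starting there, so placing it directly at a matching level spares shorter jobs from waiting behind it.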
FastServe also manages GPU memory carefully to prevent overflow. It proactively offloads the state of jobs sitting in low-priority queues when the key-value cache is nearly full, and uploads that state back before those jobs are scheduled again. This proactive approach avoids the head-of-line blocking that would occur if new jobs were delayed waiting for cache space. To serve large models that span multiple GPUs, FastServe combines parallelization techniques such as tensor and pipeline parallelism with a distributed key-value cache.
Evaluation of a FastServe prototype built on NVIDIA FasterTransformer shows improvements in both average and tail JCT over Orca, the current state-of-the-art solution. FastServe improves average JCT by up to 5.1× and tail JCT by up to 6.4×.
Overall, FastServe offers a solution to the challenges posed by LLM inference. Its unique approach to scheduling and memory management optimizes performance and user experience, making it a valuable tool for AI applications built on LLMs.
Check out the paper for more details.