Boosting Throughput of Large Language Models (LLMs) with PagedAttention
Large language models (LLMs) are changing the way we live and work, enabling new applications like programming assistants and universal chatbots. However, these applications are costly to run because of their heavy hardware accelerator requirements. Recent estimates suggest that handling an LLM request can be up to ten times more expensive than a traditional keyword search. To reduce per-request cost, there is a growing need to boost the throughput of LLM serving systems.
The Problem with Existing Systems
Serving LLMs at high throughput requires batching many requests together. However, existing systems struggle to manage the large key-value cache (KV cache) each request carries. Inefficient management leads to fragmentation and redundant duplication, wasting GPU memory and shrinking the achievable batch size.
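To see why the KV cache dominates memory, consider a rough back-of-the-envelope estimate. The sketch below computes KV cache size per token from model dimensions; the formula is standard for transformers, while the example dimensions (roughly 13B-parameter-scale) and function name are illustrative assumptions, not taken from the paper.

```python
def kv_cache_bytes_per_token(num_layers: int, num_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache bytes stored per generated token.

    Each token keeps one key vector and one value vector (the factor 2)
    per layer, each of size num_heads * head_dim, in fp16 (2 bytes).
    """
    return 2 * num_layers * num_heads * head_dim * dtype_bytes

# Illustrative dimensions for a ~13B-parameter model.
per_token = kv_cache_bytes_per_token(num_layers=40, num_heads=40, head_dim=128)
print(per_token)  # 819200 bytes, i.e. ~0.8 MB per token
```

At roughly 0.8 MB per token, a single 2,000-token request already occupies over 1.5 GB, which is why fragmented or duplicated KV memory directly limits batch size.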
PagedAttention: A Solution for Efficient Memory Utilization
The researchers propose PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems. It manages attention keys and values with near-zero waste in KV cache memory and allows flexible sharing of KV cache within and across requests. Building on PagedAttention, the researchers developed vLLM, an LLM serving system that achieves up to 24 times the throughput of HuggingFace Transformers without any changes to the model architecture.
PagedAttention divides the KV cache into blocks, each holding the keys and values for a fixed number of tokens. These blocks can live in non-contiguous memory, enabling flexible management. Memory waste is confined to the last, partially filled block of each sequence, yielding efficient memory use and higher GPU utilization.
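The block scheme above can be sketched as a simple allocator plus a per-sequence block table mapping logical blocks to non-contiguous physical blocks. This is a minimal illustration of the idea, not vLLM's actual implementation; the names (`BLOCK_SIZE`, `BlockAllocator`, `Sequence`) are assumptions made for the example.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class BlockAllocator:
    """Hands out physical KV block ids from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """Tracks one request's tokens and its logical-to-physical block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0
    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Allocate a new physical block only when the previous one is
            # full, so waste is confined to the final, partial block.
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 40 tokens fit in ceil(40/16) = 3 blocks
```

Because physical blocks need not be contiguous, the allocator can place any free block anywhere in GPU memory, eliminating the external fragmentation that contiguous per-request buffers cause.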
Efficient Memory Sharing with PagedAttention
Beyond efficient memory utilization, PagedAttention also enables efficient memory sharing. This reduces the extra memory needed by decoding techniques such as parallel sampling and beam search by up to 55%, making these sampling techniques far more practical for LLM services and delivering speedups of up to 2.2 times.
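Sharing works by letting multiple sequences reference the same physical prompt blocks, with a copy made only when one sequence would modify a shared block (copy-on-write). The sketch below illustrates this with reference counting; the class and method names (`PagedCache`, `fork`, `write`) are assumptions for the example, not vLLM's internal API.

```python
class PagedCache:
    """Toy KV block pool with reference counting and copy-on-write."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref = {}  # physical block id -> reference count

    def alloc(self) -> int:
        b = self.free.pop()
        self.ref[b] = 1
        return b

    def fork(self, block_table: list) -> list:
        # A child sequence (e.g. one parallel sample) shares all of the
        # parent's blocks instead of duplicating the prompt's KV cache.
        for b in block_table:
            self.ref[b] += 1
        return list(block_table)

    def write(self, block_table: list, i: int) -> int:
        # Copy-on-write: duplicate a shared block before modifying it.
        b = block_table[i]
        if self.ref[b] > 1:
            self.ref[b] -= 1
            block_table[i] = self.alloc()
        return block_table[i]

cache = PagedCache(num_blocks=8)
parent = [cache.alloc(), cache.alloc()]  # prompt occupies 2 blocks
child = cache.fork(parent)               # shares both blocks, zero copies
cache.write(child, 1)                    # first divergent write copies one block
print(len(cache.free))  # 8 - 2 - 1 = 5 blocks still free
```

With n parallel samples, the prompt's KV blocks are stored once rather than n times, which is where the reported memory savings come from.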
The researchers' evaluations show that vLLM with PagedAttention increases the throughput of popular LLMs by 2-4 times compared to state-of-the-art systems such as FasterTransformer and Orca. The improvement is most pronounced for larger models, more complex decoding algorithms, and longer sequences.