Large language models (LLMs) have recently demonstrated impressive performance across a wide range of tasks. These models contain hundreds of billions of parameters and therefore require substantial memory and compute to run. For example, GPT-175B needs 325 GB of GPU memory simply to load its model weights. Reducing the resources required for LLM inference has consequently become an active area of interest. LLMs power applications ranging from back-end tasks such as benchmarking and information extraction to interactive ones such as chatbots. This study focuses on throughput-oriented generative inference, where LLM inference is run in large batches over many tokens, trading latency for higher throughput.
Three approaches have been used to reduce the resource needs of LLM inference: model compression, collaborative inference, and offloading. Model compression shrinks the memory footprint, collaborative inference spreads the cost of inference across machines, and offloading exploits CPU and disk memory in addition to the GPU. These approaches have significantly reduced the resource requirements of LLMs, but each has limitations. Model compression assumes the model still fits within GPU memory, while existing offloading-based systems achieve low throughput on a single GPU because of inefficient I/O scheduling and tensor placement.
The main goal of this study is to develop effective offloading mechanisms for high-throughput generative inference on a single GPU. This involves loading an LLM partially and executing the computation piece by piece, offloading the rest to secondary storage. The memory hierarchy is divided into tiers: lower tiers are slower but more abundant, while higher tiers are faster but scarcer. In such systems, small batch sizes cause bottlenecks, so throughput-oriented scenarios use large batch sizes to amortize data movement, accepting higher latency in exchange. Even so, achieving high-throughput generative inference under constrained GPU memory remains challenging.
The first challenge is devising an effective offloading strategy, which determines which tensors to offload, where to offload them, and when during inference. Generative inference involves three kinds of tensors: weights, activations, and the key-value (KV) cache. The design space for offloading is complex because of the batch-by-batch, token-by-token, and layer-by-layer structure of the algorithm.
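The layer-by-layer structure described above can be illustrated with a toy offloading loop. This is a minimal sketch, not FlexGen's API: the `cpu_store` dict stands in for a lower memory tier, and loading a layer's weights before its computation (and evicting them afterward) stands in for GPU/CPU transfers.

```python
# Toy sketch of layer-by-layer weight offloading (illustrative only;
# all names and the dict-based "memory tier" are hypothetical).
import numpy as np

NUM_LAYERS, HIDDEN = 4, 8

# Lower tier (CPU/disk) holds all layer weights; the "GPU" holds only
# the weights of the layer currently being computed.
cpu_store = {i: np.random.randn(HIDDEN, HIDDEN) for i in range(NUM_LAYERS)}

def run_layer(x, w):
    return np.tanh(x @ w)  # stand-in for a transformer layer

def generate_step(x):
    for i in range(NUM_LAYERS):
        w_gpu = cpu_store[i]      # "load" layer i into fast memory
        x = run_layer(x, w_gpu)   # compute with the resident weights
        del w_gpu                 # "evict" before loading the next layer
    return x

out = generate_step(np.ones(HIDDEN))
print(out.shape)  # (8,)
```

A real system overlaps the load of layer `i+1` with the computation of layer `i`, which is exactly where the scheduling decisions discussed above come into play.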
The second challenge is developing efficient compression algorithms. While earlier work has shown promising compression results for LLM weights and activations, further compression strategies are needed when these techniques are combined with offloading for high-throughput generative inference.
To address these challenges, researchers from UC Berkeley, Stanford, CMU, Meta, Yandex, ETH, and HSE introduce FlexGen, an offloading framework for high-throughput LLM inference. FlexGen efficiently schedules I/O operations, applies effective compression techniques, and exploits distributed pipeline parallelism across GPU, CPU, and disk memory.
The contributions of FlexGen include explicitly defining a search space of offloading options, developing a linear-programming-based search algorithm to maximize throughput, and demonstrating that LLM weights and the KV cache can be compressed to 4 bits without retraining or calibration. FlexGen outperforms other state-of-the-art offloading-based inference systems, such as DeepSpeed Zero-Inference and Hugging Face Accelerate, in terms of throughput.
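The idea of a policy search over tensor placements can be conveyed with a toy cost model. The sketch below brute-forces placement splits under a memory budget; all sizes, bandwidths, and the cost formula are invented for illustration, whereas FlexGen's actual search solves a linear program over a much richer model.

```python
# Toy placement search under a GPU memory budget (all numbers and the
# cost model are hypothetical; not FlexGen's actual policy search).
from itertools import product

GPU_MEM = 16                  # GB, hypothetical GPU memory budget
WEIGHTS, KV_CACHE = 30, 20    # GB, hypothetical tensor sizes
GPU_BW, CPU_BW = 100.0, 10.0  # GB/s, hypothetical effective bandwidths

best = None
for wg, kg in product(range(0, 101, 10), repeat=2):  # % placed on GPU
    gpu_bytes = WEIGHTS * wg / 100 + KV_CACHE * kg / 100
    if gpu_bytes > GPU_MEM:
        continue  # this placement does not fit in GPU memory
    # Estimated time: data served from each tier / that tier's bandwidth.
    t = gpu_bytes / GPU_BW + (WEIGHTS + KV_CACHE - gpu_bytes) / CPU_BW
    if best is None or t < best[0]:
        best = (t, wg, kg)

print(best)  # (estimated time, % weights on GPU, % KV cache on GPU)
```

Even this crude version shows the shape of the problem: a feasibility constraint (fit in GPU memory) and an objective (minimize estimated time per batch), which is precisely the structure a linear program can optimize exactly.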
For more information and access to the research paper and Github repository of FlexGen, please visit the provided links.