Large Language Models (LLMs) are powerful tools across many AI applications, but deploying them is costly, and devices such as phones lack the memory to host them. To address this, a team of researchers introduced LLM Surgeon, a framework for unstructured, semi-structured, and structured LLM pruning. The framework can prune LLMs by up to 30% without significant performance degradation. LLM Surgeon uses weight magnitudes and activations to estimate the loss curvature, relating the cost of removing each weight to the true final training objective. It prunes many weights at once to reach a target model size while incurring the least possible cost, and it prunes over multiple steps (shots) to improve the performance-to-sparsity trade-off. LLM Surgeon outperforms all baselines, achieving the best performance across target sizes. In short, it tackles the deployment burden posed by the large parameter counts of LLMs and makes deployment easier.
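The idea of relating weight removal to a loss cost, and updating the remaining weights to compensate, can be sketched with the classic Optimal Brain Surgeon rule, which is the starting point for this line of work. The function name and the greedy one-at-a-time loop below are illustrative assumptions, not the paper's exact multi-weight algorithm:

```python
import numpy as np

# Sketch of curvature-aware pruning in the spirit of LLM Surgeon (illustrative,
# not the paper's exact method). Given an inverse curvature estimate H_inv,
# the cost of removing weight q is w_q^2 / (2 * [H^-1]_qq), and the remaining
# weights receive a correlated compensation update -w_q * H^-1[:, q] / [H^-1]_qq.

def prune_lowest_cost(w, H_inv, n_remove):
    """Remove the n_remove weights with the lowest estimated loss cost."""
    w = w.copy()
    costs = w ** 2 / (2.0 * np.diag(H_inv))   # per-weight removal cost
    remove = np.argsort(costs)[:n_remove]     # cheapest weights first
    for q in remove:                          # greedy, one weight at a time
        w -= w[q] * H_inv[:, q] / H_inv[q, q]  # compensate correlated weights
    w[remove] = 0.0  # re-zero removed weights (a full implementation would
    return w, remove  # instead recompute curvature for the remaining weights)
```

A real implementation would remove many weights jointly and refresh the curvature between shots; this sketch only shows the cost-plus-update structure.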
LLM Surgeon Optimization
LLM Surgeon uses the KFAC (Kronecker-Factored Approximate Curvature) approximation for memory-efficient curvature estimation. This makes it possible to dynamically allocate which structures to remove and to update the remaining weights to account for the removal. The researchers improve on prior weight-pruning work by using more accurate approximations of the loss curvature and more weight correlations when updating the remaining weights. They justify the multi-shot approach by showing that pruning performance increases with the number of shots.
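The memory savings of KFAC come from approximating a layer's curvature as a Kronecker product of two small matrices, which can be inverted separately. A minimal sketch for one linear layer, with synthetic activations and gradients standing in for real model statistics:

```python
import numpy as np

# Minimal KFAC-style sketch for one linear layer (illustrative; not the
# LLM Surgeon codebase). The curvature over the layer's weights is
# approximated as G ⊗ A, where
#   A = E[a a^T] over input activations a, and
#   G = E[g g^T] over gradients g of the loss w.r.t. the layer's outputs.
# Because (G ⊗ A)^-1 = G^-1 ⊗ A^-1, inverting the two small factors replaces
# inverting one (d_in * d_out) x (d_in * d_out) matrix.

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 8, 4
acts = rng.normal(size=(n, d_in))    # input activations, one row per token
grads = rng.normal(size=(n, d_out))  # output gradients, one row per token

A = acts.T @ acts / n                # input covariance   (d_in x d_in)
G = grads.T @ grads / n              # gradient covariance (d_out x d_out)

# Factor-wise inverses with damping for numerical stability
damp = 1e-3
A_inv = np.linalg.inv(A + damp * np.eye(d_in))
G_inv = np.linalg.inv(G + damp * np.eye(d_out))
```

For a layer with a 4096-by-4096 weight matrix, the factors are each 4096 squared, while the full curvature would be roughly 16.8 million squared, which is what makes the approximation memory-efficient.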
The researchers evaluated LLM Surgeon on language-modeling tasks with models such as OPT and LLaMA-2, using the wikitext-2 dataset. For structured compression, the framework reduces model size by up to 30% with no significant loss in performance, outperforming all baselines at every target size. The same holds for semi-structured and unstructured compression, where LLM Surgeon again achieves the best performance across all target sizes.
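Language-modeling evaluations like the one above are typically reported as perplexity, the exponential of the mean per-token negative log-likelihood; lower is better. A small self-contained sketch (the synthetic losses below are for illustration only, not results from the paper):

```python
import numpy as np

# Perplexity = exp(mean NLL over evaluated tokens). This is the standard
# metric for language-modeling benchmarks such as wikitext-2.

def perplexity(token_nlls):
    """Compute perplexity from per-token negative log-likelihoods."""
    return float(np.exp(np.mean(token_nlls)))

# Sanity check: a model assigning probability 1/V to every token in a
# V-word vocabulary has perplexity exactly V.
V = 50
uniform_nll = np.full(1000, np.log(V))
```

Comparing a pruned model's perplexity to the dense baseline's at each target size is how claims like "no significant loss in performance" are quantified.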