Advances in Large Language Models (LLMs) have produced models with billions or even trillions of parameters that perform strongly across a wide range of tasks. Their massive size, however, makes real-world deployment difficult because of the hardware they demand. Research has largely focused on scaling models up in line with established scaling laws to improve performance, which makes addressing these hardware constraints essential for broader LLM adoption.
To tackle the deployment challenge, researchers have turned to model compression techniques such as quantization and pruning. Quantization lowers the numerical precision of weights and activations, while pruning removes redundant parameters, ideally without requiring retraining. Recent work has made pruning considerably simpler to apply to LLMs, underscoring the need for efficient pruning methods tailored to these models.
A distinctive approach called ShortGPT, developed by researchers from Baichuan Inc. and the Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, analyzes layer-wise redundancy in LLMs using a metric called Block Influence (BI), which measures how much each layer transforms its hidden states. By ranking layers by their BI scores and removing those that score lowest, this simple method matches or outperforms more complex pruning techniques. It reveals significant layer-level redundancy in LLMs and offers a straightforward yet effective pruning strategy that reduces parameters and computation without compromising performance.
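As a rough illustration, here is a minimal sketch of how such a Block Influence score could be computed from a layer's input and output hidden states, assuming BI is defined as one minus the average per-token cosine similarity between them (the function and tensor names are illustrative, not taken from the paper's code):

```python
import torch

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Estimate a layer's Block Influence (BI) score.

    hidden_in / hidden_out: (batch, seq_len, hidden_dim) hidden states
    entering and leaving the layer. A BI near 0 means the layer barely
    changes its input (likely redundant); a BI near 1 means it transforms
    the hidden states substantially.
    """
    # Per-token cosine similarity between the layer's input and output.
    cos = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    # BI = 1 - average similarity over all tokens in the batch.
    return (1.0 - cos).mean().item()
```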
The proposed layer-deletion approach quantifies layer redundancy in Transformer-based architectures using the BI metric. Layers with low BI scores, indicating minimal impact on the hidden states, are removed to cut inference cost while maintaining performance. Concretely, the method constructs a calibration set, collects hidden states on it, calculates a BI score for each layer, and then deletes the least important layers according to the BI ranking, as sketched below.
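A hedged end-to-end sketch of that pipeline for a LLaMA-style checkpoint, using Hugging Face transformers to collect hidden states on a small calibration set, scoring each decoder layer with the BI function above, and dropping the lowest-scoring layers. The model name and calibration texts are placeholders, the `model.model.layers` attribute path assumes a LLaMA-style architecture, and for simplicity the layers are removed in one pass rather than iteratively as the write-up describes:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Tiny placeholder calibration set; a real run would use a few hundred samples.
calibration_texts = ["The quick brown fox jumps over the lazy dog."]

num_layers = model.config.num_hidden_layers
bi_scores = torch.zeros(num_layers)

with torch.no_grad():
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt")
        # hidden_states is a tuple of (num_layers + 1) tensors: the embedding
        # output followed by each decoder layer's output.
        hidden_states = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(num_layers):
            bi_scores[i] += block_influence(hidden_states[i], hidden_states[i + 1])

bi_scores /= len(calibration_texts)

# Remove the k layers with the lowest BI scores (least influence on hidden states).
k = int(0.25 * num_layers)
to_remove = set(bi_scores.argsort()[:k].tolist())
kept = [layer for i, layer in enumerate(model.model.layers) if i not in to_remove]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)
```

The pruned model can then be evaluated or saved as usual; no retraining step is involved in this sketch.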
In comparative experiments on benchmarks and baseline techniques commonly used in LLM evaluation, models pruned with the proposed approach consistently outperform the baseline methods across multiple natural language benchmarks. Removing whole layers also proves more effective than shrinking embedding dimensions, suggesting that the redundancy in these models runs along depth rather than width.
In conclusion, ShortGPT offers a distinctive LLM pruning approach built around layer-level redundancy as measured by Block Influence. The results demonstrate significant layer-wise redundancy in LLMs, enabling the removal of minimally contributing layers without compromising performance. This simple yet effective strategy maintains up to 95% of model performance while reducing parameter count and computational requirements by around 25%, surpassing previous pruning methods. The findings point to depth-based redundancy in LLMs, and the approach is compatible with other compression techniques for versatile model size reduction.
Check out the Paper for more details on this research.