Neural networks have made significant advancements in various applications like language, mathematical reasoning, and vision. However, these networks usually require large computational resources, making it difficult to use them in resource-constrained environments like wearables and smartphones. One way to reduce the computational cost is through pruning, which involves removing unnecessary weights in the network without affecting its performance.
There are different pruning methods that can be used at different stages of the network’s training process. In this article, we focus on post-training pruning, which involves determining which weights should be pruned in a pre-trained network. A popular method is magnitude pruning, which removes weights with the smallest magnitude. While efficient, this method doesn’t consider the impact of weight removal on the network’s performance. Another method is optimization-based pruning, which removes weights based on their impact on the loss function. However, existing optimization-based approaches often have a tradeoff between performance and computational requirements.
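To make the baseline concrete, here is a minimal sketch of magnitude pruning in numpy. The function name and the sparsity-fraction interface are illustrative choices, not from any particular library: the method simply keeps the largest-magnitude weights and zeros out the rest, with no regard for the loss.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of entries with smallest magnitude."""
    flat = weights.ravel()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Indices of the k smallest-|w| entries (partial sort, O(p) on average).
    idx = np.argpartition(np.abs(flat), k - 1)[:k]
    pruned = flat.copy()
    pruned[idx] = 0.0
    return pruned.reshape(weights.shape)

w = np.array([0.5, -0.01, 2.0, 0.003, -1.2, 0.02])
print(magnitude_prune(w, 0.5))  # the three smallest-magnitude entries become zero
```

Note that the decision for each weight is made independently of all the others, which is exactly the limitation that optimization-based methods address.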
In a recent study presented at ICML 2023, we developed a new approach called CHITA (Combinatorial Hessian-free Iterative Thresholding Algorithm) for pruning pre-trained neural networks at scale. CHITA outperforms existing methods in terms of scalability and performance tradeoffs by leveraging advances from high-dimensional statistics, combinatorial optimization, and neural network pruning. For example, CHITA can be significantly faster than existing methods when pruning networks such as ResNet, and can improve accuracy by over 10% in many settings.
There are two notable technical improvements in CHITA. First, it efficiently uses second-order information without explicitly computing or storing the Hessian matrix, which allows for better scalability. Second, it uses a combinatorial optimization algorithm that takes into account the impact of pruning one weight on others, avoiding the removal of important weights.
The pruning problem is formulated as a best-subset selection (BSS) problem: given a budget on the number of nonzero weights, the goal is to choose which weights to keep (and how to adjust them) so that the increase in loss is as small as possible. The loss is approximated by a quadratic function using a second-order Taylor expansion around the pre-trained weights, with the Hessian estimated by the empirical Fisher information matrix. By exploiting the low-rank structure of the empirical Fisher information matrix, CHITA avoids the computational challenge of explicitly computing or storing the Hessian.
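The low-rank trick can be illustrated with a small numpy sketch (an illustrative simplification, not the paper's implementation). If A is the n-by-p matrix of per-sample gradients at the pre-trained weights w̄, the empirical Fisher is F = (1/n)AᵀA, a p-by-p matrix. Because the quadratic model can be evaluated through A directly, F never needs to be formed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2000               # n gradient samples, p weights (n << p)
A = rng.normal(size=(n, p))   # stand-in for per-sample gradients
w_bar = rng.normal(size=p)    # stand-in for the pre-trained weights

def quadratic_loss(w):
    """Quadratic model of the loss change (up to constants), evaluated
    through the low-rank factor A: costs O(n*p) instead of O(p^2)."""
    r = A @ (w - w_bar)        # an n-vector
    return 0.5 * (r @ r) / n

# Sanity check against the explicit Fisher -- feasible here only because
# p is small enough for a demo; at real network scale F would not fit.
F = (A.T @ A) / n
d = rng.normal(size=p)
assert np.isclose(quadratic_loss(w_bar + d), 0.5 * d @ F @ d)
```

Since n (the number of gradient samples) is typically far smaller than p (the number of weights), working with A instead of F is what makes the approach scale.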
To solve the pruning BSS problem, CHITA uses an iterative hard thresholding (IHT) style algorithm: each iteration takes a gradient step on the quadratic objective and then sets all coefficients outside the top-k (by magnitude) to zero. To improve convergence, CHITA adds a new line-search method and employs computational schemes that enhance both efficiency and the quality of the second-order approximation.
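The IHT loop above can be sketched as follows. This is a minimal fixed-step variant for the quadratic objective q(w) = (1/2n)‖A(w − w̄)‖²; the function name, the step-size rule, and the omission of CHITA's line search and other refinements are all simplifying assumptions for illustration.

```python
import numpy as np

def iht_prune(A, w_bar, k, steps=100):
    """Iterative hard thresholding for
        min (1/2n) ||A (w - w_bar)||^2   s.t.  ||w||_0 <= k.
    A simplified sketch, not the full CHITA algorithm (no line search)."""
    n, p = A.shape
    # Conservative fixed step below 1/L, where L = sigma_max(A)^2 / n
    # is the Lipschitz constant of the gradient; keeps the loss monotone.
    lr = 0.9 * n / (np.linalg.norm(A, 2) ** 2)
    w = w_bar.copy()
    for _ in range(steps):
        grad = A.T @ (A @ (w - w_bar)) / n   # gradient of the quadratic model
        w = w - lr * grad                    # gradient step
        # Hard threshold: keep only the k largest-magnitude coefficients.
        keep = np.argpartition(np.abs(w), p - k)[p - k:]
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    return w
```

Starting from w̄, the first iteration reproduces magnitude pruning (the gradient is zero there), and subsequent iterations jointly re-optimize the surviving weights, which is how the method accounts for interactions between pruned and kept weights.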
In experiments comparing CHITA with other pruning methods on architectures such as ResNet and MobileNet, CHITA demonstrated superior scalability and accuracy. For example, CHITA achieved speed-ups of over 1000x relative to existing optimization-based methods when pruning ResNet, and delivered substantial improvements in post-pruning accuracy over magnitude pruning and other baselines.