The Power of Unconstrained Channel Pruning in AI Models
The Growing Challenge of Inference Time with Modern Neural Networks
Modern neural networks are growing larger and more complex, which drives up inference time. Channel pruning has emerged as an effective compression technique for combating this: it removes channels from convolutional weights, reducing the resources required for inference. In multi-branch segments of a model, however, removing channels is more complicated and can introduce additional memory copies at inference time. These extra copies increase latency, to the point that the pruned model can end up even slower than the original unpruned model.
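To make the basic operation concrete, here is a minimal sketch of channel pruning for a simple two-convolution stack in PyTorch. The helper name, the L1-norm importance score, and the keep count are illustrative assumptions, not the method described in this article; the multi-branch case discussed above is precisely where this simple approach breaks down.

```python
# Minimal sketch: remove output channels from one conv and the matching
# input channels from the next so the shapes still line up.
import torch
import torch.nn as nn

def prune_output_channels(conv1: nn.Conv2d, conv2: nn.Conv2d, keep: torch.Tensor):
    """Keep only conv1's output channels listed in `keep`, and drop the
    corresponding input channels of conv2. (Any BatchNorm between the two
    layers would need the same slicing.)"""
    pruned1 = nn.Conv2d(conv1.in_channels, len(keep), conv1.kernel_size,
                        stride=conv1.stride, padding=conv1.padding,
                        bias=conv1.bias is not None)
    pruned1.weight.data = conv1.weight.data[keep].clone()
    if conv1.bias is not None:
        pruned1.bias.data = conv1.bias.data[keep].clone()

    pruned2 = nn.Conv2d(len(keep), conv2.out_channels, conv2.kernel_size,
                        stride=conv2.stride, padding=conv2.padding,
                        bias=conv2.bias is not None)
    pruned2.weight.data = conv2.weight.data[:, keep].clone()
    if conv2.bias is not None:
        pruned2.bias.data = conv2.bias.data.clone()
    return pruned1, pruned2

# Example: keep the 32 output channels of conv1 with the largest L1 norm.
conv1 = nn.Conv2d(3, 64, 3, padding=1)
conv2 = nn.Conv2d(64, 128, 3, padding=1)
scores = conv1.weight.detach().abs().sum(dim=(1, 2, 3))
keep = torch.argsort(scores, descending=True)[:32].sort().values
conv1, conv2 = prune_output_channels(conv1, conv2, keep)
```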
Introducing Unconstrained Pruning
To address this problem, previous pruning methods have imposed constraints on which channels can be pruned together. While this eliminates the need for memory copies during inference, we have found that these constraints come at a significant cost in accuracy. Our key insight resolves both the latency and the accuracy challenge at once: we enable unconstrained pruning and then reorder channels to minimize memory copies, without compromising accuracy.
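The following is a hedged sketch of the reordering idea only, with illustrative names (it is not the exported-model rewrite itself). If two consumers of the same feature map keep different channel subsets, a naive export gathers each subset, which is a memory copy; permuting the producer's output channels so a consumer's kept channels form a contiguous slice turns that gather into a zero-copy view. In general both subsets cannot always be made contiguous simultaneously, which is why copies are minimized rather than always eliminated.

```python
import torch

def build_permutation(keep_a, keep_b, total):
    """Order channels as: shared first, then A-only, then B-only, then unused."""
    set_a, set_b = set(keep_a), set(keep_b)
    shared = [c for c in keep_a if c in set_b]
    only_a = [c for c in keep_a if c not in set_b]
    only_b = [c for c in keep_b if c not in set_a]
    rest = [c for c in range(total) if c not in set_a | set_b]
    return shared + only_a + only_b + rest

# After permuting the producer's output filters with `perm`, consumer B's
# channels sit at positions 0:2 and consumer A's at 0:3 -- both are plain
# slices, so neither consumer needs a gather (copy) at inference time.
perm = build_permutation(keep_a=[0, 2, 5], keep_b=[0, 2], total=8)
print(perm)  # [0, 2, 5, 1, 3, 4, 6, 7]
```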
The UCPE Algorithm: A Game Changer in Pruning
Building on this insight, we have designed an algorithm called UCPE. It can prune models with any pattern, removing the constraints imposed by existing pruning heuristics, and the results have been remarkable. In experiments with post-training pruning on ImageNet, UCPE improves top-1 accuracy by an average of 2.1 points, with especially large gains for popular models such as DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2).
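To illustrate why lifting the constraints can help accuracy, here is a small sketch contrasting constrained and unconstrained selection at a branch join (for example, a residual add fed by two convolutions). The layer shapes, the L1-norm importance score, and the shared-score rule for the constrained case are assumptions for the sketch, not UCPE's actual scoring.

```python
# Two branches whose outputs are summed must stay channel-aligned.
import torch
import torch.nn as nn

conv_a = nn.Conv2d(64, 64, 3, padding=1)   # main branch
conv_b = nn.Conv2d(64, 64, 1)              # skip/projection branch
n_keep = 32

score_a = conv_a.weight.detach().abs().sum(dim=(1, 2, 3))
score_b = conv_b.weight.detach().abs().sum(dim=(1, 2, 3))

# Constrained: both branches keep the SAME channel indices so the add stays
# aligned with no extra copies; selection uses a shared (summed) score, so
# each branch may be forced to discard some of its own most important filters.
keep_constrained = torch.argsort(score_a + score_b, descending=True)[:n_keep]

# Unconstrained ("any pattern"): each branch keeps its own best channels,
# preserving more important filters per layer, but a naive export then needs
# memory copies to realign the tensors before the add.
keep_a = torch.argsort(score_a, descending=True)[:n_keep]
keep_b = torch.argsort(score_b, descending=True)[:n_keep]
```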
Not only does UCPE enhance accuracy, it also significantly reduces latency: compared to naive unconstrained pruning, it cuts latency by up to 52.8% by almost entirely eliminating the memory copies otherwise required at inference time.