Breaking Boundaries: Unconstrained Pruning Boosts Accuracy and Speed of Neural Networks

The Power of Unconstrained Channel Pruning in AI Models

The Growing Challenge of Inference Time with Modern Neural Networks

Modern neural networks are becoming larger and more complex, resulting in longer inference times. To combat this issue, channel pruning has emerged as an effective compression technique: it removes channels from convolutional weights, reducing the compute and memory required for inference. However, in multi-branch segments of a model, removing channels becomes more complicated and can introduce additional memory copies at inference time. These extra copies increase latency, sometimes to the point that the pruned model runs slower than the original unpruned one.
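To make the basic idea concrete, here is a minimal numpy sketch (not the paper's code) of pruning output channels from a hypothetical convolutional weight tensor, keeping the channels with the largest L1 norm:

```python
import numpy as np

# Hypothetical conv weight laid out as (out_channels, in_channels, kH, kW).
rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 3, 3, 3))

# Score each output channel by its L1 norm and keep the 4 strongest.
l1 = np.abs(weight).sum(axis=(1, 2, 3))
keep = np.sort(np.argsort(l1)[-4:])   # indices of channels to keep
pruned = weight[keep]                 # smaller weight -> less compute

print(weight.shape, "->", pruned.shape)  # (8, 3, 3, 3) -> (4, 3, 3, 3)
```

The layer that consumes this output must have its input channels pruned to match, which is exactly where multi-branch segments (where one output feeds several consumers) make things difficult.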

Introducing Unconstrained Pruning

To address this problem, previous pruning methods have imposed constraints on which channels can be pruned together. While this eliminates the need for memory copies during inference, we have found that these constraints significantly hurt accuracy. To solve both the latency and accuracy challenges, we rely on a key insight: channels can be reordered so that unconstrained pruning introduces few or no memory copies, without compromising accuracy.
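The reordering idea can be illustrated with a small numpy sketch (a simplified illustration, not the paper's implementation). Selecting a non-contiguous set of kept channels forces a gather, i.e. a memory copy, at inference time; permuting the channels once at export time so the kept set becomes a contiguous prefix turns that gather into a zero-copy slice:

```python
import numpy as np

# Feature map shaped (channels, H, W); pruning keeps the
# non-contiguous channel set {0, 2, 5, 6}.
x = np.arange(8 * 4 * 4).reshape(8, 4, 4)
keep = np.array([0, 2, 5, 6])

gathered = x[keep]            # fancy indexing -> new allocation (a copy)
assert gathered.base is None  # numpy confirms a copy was made

# Reorder channels once, offline, so the kept set becomes a prefix.
order = np.concatenate([keep, np.setdiff1d(np.arange(8), keep)])
x_reordered = x[order]        # done once at export time, not per inference

view = x_reordered[:4]            # plain slice -> zero-copy view
assert view.base is not None      # a view, not a copy
assert np.array_equal(view, gathered)
```

In a real network the permutation is folded into the adjacent layers' weights, so no reordering work remains at inference time.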

The UCPE Algorithm: A Game Changer in Pruning

Building on this insight, we designed an algorithm called UCPE that can prune models with any pattern, removing the constraints imposed by existing pruning heuristics. The results have been striking. In experiments with post-training pruning on ImageNet, UCPE improved top-1 accuracy by an average of 2.1 points, benefiting popular models such as DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2).

Not only does UCPE enhance accuracy, it also significantly reduces latency: up to 52.8% lower than naive unconstrained pruning. It achieves this by almost entirely eliminating the memory copies otherwise required at inference time.
