Streamlining Giants. The Evolution of Model Compression in… | by Nate Cibik | Feb, 2024
The quest to refine neural networks for practical applications traces its roots back to the foundational days of the field. When Rumelhart, Hinton, and Williams first demonstrated how to use the backpropagation algorithm to successfully train multi-layer neural networks that could learn complex, non-linear representations in 1986, the vast potential of these models became apparent. However, the computational power available in the 1980s restricted their practical use and the complexity of problems they could solve, a situation which mirrors the challenges we face with deploying LLMs today. Although the scale of models and the considerations being made were very different, early discoveries in network minimization would pave the way for big wins in model compression decades later. In this section, we take a brief journey through the history and motivations driving pruning research, discover the comparative strengths and weaknesses of unstructured versus structured methods, and prepare ourselves to explore their use in the modern era of LLMs.
Network pruning was originally motivated by the pursuit of better model generalization through freezing unimportant weights at zero, somewhat akin in theory to L1/Lasso and L2/Ridge regularization in linear regression, though different in that weights are selected and hard-set to zero (pruned) after training based on an importance criteria rather than being coaxed towards zero mathematically by the loss function during training (informed readers will know that regularization can also be achieved in neural network training using weight decay).
The common motivation behind both regularization and pruning (which can be seen as a form of regularization) is the theoretical and empirical evidence that neural networks are most effective at learning when overparameterized thanks to a higher-dimensional manifold of the loss function’s global minima and a larger exploration space in which effective subnetworks are more likely to be initialized (see “the lottery ticket hypothesis”). However, this overparameterization in turn leads to overfitting on the training data, and ultimately results in a network with many redundant or inactive weights. Although the theoretical mechanisms underlying the “unreasonable effectiveness” of overparameterized neural networks were less well studied at the time, researchers in the 1980s correctly hypothesized that it should be possible to remove a large portion of the network weights after training without significantly affecting task performance, and that performing iterative rounds of pruning and fine-tuning the remaining model weights should lead to better generalization, enhancing the model’s ability to perform well on unseen data.
Unstructured Pruning
To select parameters for removal, a measure of their impact on the cost function, or “saliency,” is required. While the earliest works in network minimization worked under the assumption that the magnitude of parameters should serve as a suitable measure of their saliency, LeCun et al. made a significant step forward in 1989 with “Optimal Brain Damage” (OBD), in which they proposed to use a theoretically justifiable measure of saliency using second-derivative information of the cost function with respect to the parameters, allowing them to directly identify the parameters which could be removed with the least increase in error.
Written in the era when the model of interest was a fully-connected neural network containing just 2,600 parameters, the authors of OBD were less concerned about removing weights due to computational efficiency than we are today with our billionaire behemoths, and were more interested in improving the model’s ability to generalize to unseen data by reducing model complexity. Even operating on a tiny model like this, however, the calculation of second-derivative information (Hessian matrix) is very expensive, and required the authors to make three convenient mathematical assumptions: 1) that the model is currently trained to an optimum, meaning the gradient of the loss with respect to every weight is currently zero and the slope of the gradient is positive in both directions, which zeroes out the first-order term of the Taylor expansion and implies the change in loss caused by pruning any parameter is positive, 2) that the Hessian matrix is diagonal, meaning the change in loss caused by removal of each parameter is independent, and therefore the loss deltas can be summed over subset of weights to calculate the total change in loss caused by their collective removal, and 3) that the loss function is nearly quadratic, meaning higher-order terms can be neglected from the Taylor expansion.
Despite this requisite list of naïve assumptions, their theoretically justified closed-form saliency metric proved itself superior to magnitude-based pruning in accurately identifying the least important weights in a network, able to retain more accuracy at higher rates of compression. Nevertheless, the efficacy and profound simplicity of magnitude-based pruning methods would make them the top choice for many future research endeavors in model compression, particularly as network sizes began to scale quickly, and Hessians became exponentially more frightening. Still, this successful demonstration of using a theoretically justified saliency measure to more accurately estimate saliency and thereby enable more aggressive pruning provided an inspirational recipe for future victories in model compression, although it would be some time before those seeds bore fruit.
Four years later in 1993, Hassibi et al.’s Optimal Brain Surgeon (OBS) expanded on the concept of OBD and raised the levels of compression possible without increasing error by eschewing the diagonality assumption of OBD and instead considering the cross-terms within the Hessian matrix. This allowed them to determine optimal updates to the remaining weights based on the removal of a given parameter, simultaneously pruning and optimizing the model, thereby avoiding the need for a retraining phase. However, this meant even more complex mathematics, and OBS was thus initially of limited utility to 21st Century researchers working with much larger networks. Nonetheless, like OBD, OBS would eventually see its legacy revived in future milestones, as we will see later.
The pruning methods in OBD and OBS are examples of unstructured pruning, wherein weights are pruned on an individual basis based on a measure of their saliency. A modern exemplar of unstructured pruning techniques is Han et al. 2015, which reduced the sizes of the early workhorse convolutional neural networks (CNNs) AlexNet and VGG-16 by 9x and 13x, respectively, with no loss in accuracy, using one or more rounds of magnitude-based weight pruning and fine-tuning. Their method unfortunately requires performing sensitivity analysis of the network layers to determine the best pruning rate to use for each individual layer, and works best when retrained at least once, which means it would not scale well to extremely large networks. Nevertheless, it is impressive to see the levels of pruning which can be accomplished using their unstructured approach, especially since they are using magnitude-based pruning. As with any unstructured approach, the reduced memory footprint can only be realized by using sparse matrix storage techniques which avoid storing the zeroed parameters in dense matrices. Although they do not employ it in their study, the authors mention in their related work section that the hashing trick (as demonstrated in the 2015 HashedNets paper) is complementary to unstructured pruning, as increasing sparsity decreases the number of unique weights in the network, thereby reducing the probability of hash collisions, which leads to lower storage demands and more efficient weight retrieval by the hashing function.
While unstructured pruning has the intended regularization effect of improved generalization through reduced model complexity, and the memory footprint can then be shrunk substantially by using sparse matrix storage methods, the gains in computational efficiency offered by this type of pruning are not so readily accessed. Simply zeroing out individual weights without consideration of the network architecture will create matrices with irregular sparsity that will realize no efficiency gains when computed using dense matrix calculations on standard hardware. Only specialized hardware which is explicitly designed to exploit sparsity in matrix operations can unlock the computational efficiency gains offered by unstructured pruning. Fortunately, consumer hardware with these capabilities is becoming more mainstream, enabling their users to actualize performance gains from the sparse matrices created from unstructured pruning. However, even these specialized hardware units must impose a sparsity ratio expectation on the number of weights in each matrix row which should be pruned in order to allow for the algorithmic exploitation of the resulting sparsity, known as semi-structured pruning, and enforcing this constraint has been shown to degrade performance more than purely unstructured pruning.
Structured Pruning
We’ve seen that unstructured pruning is a well-established regularization technique that is known to improve model generalization, reduce memory requirements, and offer efficiency gains on specialized hardware. However, the more tangible benefits to computational efficiency are offered by structured pruning, which entails removing entire structural components (filters, layers) from the network rather than individual weights, which reduces the complexity of the network in ways that align with how computations are performed on hardware, allowing for gains in computational efficiency to be easily realized without specialized kit.
A formative work in popularizing the concept of structured pruning for model compression was the 2016 Li et al. paper “Pruning Filters for Efficient ConvNets,” where, as the title suggests, the authors pruned filters and their associated feature maps from CNNs in order to greatly improve computational efficiency, as the calculations surrounding these filters can be easily excluded by physically removing the chosen kernels from the model, directly reducing the size of the matrices and their multiplication operations without needing to worry about exploiting sparsity. The authors used a simple sum of filter weights (L1 norm) for magnitude-based pruning of the filters, demonstrating that their method could reduce inferences costs of VGG-16 and ResNet-110 by 34% and 38%, respectively, without significant degradation of accuracy.
Their study also reveals some fascinating insights about how convolutional networks work by comparing the sensitivity of individual CNN layers to pruning, revealing that layers on the very beginning or past halfway through the depth of the network were able to be pruned aggressively with almost no impact on the model performance, but that layers around 1/4 of the way into the network were very sensitive to pruning and doing so made recovering model performance difficult, even with retraining. The results, shown below, reveal that the layers which are most sensitive to pruning are those containing many filters with large absolute sums, supporting the theory of magnitude as a saliency measure, as these layers are clearly more important to the network, since pruning them away causes pronounced negative impact on model performance which is difficult to recover.
Most importantly, the results from Li et al. show that many layers in a CNN could be pruned of even up to 90% of their filters without harming (and in some cases even improving) model performance. Additionally, they found that when pruning filters from the insensitive layers, iterative retraining layer-by-layer was unnecessary, and a single round of pruning and retraining (for 1/4 of the original training time) was all that was required to recover model performance after pruning away significant portions of the network. This is great news in terms of efficiency, since multiple rounds of retraining can be costly, and previous work had reported requiring up to 3x original training time to produce their pruned models. Below we can see the overall results from Li et al. which reveal that the number of floating point operations (FLOPs) could be reduced between 15 and 40 percent in the CNNs studied without harming performance, and in fact offering gains in many instances, setting a firm example of the importance of pruning models after training.
Although this study was clearly motivated by efficiency concerns, we know from decades of evidence linking reduced model complexity to improved generalization that these networks should perform better on unseen data as well, a fundamental advantage which motivated pruning research in the first place. However, this pruning method requires a sensitivity analysis of the network layers in order to be done correctly, requiring additional effort and computation. Further, as LeCun and his colleagues correctly pointed out back in 1989: although magnitude-based pruning is a time-tested strategy, we should expect a theoretically justified metric of salience to produce a superior pruning strategy, but with the size of modern neural networks, computing the Hessian matrix required for the second-order Taylor expansions used in their OBD method would be too intensive. Fortunately, a happy medium was forthcoming.
Trailing Li et al. by only a few months in late 2016, Molchanov and his colleagues at Nvidia reinvestigated the use of Taylor expansion to quantify salience for structured pruning of filters from CNNs. In contrast to OBD, they avoid the complex calculation of the second-order terms, and instead extract a useful measure of saliency by considering the variance rather than the mean of the first-order Taylor expansion term. The study provides empirical comparison of several saliency measures against an “oracle” ranking which was computed by exhaustively calculating the change in loss caused by removing each filter from a fine-tuned VGG-16. In the results shown below, we can see that the proposed Taylor expansion saliency measure most closely correlates with the oracle rankings, followed in second place by the more computationally intensive OBD, and the performance results reflect that these methods are also best at preserving accuracy, with the advantage more clearly in favor of the proposed Taylor expansion method when plotting over GFLOPs. Interestingly, the inclusion of random filter pruning in their study shows us that it performs surprisingly well compared to minimum weight (magnitude-based) pruning, challenging the notion that weight magnitude is a reliable measure of saliency, at least for the CNN architectures studied.