Model quantization is a cornerstone technique for deploying large machine learning models in resource-constrained environments, significantly reducing both training and inference costs. This is particularly important for Large Language Models (LLMs). Tools like bitsandbytes have democratized access to large models, enabling their use on consumer-grade GPUs, a transformative advance for the machine learning community. This progress underscores the growing importance of quantization techniques that improve model efficiency without compromising performance.
In the realm of weight-only quantization, methodologies bifurcate into two primary categories: data-free calibration and calibration-based methods. Data-free techniques, exemplified by bitsandbytes, operate solely on model weights, eschewing external data. Conversely, calibration-based approaches like GPTQ and AWQ leverage external datasets to refine quantization. While calibration-based methods typically yield superior quantization quality, they are encumbered by two principal challenges:
- Calibration data bias: The selection and nature of calibration data can exert a considerable influence on quantization quality, potentially introducing bias and affecting generalization.
- Quantization time: The calibration process, particularly for colossal models, can be computationally intensive and time-consuming, impeding rapid experimentation and deployment across diverse models.
The ideal scenario would combine the high fidelity of calibration-based quantization with the speed of calibration-free methods. This is the ambition driving our proposed Half-Quadratic Quantization (HQQ) method.
Unveiling Half-Quadratic Quantization
Basic quantization often leads to a noticeable loss of model accuracy. This degradation stems from the wide dynamic range of the weights, which cannot be represented faithfully after quantization. Outlier weights, those deviating significantly from the central distribution, present a particularly difficult challenge. Algorithms like GPTQ and Activation-aware Weight Quantization (AWQ) mitigate this issue by employing calibration data to minimize errors in layer outputs.
In contrast to these activation-centric methods, our Half-Quadratic Quantization (HQQ) method focuses on minimizing errors directly in the weights themselves. Furthermore, by adopting a sparsity-promoting loss function, such as the ( l_{p<1} )-norm, HQQ models outliers through a hyper-Laplacian distribution. This distribution captures the heavy-tailed nature of outlier errors more accurately than the squared error, affording a more refined representation of the error distribution.
To determine optimal quantization parameters—zero-point ( z ) and scaling ( s )—we introduce a robust optimization formulation. Specifically, we employ a sparsity-promoting loss function ( \phi() ), such as the ( l_{p} )-norm, to minimize the discrepancy between the original weights ( W ) and their dequantized counterparts:
$$\underset{z,s}{\text{argmin}}\;\phi\!\left(W - Q_{z,s}^{-1}(Q_{z,s}(W))\right)$$
Here, ( Q_{z,s}() ) represents the quantization operator, parameterized by ( z ) and ( s ), which produces the quantized weights ( W_{q} ), and ( Q_{z,s}^{-1}() ) is the corresponding de-quantization operator:
$$\begin{aligned} Q_{z,s}(W) &= \text{round}(W/s + z) = W_{q} \\ Q_{z,s}^{-1}(W_{q}) &= s(W_{q} - z) \end{aligned}$$
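To make these operators concrete, here is a minimal PyTorch sketch of the round-to-nearest quantize/dequantize pair defined above; the per-tensor scale and zero-point initialization and the clamping to the n-bit range are illustrative assumptions, not the exact released implementation.

```python
import torch

def quantize(W, scale, zero, nbits=4):
    # Q_{z,s}(W) = round(W / s + z), clamped to the n-bit range [0, 2**nbits - 1]
    return torch.clamp(torch.round(W / scale + zero), 0, 2**nbits - 1)

def dequantize(W_q, scale, zero):
    # Q_{z,s}^{-1}(W_q) = s * (W_q - z)
    return scale * (W_q - zero)

# Illustrative per-tensor scale/zero-point derived from the weight range
W = torch.randn(256, 256)
nbits = 4
w_min, w_max = W.min(), W.max()
scale = (w_max - w_min) / (2**nbits - 1)
zero = -w_min / scale
W_q = quantize(W, scale, zero, nbits)
W_r = dequantize(W_q, scale, zero)
print("mean abs reconstruction error:", (W - W_r).abs().mean().item())
```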
The adoption of the ( l_{p<1} )-norm introduces non-convexity into the optimization problem. To navigate this, we utilize a Half-Quadratic solver, facilitated by the introduction of an auxiliary variable ( W_{e} ). This auxiliary variable enables the decomposition of the primary problem into more tractable sub-problems. For simplicity, we fix the scaling parameter ( s ) and focus solely on optimizing the zero-point ( z ).
$$\underset{z,W_{e}}{\text{argmin}}\;\phi(W_{e}) + \frac{\beta}{2}\left\|W_{e} - \left(W - Q_{z}^{-1}(Q_{z}(W))\right)\right\|_{2}^{2}$$
This formulation leads to a series of sub-problems, iteratively solved through alternate optimization:
$$\begin{aligned} \text{(sp}_{1})\quad & W_{e}^{(t+1)} \leftarrow \underset{W_{e}}{\text{argmin}}\;\phi(W_{e}) + \frac{\beta^{(t)}}{2}\left\|W_{e} - \left(W - Q_{z}^{-1}(Q_{z}(W))\right)\right\|_{2}^{2} \\ \text{(sp}_{2})\quad & z^{(t+1)} \leftarrow \underset{z}{\text{argmin}}\;\frac{1}{2}\left\|Q_{z}^{-1}(Q_{z}(W)) - \left(W - W_{e}^{(t+1)}\right)\right\|_{2}^{2} \\ & \beta^{(t+1)} \leftarrow \kappa\,\beta^{(t)}, \end{aligned}$$
where ( \beta ) and ( \kappa ) are strictly positive parameters.
Sub-problem ( \text{(sp}_{1}) )
This sub-problem has the structure of a Proximal Operator. When ( \phi() ) is the ( l_{1} )-norm, the solution is given by the soft-thresholding operator. For the ( l_{p} )-norm with ( 0 \le p \le 1 ), we employ a generalized thresholding solution known as the generalized soft-thresholding operator, which is particularly effective when sparsity is desired:
$$\begin{aligned} W_{e}^{(t+1)} &\leftarrow \text{shrink}_{l_{p}}\!\left(W - Q_{z}^{-1}(Q_{z}(W)),\,\beta\right) \\ \text{shrink}_{l_{p}}(x,\beta) &= \text{sign}(x)\,\text{relu}\!\left(|x| - \frac{|x|^{p-1}}{\beta}\right) \end{aligned}$$
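A direct PyTorch translation of this operator might look like the following sketch; the small epsilon guarding the negative exponent at zero is an implementation detail assumed here, not taken from the reference code.

```python
import torch

def shrink_lp(x, beta, p=0.7):
    # Generalized soft-thresholding: sign(x) * relu(|x| - |x|^(p-1) / beta).
    # For p = 1 this reduces to the standard soft-thresholding operator.
    if p == 1:
        return torch.sign(x) * torch.relu(torch.abs(x) - 1.0 / beta)
    x_abs = torch.abs(x)
    # p < 1 gives a negative exponent; the small epsilon avoids 0 ** (p - 1)
    return torch.sign(x) * torch.relu(x_abs - (x_abs + 1e-8) ** (p - 1) / beta)

# Small usage example
x = torch.randn(5)
print(shrink_lp(x, beta=1.0, p=0.7))
```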
Sub-problem ( \text{(sp}_{2}) )
The second sub-problem can be reformulated as:
$$\begin{aligned} z^{(t+1)} &\leftarrow \underset{z}{\text{argmin}}\;\frac{1}{2}\left\|z - \left(W_{q}^{(t+1)} - \frac{W - W_{e}^{(t+1)}}{s}\right)\right\|_{2}^{2} \\ W_{q}^{(t+1)} &= \text{round}(W/s + z^{(t)}) \end{aligned}$$
The solution is simply the average taken across the axis corresponding to the quantization grouping:
$$z^{(t+1)} \leftarrow \left\langle W_{q}^{(t+1)} - \frac{W - W_{e}^{(t+1)}}{s} \right\rangle$$
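In code, this update is a single mean over the group axis. The snippet below is an illustrative sketch that assumes the weights have already been reshaped so that each column is one quantization group:

```python
import torch

# Illustrative check of the closed-form zero-point update from (sp_2).
# Assumed layout: W reshaped to (group_size, n_groups); scale/zero broadcast along dim=0.
group_size, n_groups = 64, 8
W     = torch.randn(group_size, n_groups)
W_e   = torch.zeros_like(W)                    # auxiliary variable from (sp_1)
scale = torch.full((1, n_groups), 0.1)
zero  = torch.zeros(1, n_groups)

W_q  = torch.round(W / scale + zero)           # uses the previous zero-point z^(t)
zero = (W_q - (W - W_e) / scale).mean(dim=0, keepdim=True)   # z^(t+1): mean over the group axis
```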
In our practical implementation, we operate with the inverse of the scale, ( 1/s ), rather than ( s ) itself, which we observed to enhance stability, especially in half-precision computations.
A key advantage of our approach lies in its reliance on closed-form solutions, in contrast with gradient descent via autograd. This eliminates the need for gradient calculations, so all computations can be performed in inference mode with half-precision. Furthermore, the solver typically converges within a few iterations. In contrast, methods employing AdamW optimizers and PyTorch's autograd require thousands of iterations to achieve comparable results and often fail with ( p < 1 ), a setting crucial for promoting sparsity in the weight matrices. Thanks to the Half-Quadratic solution, our quantization method achieves a remarkable speed advantage, over 100x faster than autograd for quantizing Llama-2-7B, making it feasible to process even the largest models in a matter of minutes.
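Putting the pieces together, a minimal sketch of the alternating solver could look as follows; the group layout, the error-based early stopping, and the parameter defaults are assumptions for illustration rather than the reference implementation.

```python
import torch

def shrink_lp(x, beta, p=0.7):
    # Generalized soft-thresholding (proximal step for the l_p loss, 0 <= p <= 1).
    x_abs = torch.abs(x)
    return torch.sign(x) * torch.relu(x_abs - (x_abs + 1e-8) ** (p - 1) / beta)

@torch.inference_mode()  # closed-form updates only, so no autograd is needed
def optimize_zero_point(W, scale, zero, nbits=4, p=0.7, beta=1.0, kappa=1.01, iters=20):
    # Alternating optimization of the zero-point with the scale kept fixed.
    # Assumed layout: W is (group_size, n_groups); scale/zero broadcast along dim=0.
    best_err = float("inf")
    for _ in range(iters):
        W_q = torch.clamp(torch.round(W / scale + zero), 0, 2**nbits - 1)  # Q_z(W)
        W_r = scale * (W_q - zero)                                         # Q_z^{-1}(Q_z(W))
        W_e = shrink_lp(W - W_r, beta, p)                                  # (sp_1)
        zero = (W_q - (W - W_e) / scale).mean(dim=0, keepdim=True)         # (sp_2)
        beta *= kappa                                                      # beta^(t+1) = kappa * beta^(t)
        err = (W - W_r).abs().mean().item()
        if err >= best_err:   # simple early stopping once the error stops improving
            break
        best_err = err
    return zero

# Illustrative usage; in practice this would run on the GPU, typically in half-precision.
group_size, n_groups = 64, 1024
W     = torch.randn(group_size, n_groups)
scale = (W.max(dim=0, keepdim=True).values - W.min(dim=0, keepdim=True).values) / (2**4 - 1)
zero  = -W.min(dim=0, keepdim=True).values / scale
zero  = optimize_zero_point(W, scale, zero)
```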
Quantization Speed Benchmarks
We measured the processing time required to quantize the Llama-2 model family. We observed significant variability in processing times for GPTQ and AWQ across different hardware configurations. HQQ, however, runs the entire quantization process on the GPU in half-precision, using the CPU only for the initial data transfer to the GPU and for retrieving the results once the solver finishes. HQQ quantizes even the largest Llama-2-70B model in just a few minutes, a speed improvement of over 50x compared to GPTQ, highlighting its efficiency for large-scale model deployment.
Figures: processing time to quantize Llama-2-7B, Llama-2-13B, and Llama-2-70B with the different quantization methods.
Performance Evaluation
Llama-2 Performance Metrics
To evaluate the quantization quality of HQQ, we employed the perplexity metric (PPL) on the widely recognized wikitext2 dataset. We also report the GPU memory footprint in GB (MEM) required to execute the quantized model. (Note that prediction may require additional memory depending on sequence length). We benchmarked against prevalent methods in the community: BNB (bitsandbytes), GPTQ via AutoGPTQ, and AWQ via AutoAWQ.
For HQQ, we fixed solver parameters as follows: p=0.7, beta=1, kappa=1.01, iterations=20. We also incorporated early-stopping to terminate the solver if error reduction plateaued. Parameter tuning was not extensively explored, suggesting potential for further performance gains with optimized settings. Consistent with other methods, we used grouping for weight quantization into buffers (_g128 denotes a group-size of 128). The zero-point was quantized to 8-bit without grouping or optimization.
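To illustrate what grouping means in practice, the sketch below reshapes a weight matrix into contiguous groups of 128 values, each with its own scale and zero-point; the layout (groups along the first axis) is an assumed convention for illustration, not the exact buffer format used by the released code.

```python
import torch

def to_groups(W, group_size=128):
    # Split the flattened weight matrix into contiguous groups of `group_size` values;
    # each group (one column) then gets its own scale and zero-point.
    assert W.numel() % group_size == 0
    return W.reshape(-1, group_size).t().contiguous()   # shape: (group_size, n_groups)

W  = torch.randn(4096, 4096)                # a single linear layer's weights (illustrative size)
Wg = to_groups(W, group_size=128)
scale = (Wg.max(dim=0, keepdim=True).values - Wg.min(dim=0, keepdim=True).values) / (2**4 - 1)
zero  = -Wg.min(dim=0, keepdim=True).values / scale
print(Wg.shape, scale.shape, zero.shape)    # one (scale, zero) pair per group of 128 weights
```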
Method | nBits | Llama-2-7B PPL ↓ | Llama-2-7B MEM (GB) ↓ | Llama-2-13B PPL ↓ |
---|---|---|---|---|
FP | 16 | 5.18 | 13.5 | 4.63 |
BNB | 8 | 5.22 | 7.9 | 4.67 |
GPTQ_g128 | 8 | 5.19 | 7.8 | 4.63 |
HQQ_g128 | 8 | 5.19 | 7.6 | 4.63 |
BNB_g64 | 4 | 5.43 | 4.7 | 4.79 |
GPTQ_g128 | 4 | 5.41 | 5 | 4.74 |
GPTQ_g64 | 4 | 5.38 | 5 | 4.73 |
AWQ_g128 | 4 | 5.32 | 4.6 | 4.71 |
AWQ_g64 | 4 | 5.28 | 4.6 | 4.7 |
HQQ_g128 | 4 | 5.35 | 4.6 | 4.74 |
HQQ_g64 | 4 | 5.3 | 4.6 | 4.7 |
GPTQ_g128 | 3 | 6.3 | 3.9 | 5.25 |
GPTQ_g64 | 3 | 6.1 | 4 | 5.16 |
HQQ_g128 | 3 | 6.2 | 3.8 | 5.15 |
HQQ_g64 | 3 | 5.82 | 4.5 | 4.98 |
GPTQ_g64 | 2 | nan | 3.5 | 13 |
HQQ_g32 | 2 | 15.61 | 3.5 | 7.63 |
HQQ_g16 | 2 | 7.3 | 4.1 | 6.36 |
HQQ_g16_s* | 2 | 7.31 | 3.7 | 6.37 |
*: the scaling is also quantized to 8-bits with a group-size of 128.
The results demonstrate HQQ’s robust performance without calibration data. For larger models like Llama-2-70B, 2-bit HQQ quantization achieves lower perplexity than the full-precision Llama-2-13B, with comparable memory usage, showcasing the potential of extreme low-bit compression.
ViT Performance Metrics
We also assessed HQQ’s effectiveness on vision models, quantizing various OpenCLIP models from the Vision Transformer (ViT) family, trained on the LAION dataset. Because AutoGPTQ and AutoAWQ calibration is limited to text inputs, we compared HQQ only against bitsandbytes, replacing the linear layers within the transformer blocks with quantized versions, as sketched below.
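The layer-swapping step can be sketched as follows: walk the model recursively and replace each nn.Linear with a quantized version. Here, `QuantLinear` and the model attribute path are hypothetical placeholders rather than actual classes from the libraries mentioned.

```python
import torch.nn as nn

def replace_linears(module, make_quant_linear):
    # Recursively swap every nn.Linear inside `module` for a quantized version.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, make_quant_linear(child))
        else:
            replace_linears(child, make_quant_linear)

# Illustrative usage; `model` and `QuantLinear` are placeholders, not actual library objects:
# replace_linears(model.visual.transformer, lambda lin: QuantLinear(lin, nbits=4, group_size=64))
```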
We conducted two benchmark sets, reporting top-1 and top-5 accuracy on ImageNet. The first benchmark evaluated zero-shot performance using OpenAI prompts to create zero-shot classifiers. The second benchmark used quantized models as frozen backbones, training a linear Softmax classifier (Linear Probing).
Method | nBits | Model | Linear (top-1) | Linear (top-5) | 0-shot (top-1) | 0-shot (top-5) |
---|---|---|---|---|---|---|
FP | 16 | ViT-B-32 | 0.764 | 0.941 | 0.664 | 0.896 |
FP | 16 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
FP | 16 | ViT-H-14 | 0.841 | 0.973 | 0.772 | 0.949 |
BNB | 8 | ViT-B-32 | 0.762 | 0.94 | 0.663 | 0.896 |
HQQ | 8 | ViT-B-32 | 0.763 | 0.941 | 0.663 | 0.896 |
BNB | 8 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
HQQ | 8 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
BNB | 8 | ViT-H-14 | 0.84 | 0.972 | 0.771 | 0.949 |
HQQ | 8 | ViT-H-14 | 0.841 | 0.973 | 0.772 | 0.95 |
BNB | 4 | ViT-B-32 | 0.733 | 0.925 | 0.608 | 0.859 |
HQQ | 4 | ViT-B-32 | 0.75 | 0.933 | 0.639 | 0.881 |
BNB | 4 | ViT-L-14 | 0.815 | 0.961 | 0.718 | 0.925 |
HQQ | 4 | ViT-L-14 | 0.815 | 0.962 | 0.721 | 0.926 |
BNB | 4 | ViT-H-14 | 0.837 | 0.971 | 0.766 | 0.947 |
HQQ | 4 | ViT-H-14 | 0.839 | 0.973 | 0.769 | 0.948 |
HQQ | 3 | ViT-B-32 | 0.664 | 0.881 | 0.481 | 0.753 |
HQQ | 3 | ViT-L-14 | 0.799 | 0.954 | 0.689 | 0.909 |
HQQ | 3 | ViT-H-14 | 0.831 | 0.969 | 0.755 | 0.943 |
HQQ | 2 | ViT-B-32 | 0.318 | 0.551 | 0.04 | 0.106 |
HQQ | 2 | ViT-L-14 | 0.731 | 0.917 | 0.559 | 0.815 |
HQQ | 2 | ViT-H-14 | 0.808 | 0.96 | 0.716 | 0.924 |
HQQ consistently outperforms 4-bit bitsandbytes (BNB) in zero-shot performance (+3.1% top-1 accuracy with ViT-B-32). Remarkably, a 3-bit quantized ViT-H-14 surpasses the full-precision ViT-L-14 (+2.4% top-1 zero-shot accuracy), and a 2-bit ViT-H-14 exceeds the full-precision ViT-B-32 (+5.2% top-1 zero-shot accuracy), showcasing the potential of HQQ for highly efficient low-bit models.
Conclusion
This study demonstrates that calibration-free quantization via Half-Quadratic Quantization (HQQ) achieves performance on par with data-dependent methods like GPTQ and AWQ. HQQ remains effective even at extreme low-bit settings, across diverse model sizes and applications. By leveraging an efficient Half-Quadratic splitting, HQQ reduces quantization time to minutes, even for models as large as Llama-2-70B, highlighting its practical utility.
Code to reproduce these results is available: https://github.com/mobiusml/hqq
Pre-quantized models are accessible on our Hugging Face page: https://huggingface.co/mobiuslabsgmbh
Citation
@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
Feel free to contact us.