# Weights Compression

OpenVINO is the preferred backend for running Weights Compression; PyTorch is also supported.

## The algorithm description

The Weights Compression algorithm is aimed at compressing the weights of a model and can be used to optimize the footprint and performance of large models in which the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLMs). The algorithm compresses weights only for Linear and Embedding layers.
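To illustrate why targeting only these layers is usually enough for LLM-like models, the toy sketch below (hypothetical code, not part of the NNCF API) counts the fraction of parameters contributed by Linear and Embedding layers in a small PyTorch model.

```python
# Toy sketch (not part of the NNCF API): what fraction of a PyTorch model's
# parameters live in Linear and Embedding layers - the layers that the
# Weights Compression algorithm targets.
import torch.nn as nn


def linear_embedding_share(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    target = sum(
        p.numel()
        for m in model.modules()
        if isinstance(m, (nn.Linear, nn.Embedding))
        for p in m.parameters()
    )
    return target / total


# Hypothetical "LLM-like" block: embeddings and linear projections dominate
# the parameter count, which is why compressing only these layers already
# shrinks the model footprint substantially.
toy = nn.Sequential(
    nn.Embedding(32000, 512),
    nn.Linear(512, 2048),
    nn.Linear(2048, 512),
    nn.LayerNorm(512),
)
print(f"share of Linear/Embedding parameters: {linear_embedding_share(toy):.3f}")
```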

## Supported modes

By default, weights are compressed asymmetrically to the 8-bit integer data type - the "INT8_ASYM" mode. The OpenVINO backend also supports three modes of mixed-precision weight quantization with a 4-bit data type as the primary precision: INT4_SYM, INT4_ASYM and NF4. In the INT4_SYM mode, the primary precision is an unsigned 4-bit integer and weights are quantized to it symmetrically with a fixed zero point equal to 8. In the INT4_ASYM mode, it is also an unsigned 4-bit integer, but weights are quantized to it asymmetrically with a typical non-fixed zero point. In the NF4 mode, it is the nf4 data type without a zero point. All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) along the channel dimension shares quantization parameters (scale). All embeddings and the last linear layers are always compressed to the 8-bit integer data type. The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter, e.g. ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to the 8-bit asymmetric integer data type.
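For illustration, the following NumPy sketch (not the NNCF implementation) mimics grouped symmetric 4-bit quantization as described above: groups of 128 weights share a single scale, and values are mapped to unsigned 4-bit integers with a fixed zero point of 8.

```python
# Illustrative sketch only - not the NNCF implementation.
# Symmetric grouped 4-bit quantization: each group of 128 weights shares one
# scale; values are mapped to unsigned 4-bit integers in [0, 15] with a fixed
# zero point of 8.
import numpy as np


def quantize_int4_sym(weights: np.ndarray, group_size: int = 128):
    w = weights.reshape(-1, group_size)                            # one row per group
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / 7
    q = np.clip(np.round(w / scale) + 8, 0, 15).astype(np.uint8)   # zero point = 8
    return q, scale


def dequantize(q: np.ndarray, scale: np.ndarray, original_shape):
    return ((q.astype(np.float32) - 8) * scale).reshape(original_shape)


w = np.random.randn(512, 256).astype(np.float32)
q, scale = quantize_int4_sym(w)
w_hat = dequantize(q, scale, w.shape)
print("max abs quantization error:", np.abs(w - w_hat).max())
```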

## User guide

- Compress weights asymmetrically to 8-bit integer data type.

  ```python
  from nncf import compress_weights

  compressed_model = compress_weights(model)
  ```

- Compress weights symmetrically to 8-bit integer data type.

  ```python
  from nncf import compress_weights
  from nncf import CompressWeightsMode

  compressed_model = compress_weights(model, mode=CompressWeightsMode.INT8_SYM)
  ```

- Compress weights symmetrically to 4-bit integer data type with group size = 128, except embeddings and last linear layers, which are compressed asymmetrically to 8-bit integer data type.

  ```python
  from nncf import compress_weights
  from nncf import CompressWeightsMode

  compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
  ```

- Generally, INT4_SYM mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase. Compressing weights asymmetrically (INT4_ASYM mode) is the way to increase accuracy; however, in turn, it slows down inference a bit. If the accuracy or perplexity is still not satisfactory, there are two more hyper-parameters to tune: group_size and ratio. A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed. Below is an example of how to compress weights of 90% of layers to 4-bit integer asymmetrically with group size 64, and the rest of the layers to 8-bit asymmetric integer data type. The same parametrization is applicable for INT4_SYM mode.

  ```python
  from nncf import compress_weights
  from nncf import CompressWeightsMode

  compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9)
  ```

- NF4 mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster than models quantized to 8-bit asymmetric integer. Here is an example of how to compress weights to the nf4 data type with group size = 128. Different group_size and ratio values are also supported.

  ```python
  from nncf import compress_weights
  from nncf import CompressWeightsMode

  compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
  ```
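For context, a typical place for these calls is between reading and saving an OpenVINO model. Below is a minimal end-to-end sketch that assumes OpenVINO's Python API (ov.Core().read_model and ov.save_model) and hypothetical model paths.

```python
# Minimal end-to-end sketch (hypothetical paths, assuming OpenVINO's Python API).
import openvino as ov
from nncf import compress_weights, CompressWeightsMode

core = ov.Core()
model = core.read_model("llm_fp32.xml")  # hypothetical IR produced beforehand

# 80% of eligible layers go to 4-bit symmetric integer with group size 128;
# embeddings, last linear layers and the remaining layers stay in 8-bit
# asymmetric integer.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

ov.save_model(compressed_model, "llm_int4.xml")  # hypothetical output path
```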

## Evaluation results

Here is the perplexity and model size before and after weight compression for different language models on the Lambada OpenAI dataset. g32 refers to group size equal to 32, r60 to ratio equal to 0.6.

| Model | Mode | Perplexity | Perplexity Increase | Model Size (Gb) |
|-------|------|------------|---------------------|-----------------|
| databricks/dolly-v2-3b | fp32 | 5.01 | 0 | 10.3 |
| databricks/dolly-v2-3b | int8_asym | 5.07 | 0.05 | 2.6 |
| databricks/dolly-v2-3b | int4_asym_g32_r50 | 5.28 | 0.26 | 2.2 |
| databricks/dolly-v2-3b | nf4_g128_r60 | 5.19 | 0.18 | 1.9 |
| facebook/opt-6.7b | fp32 | 4.25 | 0 | 24.8 |
| facebook/opt-6.7b | int8_asym | 4.27 | 0.01 | 6.2 |
| facebook/opt-6.7b | int4_asym_g64_r80 | 4.32 | 0.07 | 4.1 |
| facebook/opt-6.7b | nf4_g64 | 4.35 | 0.1 | 3.6 |
| meta-llama/Llama-2-7b-chat-hf | fp32 | 3.28 | 0 | 25.1 |
| meta-llama/Llama-2-7b-chat-hf | int8_asym | 3.29 | 0.01 | 6.3 |
| meta-llama/Llama-2-7b-chat-hf | int4_asym_g128_r80 | 3.41 | 0.14 | 4.0 |
| meta-llama/Llama-2-7b-chat-hf | nf4_g128 | 3.41 | 0.13 | 3.5 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | fp32 | 4.15 | 0 | 25.6 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int8_asym | 4.17 | 0.02 | 6.4 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | nf4_ov_g32_r60 | 4.28 | 0.13 | 5.1 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int4_asym_g128 | 4.17 | 0.02 | 3.6 |
| meta-llama/Llama-2-13b-chat-hf | fp32 | 2.92 | 0 | 48.5 |
| meta-llama/Llama-2-13b-chat-hf | int8_asym | 2.91 | 0 | 12.1 |
| meta-llama/Llama-2-13b-chat-hf | int4_sym_g64_r80 | 2.98 | 0.06 | 8.0 |
| meta-llama/Llama-2-13b-chat-hf | nf4_g128 | 2.95 | 0.04 | 6.6 |

## Limitations

- The algorithm is supported for OpenVINO and PyTorch models.
- The compression is applied in-place (see the sketch after this list).
- The compressed model is not trainable.
- INT8_SYM, INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed-precision selection are available for the OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be faster than models quantized to 8-bit integer.
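Because compression modifies the model in-place, a simple way to keep the original weights around (shown here for a PyTorch model; this is a usage sketch, not an NNCF requirement) is to compress a deep copy:

```python
# Usage sketch: keep the original model intact by compressing a deep copy.
import copy

from nncf import compress_weights

model_to_compress = copy.deepcopy(model)  # 'model' is your torch.nn.Module
compressed_model = compress_weights(model_to_compress)
```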

## Additional resources