To Quantize or not to Quantize

Asking the right questions about quantization

Introduction to Quantization

Quantization refers to the technique of representing a model using fewer bits by reducing the precision of its parameters. This process involves converting continuous or high-precision values into a smaller set of discrete values, typically by mapping floating-point numbers to integers. The primary goal of quantizing large language models (LLMs) is to decrease memory usage and accelerate inference.
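To make the "floating-point numbers to integers" idea concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. This is the simplest possible scheme and is not NF4; the function names and the toy weights are made up for illustration.

```python
def quantize_absmax(values, num_bits=8):
    """Map floats to signed integers using a single absmax scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.91, -0.04, 0.30]  # toy values, not real model weights
q8, scale8 = quantize_absmax(weights, num_bits=8)
q4, scale4 = quantize_absmax(weights, num_bits=4)

# Largest round-trip error for each bit width
error8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, scale8)))
error4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, scale4)))
print("8-bit ints:", q8, "max error:", error8)
print("4-bit ints:", q4, "max error:", error4)
```

Fewer bits means fewer integer levels to choose from, so the round-trip error grows as the bit width shrinks. That error is the source of the trade-offs explored in the rest of this post.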

There are several methods to quantize a model, which I won't get into as there are already excellent resources available (see reference [3]). Instead, I want to focus on a specific use case I get asked about a lot as an AI consultant and teacher: deploying an off-the-shelf model without further fine-tuning. These models could be ones pre-trained by other organizations, like Llama-3-8B, or ones previously fine-tuned on specific datasets without quantization. This post will not cover the process of fine-tuning while quantizing, which involves techniques such as QLoRA (I have code examples for this in reference [2]).

Python code to quantize a model is relatively straightforward using popular packages like transformers, which has implementations of algorithms like NF4 (see the code sample below and reference [3] for more details). NF4, which stands for 4-bit NormalFloat, is a particularly effective strategy for maintaining the performance of AI models. Originally introduced in the QLoRA paper, NF4 has become a preferred choice in modern quantization strategies.

# Import torch (for the compute dtype) and the necessary classes from the transformers library
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Define the model name to load from Hugging Face's model hub
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

# Configure the quantization settings using BitsAndBytesConfig
# (the argument values below are filled in from these comments)
# Setting load_in_4bit to True enables 4-bit quantization
# bnb_4bit_use_double_quant also quantizes the quantization constants to save additional memory
# bnb_4bit_quant_type specifies the NF4 quantization algorithm
# bnb_4bit_compute_dtype sets the data type for computation to bfloat16 for efficiency
bits_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Initialize the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and configure the quantized model
qt_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bits_config,
).eval()  # Set the model to evaluation mode which disables training specific operations like dropout

# Load the non-quantized version of the same model
non_qt_model = AutoModelForCausalLM.from_pretrained(
    model_name,
).eval()  # Set the model to evaluation mode

The less straightforward part is testing the quantized and non-quantized models side by side on our three main considerations:

  1. Optimizing Inference - memory and latency reduction

  2. Raw token output differences - measuring the raw differences between the next token prediction outputs

  3. Performance on benchmarks / test sets - running generative benchmarks and comparing the two models

I will be using Meta’s Llama-3-8B-Instruct model as my reference.

Consideration 1 - Optimizing Inference

Probably the most well known benefits of quantization are the inference gains both in memory usage and in latency/throughput. Lower parameter precision means less memory required to hold the model and faster computations. The memory usage and latency differences are dramatic between the two models and hold at both small and larger batch sizes.
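As a rough back-of-the-envelope check on the memory side, the sketch below estimates the space needed for the weights alone at different precisions. It assumes roughly 8 billion parameters and ignores activations, the KV cache, and the small overhead of quantization constants, so treat the numbers as approximate. The helper name is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Going from 16 bits to 4 bits per parameter cuts the weight footprint to about a quarter, which is in line with the kind of gap the measurements show.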

Measuring the peak memory usage and latency of the forward pass of Llama-3-8B shows striking differences. The non-quantized model (red) uses far more memory (top) and takes far longer to process inputs in batch sizes between 1 and 32 (bottom).

Quantized models are supposed to be faster and more memory efficient so this is just the tip of the iceberg. Are they as reliable as their non-quantized cousin? Are they better? Worse? Let’s see how we can find out.

Consideration 2 - Raw Token Output Differences

For this next graph, I asked both versions of the Llama 3 model 163 questions from a subset of MMLU-Virology (the benchmark content isn’t as relevant here). For each input, I compared the raw next-token predictions of the two models using the Jaccard Index (Similarity): a similarity metric between two sets, defined as the number of items they have in common divided by the total number of unique items between them. I computed it over the top k predicted tokens at various cutoff points (k=1, 2, 3, etc.). This gives us a straightforward way to quantify the differences in raw model output between the quantized and non-quantized models.

I also chose the Jaccard Index for its robustness in scenarios where the exact alignment of token sets is less important than the overall overlap, making it ideal for evaluating models where slight deviations in token predictions are acceptable. We can see that most tokens are in common, but a not-insignificant number of tokens are in fact different.
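The top-k comparison described above can be sketched in a few lines of Python. The two next-token distributions here are hypothetical stand-ins for what each model might return for a single prompt, and the helper names are mine, not from any library.

```python
def top_k_tokens(token_probs, k):
    """Return the set of the k most probable tokens from a {token: prob} dict."""
    ranked = sorted(token_probs, key=token_probs.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical next-token distributions for one prompt
non_quantized = {"A": 0.50, "B": 0.25, "C": 0.15, "D": 0.07, "E": 0.03}
quantized = {"A": 0.45, "C": 0.28, "B": 0.14, "E": 0.08, "D": 0.05}

for k in (1, 2, 3):
    score = jaccard(top_k_tokens(non_quantized, k), top_k_tokens(quantized, k))
    print(f"k={k}: Jaccard = {score:.2f}")
```

Because the metric works on sets, it ignores the order of tokens inside the top k: at k=3 the two toy models agree completely even though they rank "B" and "C" differently, while at k=2 the same disagreement drops the score to one third.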

The Jaccard similarity between the top k predicted tokens of the quantized and non-quantized model on a subset of MMLU-virology

Given this graph, roughly speaking, we can expect about 75-80% of the tokens to match in the top 1, 3, 5, 10, and 20 predicted tokens for this test set, which can lead to performance differences (see consideration 3). These raw token differences will not only affect performance on test sets but will also change the effect of the inference parameters that we set. For example, top_p restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches a threshold, so a top_p value tuned for a non-quantized model might yield drastically different results on the quantized version.
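Here is a simplified sketch of why the same top_p can behave differently on two models. It implements nucleus filtering in plain Python, not the exact logic of any particular library, and the two distributions are hypothetical.

```python
def nucleus(token_probs, top_p):
    """Return the smallest list of top-ranked tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    ranked = sorted(token_probs.items(), key=lambda item: item[1], reverse=True)
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


# Hypothetical next-token distributions for the same prompt
non_quantized_probs = {"A": 0.50, "B": 0.25, "C": 0.15, "D": 0.07, "E": 0.03}
quantized_probs = {"A": 0.45, "C": 0.28, "B": 0.14, "E": 0.08, "D": 0.05}

print("non-quantized candidates:", nucleus(non_quantized_probs, top_p=0.7))
print("quantized candidates:    ", nucleus(quantized_probs, top_p=0.7))
```

With the same threshold, the first toy model can sample from "A" or "B" while the second can sample from "A" or "C". A small shift in probabilities changes which tokens are even eligible.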

Consideration 3 - Performance on Test Sets

Considerations 1 and 2 measured the differences in raw next-token predictions, both in similarity and in speed/memory usage, but neither considered the accuracy of what those tokens represented. We saw not-insignificant differences in which tokens might be output, which suggests that there will be differences in benchmark performance.

I’m planning a post on benchmarking in more detail but for now, I’m going to pass a very simple 0-shot prompt to each model on a subset of MMLU-Virology. I measured the words per minute (which I expected to be better for the quantized model) and the accuracy on the multiple choice questions.
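For reference, the two metrics can be computed along these lines. This is a minimal sketch with hypothetical generations, timings, and answers; it is not the code behind the graphs, and the helper names are made up.

```python
def words_per_minute(generations):
    """Total whitespace-separated words divided by total generation time, scaled to minutes.

    `generations` is a list of (generated_text, seconds_taken) pairs.
    """
    total_words = sum(len(text.split()) for text, _ in generations)
    total_seconds = sum(seconds for _, seconds in generations)
    return total_words / total_seconds * 60


def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical outputs and timings
generations = [("The answer is B", 2.0), ("I think the answer is C", 3.0)]
predictions = ["B", "C", "A", "D"]
answers = ["B", "C", "D", "D"]

print("WPM:", words_per_minute(generations))
print("Accuracy:", accuracy(predictions, answers))
```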

Note: The only inference parameter I set was a temperature of 0.1 to induce some more consistency and reproducibility of the experiment. This choice will also highlight any token differences by making the differences in token probabilities sharper.
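To see what a low temperature does to the token probabilities, here is a small sketch of standard temperature-scaled softmax in plain Python. The logits are hypothetical and the function name is mine.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.1)

print("temperature 1.0:", [round(p, 3) for p in probs_default])
print("temperature 0.1:", [round(p, 3) for p in probs_sharp])
```

At a temperature of 0.1 nearly all of the probability mass lands on the top-scoring token, so if quantization changes which token scores highest, the two models will diverge almost deterministically.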

The quantized model (red in both graphs) has a better words-per-minute rate (top) but performs slightly worse on a subset of the MMLU benchmark (bottom).

Right out of the gate, the non-quantized model performs slightly better on this benchmark subset but has a much lower WPM (no surprise there given the forward pass calculations in consideration 1). The difference in performance comes down to the fact that quantization alters the model from how it was trained. It’s not always going to be true that the quantized version of a model will perform worse, but especially on well-known benchmarks like MMLU that companies like Meta, OpenAI, Anthropic, etc. test their models on, it’s a good bet. It’s always good to test.

To mitigate this, we could fine-tune the model while quantizing using a technique like QLoRA (reference [2]).


Quantization offers tangible benefits in terms of reducing memory usage and enhancing the speed of computations. This has been demonstrated effectively in the case of Llama-3-8B, where quantized models significantly outperform their non-quantized counterparts in memory efficiency and processing speed during inference.

However, quantization does come with built-in trade-offs. The alterations in precision can lead to differences in token output and potentially affect performance on benchmarks and practical applications. The balance between efficiency and accuracy must be carefully tested and managed, and for critical applications, performing some fine-tuning post-quantization using QLoRA may be necessary to restore or enhance model performance.

I hope this helps!


[1] Code for these graphs

[2] QLoRA example (see the SFT notebook in Colab)

[3] Primer on Quantization from HuggingFace: