Intermediate·4 min read

Quantization

Definition

Quantization is the process of reducing the numerical precision of a model's weights (and sometimes activations) from higher-bit formats (float32, float16) to lower-bit formats (int8, int4). This shrinks memory usage and increases inference speed, with controlled tradeoffs in model quality.

Why Quantization Is Essential

A 70B-parameter model in float16 needs ~140 GB of GPU memory for its weights alone, which means two A100 80GB GPUs just to load it. After quantization:

| Precision | Bits/param | 70B Model Size (weights only) | Runs On |
|-----------|-----------|---------------|---------|
| float32 | 32 | ~280 GB | 4× A100 80GB |
| float16/bfloat16 | 16 | ~140 GB | 2× A100 80GB |
| int8 | 8 | ~70 GB | 1× A100 80GB |
| int4 | 4 | ~35 GB | 2× RTX 4090 |
| int3 | 3 | ~26 GB | 2× RTX 3090 (too large for one 24 GB card) |
| int2 | 2 | ~17.5 GB | Consumer GPU (24 GB) |
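
The sizes in the table follow from parameters × bits ÷ 8. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes (10^9 bytes) and counting weights only, with no KV cache, activations, or per-block scale overhead. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Reproduce the 70B column of the table above.
for name, bits in [("float32", 32), ("float16", 16), ("int8", 8),
                   ("int4", 4), ("int3", 3), ("int2", 2)]:
    print(f"{name:>8}: {weight_memory_gb(70e9, bits):6.2f} GB")
```

Real deployments need headroom beyond these figures, so treat them as a lower bound when matching a model to a GPU.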

Number Format Basics

| Format | Range | Precision | Use |
|--------|-------|-----------|-----|
| float32 (FP32) | ±3.4×10^38 | ~7 decimal digits | Full training |
| float16 (FP16) | ±65504 | ~3 decimal digits | Training/inference |
| bfloat16 (BF16) | ±3.4×10^38 | ~2 decimal digits | Training preferred |
| int8 | -128 to 127 | 256 discrete values | Efficient inference |
| int4 | -8 to 7 | 16 discrete values | Aggressive inference |
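
Integer formats hold only a handful of discrete values, so floating-point weights are mapped onto them with a scale factor. The sketch below shows one simple way to do this, symmetric "absmax" quantization with a single scale for the whole tensor. It uses the range -127 to 127 so that the grid is symmetric around zero. The function names are hypothetical, and production methods use finer-grained scales.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.02, -0.45, 0.31, 1.27, -0.88]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The round-trip error of each value is at most half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.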

Quantization Methods

Post-Training Quantization (PTQ)

Apply after training is complete — no additional training required:

GPTQ (Generative Pre-trained Transformer Quantization)

  • Uses sample calibration data to minimize quantization error
  • Layer-by-layer quantization using second-order information
  • INT4 quality close to FP16
  • Used by: AutoGPTQ and Hugging Face Transformers

AWQ (Activation-Aware Weight Quantization)

  • Finds the most important weights (high activation magnitude) and protects them
  • Better quality than naive INT4
  • Used widely in production

GGUF / llama.cpp quantization

  • Format used by llama.cpp for CPU+GPU inference
  • Multiple quantization levels: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0
  • Q4_K_M (4-bit with mixed precision) is a popular sweet spot for quality/size
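
The Q8_0 level illustrates the block-wise idea behind these formats: weights are split into blocks of 32, and each block is stored as 8-bit integers plus one 16-bit scale. The sketch below is a simplified pure-Python illustration of that layout, not llama.cpp's actual code. It keeps the scale as a regular Python float rather than rounding it to float16, and it does not cover the more elaborate K-quant levels. All function names are hypothetical.

```python
BLOCK_SIZE = 32      # weights per block in Q8_0
SCALE_BITS = 16      # one float16 scale per block


def quantize_blocks(weights, block_size=BLOCK_SIZE):
    """Quantize each block to int8 codes with its own absmax scale."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 127 if absmax > 0 else 1.0
        codes = [round(w / scale) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Expand every block back to floats using that block's scale."""
    return [code * scale for scale, codes in blocks for code in codes]


def bits_per_weight(weight_bits=8, block_size=BLOCK_SIZE, scale_bits=SCALE_BITS):
    """Effective storage cost once the per-block scale is included."""
    return (block_size * weight_bits + scale_bits) / block_size


# Two blocks: the first contains one large outlier, the second only small weights.
weights = [8.0] + [0.01 * i for i in range(1, 32)] + [0.001 * i for i in range(1, 33)]
restored = dequantize_blocks(quantize_blocks(weights))
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(bits_per_weight())                      # 8.5
print(max(errors[:32]), max(errors[32:]))     # outlier block vs. clean block
```

Because each block has its own scale, the outlier degrades only the 32 weights that share its block, at a cost of half a bit per weight of extra storage.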

BitsAndBytes (bitsandbytes library)

  • Dynamic INT8 quantization with LLM.int8()
  • NF4 (4-bit NormalFloat) for QLoRA fine-tuning
  • Easy integration with HuggingFace Transformers

Quantization-Aware Training (QAT)

  • Simulate quantization during training so the model learns to be robust to it
  • Better quality than PTQ but requires training
  • Less common for LLMs due to training cost
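
"Simulating quantization during training" usually means fake quantization: the forward pass uses a rounded copy of each weight, while the update is applied to the underlying float weight. Rounding has zero gradient almost everywhere, so a common workaround is the straight-through estimator (STE), which treats rounding as the identity in the backward pass. The toy sketch below applies this to a single weight. The names, the 4-bit grid, and the learning rate are illustrative assumptions.

```python
def fake_quantize(w, scale, qmin=-8, qmax=7):
    """Round to the nearest point of a signed 4-bit grid, returned as a float."""
    return max(qmin, min(qmax, round(w / scale))) * scale


def qat_step(w, x, target, scale, lr):
    """One SGD step on the loss (w_q * x - target)^2 for a single weight."""
    w_q = fake_quantize(w, scale)     # forward pass sees the quantized weight
    error = w_q * x - target
    grad = 2 * error * x              # STE: treat d(w_q)/dw as 1
    return w - lr * grad              # the update is applied to the float weight


w = 0.8
for _ in range(3):
    w = qat_step(w, x=1.0, target=1.0, scale=0.25, lr=0.1)
    print(w, fake_quantize(w, 0.25))
```

In this toy run the float weight drifts upward until its quantized value lands on the grid point that matches the target, after which the gradient is zero and training stops moving it.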

Weight-Only vs. Weight + Activation Quantization

Weight-Only (W4A16, W8A16)

  • Weights stored in INT4/INT8
  • Activations remain in float16 at inference
  • Simpler to implement, good quality
  • Most common in practice
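
In a W4A16 setup the 4-bit codes are typically packed two per byte and expanded back to floating point just before the multiplication. The sketch below illustrates that flow in pure Python. The low-nibble-first packing order, the single per-tensor scale, and the function names are assumptions for illustration. Real kernels use their own layouts and per-group scales.

```python
def pack_int4(codes):
    """Pack signed 4-bit codes (-8..7) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]                      # pad to an even length
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return codes[:count]


def w4a16_dot(packed, count, scale, activations):
    """Dequantize weights on the fly, then do the dot product in floating point."""
    weights = [c * scale for c in unpack_int4(packed, count)]
    return sum(w * a for w, a in zip(weights, activations))


codes = [-8, 7, 3, -1, 0]
packed = pack_int4(codes)
print(len(packed))                       # 3 bytes for 5 weights
print(unpack_int4(packed, len(codes)))   # [-8, 7, 3, -1, 0]
```

The arithmetic itself still runs in floating point, so the savings come from memory footprint and bandwidth rather than from faster multiply units.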

Full Quantization (W8A8, W4A8)

  • Both weights AND activations quantized
  • More hardware-efficient (INT8 matrix multiply is fast on modern hardware)
  • Harder to implement without quality loss
  • Used by NVIDIA TensorRT-LLM, Intel Neural Compressor
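
When both operands are quantized, the multiply-accumulate can run entirely on integers, with one floating-point rescale by the product of the two scales at the end. The following is a minimal sketch of a W8A8 dot product, assuming symmetric per-tensor scales for both weights and activations. The function names are hypothetical, and production stacks typically use per-channel or per-token scales with calibration.

```python
def quantize(values, n_bits=8):
    """Symmetric per-tensor quantization; returns integer codes and the scale."""
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for int8
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def w8a8_dot(weights, activations):
    """Integer dot product with a single floating-point rescale at the end."""
    w_q, w_scale = quantize(weights)
    a_q, a_scale = quantize(activations)
    acc = sum(w * a for w, a in zip(w_q, a_q))   # integer accumulation (int32 on hardware)
    return acc * w_scale * a_scale


weights = [0.5, -1.0, 0.25, 0.75]
activations = [2.0, 1.0, -4.0, 0.5]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(exact, approx)
```

Activations are harder to quantize than weights because their range depends on the input, which is one reason this approach is more difficult to apply without quality loss.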

Quality vs. Size Tradeoff

Rule of thumb for language models:

| Quantization | Quality Loss | Use When |
|-------------|-------------|---------|
| INT8 (8-bit) | < 1% perplexity degradation | Safe default, minimal loss |
| INT4 (4-bit) | ~1–5% perplexity degradation | Good balance, widely used |
| INT3 (3-bit) | ~5–15% degradation | Only when size is critical |
| INT2 (2-bit) | Severe quality loss | Experimental |

Larger models tolerate quantization better: a 70B INT4 often beats a 13B FP16.

Quantization Formats in the Wild

| Format | Tool | Description |
|--------|------|-------------|
| GGUF | llama.cpp, Ollama | CPU/GPU, many quant levels, widely used |
| GPTQ | AutoGPTQ, HuggingFace | GPU inference, INT4/INT8 |
| AWQ | AutoAWQ | GPU, activation-aware INT4 |
| EXL2 | ExLlamaV2 | Flexible mixed-precision, high quality |
| MLX (4-bit) | Apple MLX | Apple Silicon optimized |

Flash Attention and Quantization Combined

Modern inference stacks combine:

  • Quantized weights (INT4 or INT8) for memory efficiency
  • Flash Attention for compute efficiency
  • KV cache in fp16 (often the memory bottleneck for long contexts)
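
To see why the KV cache can dominate at long context, its size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The sketch below applies that formula. The default dimensions (80 layers, 8 KV heads, head dimension 128) are assumed values typical of a 70B model with grouped-query attention. They are not taken from this article, and the function name is illustrative.

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch_size


for context in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens: {gib:5.2f} GiB in fp16")
```

Under these assumptions a single 128K-token sequence needs about 40 GiB of fp16 cache, more than the ~35 GB taken by the INT4 weights of a 70B model.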

When to Quantize

| Scenario | Recommendation |
|----------|---------------|
| Running locally (consumer GPU) | INT4 (GGUF Q4_K_M or AWQ) |
| Cloud inference at scale | INT8 or FP16 depending on GPU |
| Fine-tuning with QLoRA | NF4 (4-bit) base + bf16 adapters |
| Production API serving | FP16 with optional INT8 for large models |
| Research / accuracy-critical | FP16 or BF16 |

Practical Tools

| Tool | Use Case |
|------|---------|
| bitsandbytes | HuggingFace integration, NF4/INT8 |
| AutoGPTQ | GPTQ quantization |
| AutoAWQ | AWQ quantization |
| llama.cpp | GGUF format, cross-platform |
| Ollama | GGUF-based local inference, user-friendly |
| TensorRT-LLM | NVIDIA production quantization |

Related Concepts

  • Parameters, Inference, Latency, KV Cache, LoRA/PEFT, Memory, Model Deployment

Go Deeper With Live Instruction

This topic is covered in depth in our LLM engineering program (Session 10).