Definition
Parameters are the internal numerical variables of a neural network that are learned during training. They store the model's "knowledge": the patterns, facts, and relationships extracted from the training data. In LLMs, parameters are the weights and biases of the embedding, attention, feed-forward, and normalization layers of the Transformer architecture.
What Parameters Actually Are
- Floating-point numbers (typically float16 or bfloat16 in modern LLMs)
- Organized into matrices and vectors throughout the model
- Adjusted via gradient descent during training to minimize loss
- Frozen (fixed) after training unless fine-tuning occurs
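The idea of adjusting parameters by gradient descent can be made concrete with a toy model. The sketch below fits two parameters (a weight `w` and a bias `b`) to a few points generated from y = 2x + 1, using plain gradient descent on mean squared error. The data, learning rate, and step count are illustrative assumptions, not values from any real LLM.

```python
# Toy illustration: two parameters (w, b) learned by gradient descent.
# Data, learning rate, and step count are assumptions chosen for the demo.

def train(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0  # parameters start at arbitrary values
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]  # targets generated with w=2, b=1
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

Once training stops, `w` and `b` stay fixed, which mirrors the "frozen after training" point: inference only reads the learned values.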
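Symbols used in the parameter count formula: L = number of layers, d = model dimension, V = vocabulary size, d_ffn = 4d (standard FFN expansion).
Kinds of knowledge held in parameters:
- Factual knowledge: "Paris is the capital of France"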
- Grammar and syntax: sentence structure rules
- Reasoning patterns: logical inference chains
- Style and tone: formal vs. informal registers
- World model: physical and social intuitions
Techniques that operate on parameters:
- Parameter sharing: some architectures reuse weights across layers
- LoRA (Low-Rank Adaptation): fine-tune a tiny fraction of parameters by decomposing weight updates into low-rank matrices
- Quantization: reduce parameter precision to save memory
- Pruning: zero out low-magnitude parameters to reduce compute
Related topics: Pre-training, Fine-Tuning, LoRA, Quantization, Attention, Embeddings, LLM
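Two of the techniques named above, LoRA and quantization, reduce to simple arithmetic. The following is a minimal sketch, not any library's implementation: `lora_trainable_params`, `quantize_int8`, and `dequantize` are hypothetical helper names, the 4096 × 4096 layer and rank 8 are illustrative choices, and the quantizer is a basic symmetric per-tensor scheme that assumes at least one non-zero weight.

```python
# Sketch of two techniques from the list above; sizes and values are illustrative.

def lora_trainable_params(d_in, d_out, rank):
    # LoRA trains two small matrices, B (d_out x rank) and A (rank x d_in),
    # whose product B @ A is added to the frozen weight W (d_out x d_in).
    return rank * (d_in + d_out)

def quantize_int8(weights):
    # Symmetric per-tensor quantization: the largest magnitude maps to 127.
    # Assumes at least one non-zero weight (otherwise scale would be 0).
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the stored integers.
    return [v * scale for v in q]

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"LoRA trains {lora:,} of {full:,} weights ({lora / full:.2%})")

weights = [0.12, -0.5, 0.33, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(w, 3) for w in restored])
```

The restored values differ from the originals by at most half a quantization step (`scale / 2`), which is the precision traded away for a 1-byte representation.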
Where Parameters Live in a Transformer
| Component | Shape | Role |
|-----------|-------|------|
| Token Embedding Matrix | [vocab_size × d_model] | Maps token IDs to vectors |
| Q, K, V Projection Matrices | [d_model × d_head] × heads, for each of Q, K, V | Compute attention queries, keys, values |
| Output Projection (Attention) | [d_model × d_model] | Recombines attention heads |
| Feed-Forward Layer 1 | [d_model × d_ffn] | Expands representation |
| Feed-Forward Layer 2 | [d_ffn × d_model] | Projects back down |
| Layer Norm Weights | [d_model] per scale or shift vector | Rescale and shift normalized activations |
| LM Head (output) | [d_model × vocab_size] | Maps hidden states to next-token logits |
Parameter Count Formula (Approximate)
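For a Transformer with L layers, model dimension d, vocabulary size V, and FFN width d_ffn = 4d:
```
Total ≈ V×d (embeddings) + L × (4×d² (attention) + 8×d² (FFN)) + V×d (LM head)
      ≈ 2Vd + 12Ld²
```
The estimate omits biases, layer norm weights, and positional embeddings, which are small by comparison. It also counts the LM head separately from the embeddings; models that tie the two save V×d parameters.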
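As a sanity check on the formula above, the sketch below counts parameters shape by shape and compares the result with the closed form 2Vd + 12Ld². The function names are hypothetical, and the configuration (96 layers, d_model 12288, vocabulary 50257) is the published GPT-3 175B setup; treating the embedding and LM head as untied is an assumption made to match the formula.

```python
# Sketch: compare a shape-by-shape count with the closed form 2Vd + 12Ld^2.
# Assumes an untied embedding and LM head, d_ffn = 4d, and no bias terms;
# layer norms and positional embeddings are ignored, as in the formula.

def count_from_shapes(num_layers, d_model, vocab_size):
    d_ffn = 4 * d_model
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = d_model * d_ffn + d_ffn * d_model    # expand, then project back
    lm_head = d_model * vocab_size
    return embedding + num_layers * (attention + ffn) + lm_head

def closed_form(num_layers, d_model, vocab_size):
    return 2 * vocab_size * d_model + 12 * num_layers * d_model ** 2

# GPT-3-scale configuration: 96 layers, d_model 12288, vocabulary 50257.
L, d, V = 96, 12288, 50257
total = count_from_shapes(L, d, V)
assert total == closed_form(L, d, V)
print(f"{total / 1e9:.1f}B parameters")  # 175.2B
```

Almost all of the total comes from the 12Ld² term; the two V×d terms contribute only about 1.2B here, which is why depth and width dominate model size at this scale.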
Scale Reference
| Model | Parameters | Approx. Size (fp16) |
|-------|-----------|---------------------|
| GPT-2 Small | 124M (originally reported as 117M) | ~250 MB |
| GPT-3 | 175B | ~350 GB |
| LLaMA 3 8B | 8B | ~16 GB |
| LLaMA 3 70B | 70B | ~140 GB |
| GPT-4 (unofficial estimate; not disclosed by OpenAI) | ~1T | ~2 TB |
Parameter Precision (Data Types)
| Format | Bits | Memory per param | Notes |
|--------|------|-----------------|-------|
| float32 (fp32) | 32 | 4 bytes | Full precision; traditional training default |
| float16 (fp16) | 16 | 2 bytes | Inference and fine-tuning |
| bfloat16 (bf16) | 16 | 2 bytes | Same exponent range as fp32 with fewer mantissa bits; often preferred over fp16 |
| int8 | 8 | 1 byte | Quantized inference |
| int4 | 4 | 0.5 bytes | Aggressive quantization (e.g. GPTQ, 4-bit GGUF files) |
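The sizes in the scale table follow directly from bytes per parameter. A minimal sketch, assuming decimal gigabytes (1 GB = 10⁹ bytes) and counting weights only; `weight_memory_gb` is a hypothetical helper, and real quantized formats add a small overhead for scales and metadata that is not modeled here.

```python
# Bytes per parameter, taken from the precision table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    # Weights only: excludes activations, KV cache, and optimizer state.
    # Uses decimal gigabytes (1 GB = 1e9 bytes).
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"8B model in {dtype}: {weight_memory_gb(8e9, dtype):.0f} GB")
```

For example, `weight_memory_gb(70e9, "fp16")` gives 140.0, matching the LLaMA 3 70B row in the scale table.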
What Parameters Encode
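Parameters implicitly store factual knowledge, grammar and syntax, reasoning patterns, style and tone, and a rough world model.
This is sometimes called parametric knowledge, as distinct from information retrieved at runtime (RAG).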
Parameters vs. Hyperparameters
| Type | Definition | Examples | When Set |
|------|-----------|---------|---------|
| Parameters | Learned weights inside the model | Embedding matrix values, attention projection matrices | During training |
| Hyperparameters | Configuration choices about training/architecture | Learning rate, batch size, num layers, d_model | Before training |
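The distinction in the table can be shown in a few lines. In this sketch, which uses hypothetical names (`Hyperparameters`, `init_embedding`) and toy sizes, the hyperparameters are fixed configuration chosen up front, while the parameters are the numbers whose shape that configuration determines and whose values training would later change.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    # Chosen before training; never updated by gradient descent.
    d_model: int = 8
    vocab_size: int = 16
    learning_rate: float = 1e-3

def init_embedding(hp, seed=0):
    # Parameters: the shape is fixed by the hyperparameters, the values
    # start random and would be adjusted during training.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hp.d_model)]
            for _ in range(hp.vocab_size)]

hp = Hyperparameters()
embedding = init_embedding(hp)
num_params = sum(len(row) for row in embedding)
print(num_params)  # 128 = vocab_size * d_model
```

Changing `d_model` or `vocab_size` changes how many parameters exist; changing `learning_rate` changes only how they are updated.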