Parameters — FDE@ProdAI Blog

Definition

Parameters are the internal numerical variables of a neural network that are learned during training. They store the model's "knowledge" — all the patterns, facts, and relationships extracted from the training data. In LLMs, parameters are the weights and biases in every matrix multiplication throughout the Transformer architecture.

What Parameters Actually Are

Floating-point numbers (typically float16 or bfloat16 in modern LLMs)
Organized into matrices and vectors throughout the model
Adjusted via gradient descent during training to minimize loss
Frozen (fixed) after training unless fine-tuning occurs

Where Parameters Live in a Transformer

| Component | Parameter Type | Role |

|-----------|---------------|------|

| Token Embedding Matrix | [vocab_size × d_model] | Maps token IDs to vectors |

| Q, K, V Projection Matrices | [d_model × d_head] × heads | Compute attention queries, keys, values |

| Output Projection (Attention) | [d_model × d_model] | Recombines attention heads |

| Feed-Forward Layer 1 | [d_model × d_ffn] | Expands representation |

| Feed-Forward Layer 2 | [d_ffn × d_model] | Projects back down |

| Layer Norm Weights | [d_model] | Scale and shift normalization |

| LM Head (output) | [d_model × vocab_size] | Predicts next token probabilities |

Parameter Count Formula (Approximate)

For a Transformer with:

L layers
d model dimension
V vocabulary size
d_ffn = 4d (standard FFN expansion)

Total ≈ V×d (embeddings) + L × (4×d² (attention) + 8×d² (FFN)) + V×d (LM head)

≈ 2Vd + 12Ld²

Scale Reference

| Model | Parameters | Approx. Size (fp16) |

|-------|-----------|---------------------|

| GPT-2 Small | 117M | ~240 MB |

| GPT-3 | 175B | ~350 GB |

| LLaMA 3 8B | 8B | ~16 GB |

| LLaMA 3 70B | 70B | ~140 GB |

| GPT-4 (estimated) | ~1T | ~2 TB |

Parameter Precision (Data Types)

|--------|------|-----------------|-------|

| int8 | 8 | 1 byte | Quantized inference |

| int4 | 4 | 0.5 bytes | Aggressive quantization (GGUF, GPTQ) |

What Parameters Encode

Parameters implicitly store:

Factual knowledge: "Paris is the capital of France"
Grammar and syntax: sentence structure rules
Reasoning patterns: logical inference chains
Style and tone: formal vs. informal registers
World model: physical and social intuitions

This is sometimes called parametric knowledge — distinct from information retrieved at runtime (RAG).

Parameters vs. Hyperparameters

|------|-----------|---------|---------|

Efficient Parameter Use

Parameter sharing: some architectures reuse weights across layers
LoRA (Low-Rank Adaptation): fine-tune a tiny fraction of parameters by decomposing weight updates into low-rank matrices
Quantization: reduce parameter precision to save memory
Pruning: zero out low-magnitude parameters to reduce compute

Related Concepts

Pre-training, Fine-Tuning, LoRA, Quantization, Attention, Embeddings, LLM