Beginner·3 min read

Parameters

Parameters are the internal numerical variables of a neural network that are learned during training. They store the model's "knowledge" — all the pat

Definition

Parameters are the internal numerical variables of a neural network that are learned during training. They store the model's "knowledge" — all the patterns, facts, and relationships extracted from the training data. In LLMs, parameters are the weights and biases in every matrix multiplication throughout the Transformer architecture.

What Parameters Actually Are

  • Floating-point numbers (typically float16 or bfloat16 in modern LLMs)
  • Organized into matrices and vectors throughout the model
  • Adjusted via gradient descent during training to minimize loss
  • Frozen (fixed) after training unless fine-tuning occurs
  • Where Parameters Live in a Transformer

    | Component | Parameter Type | Role |

    |-----------|---------------|------|

    | Token Embedding Matrix | [vocab_size × d_model] | Maps token IDs to vectors |

    | Q, K, V Projection Matrices | [d_model × d_head] × heads | Compute attention queries, keys, values |

    | Output Projection (Attention) | [d_model × d_model] | Recombines attention heads |

    | Feed-Forward Layer 1 | [d_model × d_ffn] | Expands representation |

    | Feed-Forward Layer 2 | [d_ffn × d_model] | Projects back down |

    | Layer Norm Weights | [d_model] | Scale and shift normalization |

    | LM Head (output) | [d_model × vocab_size] | Predicts next token probabilities |

    Parameter Count Formula (Approximate)

    For a Transformer with:

  • L layers
  • d model dimension
  • V vocabulary size
  • d_ffn = 4d (standard FFN expansion)
  • `

    Total ≈ V×d (embeddings) + L × (4×d² (attention) + 8×d² (FFN)) + V×d (LM head)

    ≈ 2Vd + 12Ld²

    `

    Scale Reference

    | Model | Parameters | Approx. Size (fp16) |

    |-------|-----------|---------------------|

    | GPT-2 Small | 117M | ~240 MB |

    | GPT-3 | 175B | ~350 GB |

    | LLaMA 3 8B | 8B | ~16 GB |

    | LLaMA 3 70B | 70B | ~140 GB |

    | GPT-4 (estimated) | ~1T | ~2 TB |

    Parameter Precision (Data Types)

    | Format | Bits | Memory per param | Notes |

    |--------|------|-----------------|-------|

    | float32 (fp32) | 32 | 4 bytes | Training default |

    | float16 (fp16) | 16 | 2 bytes | Inference/fine-tuning |

    | bfloat16 (bf16) | 16 | 2 bytes | Better range than fp16, preferred |

    | int8 | 8 | 1 byte | Quantized inference |

    | int4 | 4 | 0.5 bytes | Aggressive quantization (GGUF, GPTQ) |

    What Parameters Encode

    Parameters implicitly store:

  • Factual knowledge: "Paris is the capital of France"
  • Grammar and syntax: sentence structure rules
  • Reasoning patterns: logical inference chains
  • Style and tone: formal vs. informal registers
  • World model: physical and social intuitions
  • This is sometimes called parametric knowledge — distinct from information retrieved at runtime (RAG).

    Parameters vs. Hyperparameters

    | Type | Definition | Examples | When Set |

    |------|-----------|---------|---------|

    | Parameters | Learned weights inside the model | Embedding matrix values, attention weights | During training |

    | Hyperparameters | Configuration choices about training/architecture | Learning rate, batch size, num layers, d_model | Before training |

    Efficient Parameter Use

  • Parameter sharing: some architectures reuse weights across layers
  • LoRA (Low-Rank Adaptation): fine-tune a tiny fraction of parameters by decomposing weight updates into low-rank matrices
  • Quantization: reduce parameter precision to save memory
  • Pruning: zero out low-magnitude parameters to reduce compute
  • Related Concepts

  • Pre-training, Fine-Tuning, LoRA, Quantization, Attention, Embeddings, LLM

Go Deeper With Live Instruction

This topic is covered in depth in our llm engineering program (Session 2).