Transformer

Definition

The Transformer is the neural network architecture that underlies nearly all modern LLMs. Introduced in the paper "Attention Is All You Need" (Vaswani et al., Google Brain, 2017), it replaced recurrent networks (RNNs/LSTMs) by processing all tokens of a sequence in parallel using self-attention, which made training efficient on GPUs and enabled massive scaling.

Why Transformers Replaced RNNs

| Aspect | RNN/LSTM | Transformer |
|--------|----------|-------------|
| Processing | Sequential (token by token) | Parallel (all tokens at once) |
| Long-range dependencies | Poor (vanishing gradient) | Excellent (direct attention) |
| Training speed | Slow (can't parallelize) | Fast (GPU-friendly) |
| Scaling | Difficult | Scales to billions of params |

Architecture Overview

```
Input Text
[Tokenizer] → Token IDs
[Embedding Layer] → Token Vectors + Positional Encoding
┌─────────────────────────────────┐
│  Transformer Block × N          │
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
│  ┌───────────────────────────┐  │
│  │ Feed-Forward Network      │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
[LM Head] → Logits over vocabulary
[Softmax] → Probability distribution
Next Token
```
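The last three stages of the pipeline (LM head, softmax, next token) can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the sizes, the random `hidden` vector, the random `lm_head` matrix, and the function names are all assumptions, and greedy argmax is only one of several possible decoding strategies.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def next_token(hidden: np.ndarray, lm_head: np.ndarray):
    """Project the final hidden state to vocabulary logits, then pick greedily."""
    logits = hidden @ lm_head          # (d_model,) @ (d_model, vocab) -> (vocab,)
    probs = softmax(logits)            # probability distribution over the vocabulary
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20            # toy sizes, chosen only for illustration
hidden = rng.normal(size=d_model)      # stand-in for the last block's output at the final position
lm_head = rng.normal(size=(d_model, vocab_size))  # random here; learned in a real model
token_id, probs = next_token(hidden, lm_head)
```

In practice, samplers often draw from `probs` (with temperature, top-k, or top-p) instead of always taking the argmax.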

Core Components

1. Input Embedding + Positional Encoding

  • Token IDs → dense embedding vectors (learned lookup)
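  • Positional encodings are added so the model knows token order (self-attention by itself is order-agnostic)
  • Modern models such as LLaMA and Mistral use RoPE (Rotary Position Embedding), which rotates query and key vectors by a position-dependent angle instead of adding a vector to the embeddings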
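As a rough illustration of the embedding step, the sketch below uses the fixed sinusoidal encoding from the original paper rather than RoPE, because it is the simplest scheme to show. The table sizes, the random `embedding_table`, and the name `sinusoidal_positions` are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings (original Transformer paper); d_model must be even."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cosine
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16                         # toy sizes
embedding_table = rng.normal(size=(vocab_size, d_model))  # random here; learned in a real model
token_ids = np.array([3, 17, 3, 42])                 # token 3 appears at positions 0 and 2
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)
```

The repeated token 3 gets the same embedding row both times, but the added position vectors make the two inputs differ, which is how the model can tell the occurrences apart.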

2. Multi-Head Self-Attention

  • The defining innovation of Transformers
  • Computes relevance between every pair of tokens
  • Multiple "heads" allow attending to different relationship types simultaneously
  • See: Attention spec
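To make the "multiple heads" idea concrete, here is a minimal NumPy sketch of multi-head self-attention. The weight matrices are random stand-ins for learned parameters, and the function and variable names are assumptions for this example; masking, dropout, and biases are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # one attention pattern per head
    heads = weights @ v                                  # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4                     # toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads)
```

Each head produces its own `seq_len × seq_len` weight matrix, so different heads are free to focus on different token pairs before their outputs are concatenated and mixed by `w_o`.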

3. Feed-Forward Network (FFN / MLP)

  • Applied independently to each token after attention
  • Two linear layers with a nonlinear activation (ReLU or GELU) in between; newer models often use a gated variant such as SwiGLU
  • Hidden width is typically 4× the model dimension
  • Believed to store much of the model's factual knowledge and concept-specific transformations
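A position-wise feed-forward layer can be sketched as follows. This assumes the classic two-layer form with a GELU activation (tanh approximation) and a 4× hidden width; the weights are random placeholders and all names are illustrative.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """Expand to d_ff, apply the nonlinearity, project back to d_model."""
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # toy sizes
d_ff = 4 * d_model                           # hidden width, 4x the model dimension
w1 = rng.normal(size=(d_model, d_ff)) * 0.1  # random here; learned in a real model
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = feed_forward(x, w1, b1, w2, b2)
```

Because the same weights are applied to every row of `x`, running the layer on a single token gives the same result as running it on the whole sequence and picking that token's row.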

4. Residual Connections

  • Each sublayer (attention + FFN) has a skip connection: output = sublayer(x) + x
  • Allows gradients to flow directly through the network depth
  • Essential for training very deep (100+ layer) models

5. Layer Normalization

  • Normalizes activations before (Pre-LN) or after (Post-LN) each sublayer
  • Stabilizes training; Pre-LN (used by modern LLMs) is more stable at scale
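The difference between Pre-LN and Post-LN is only where the normalization sits relative to the residual connection. The sketch below shows both orderings; `sublayer` is a hypothetical stand-in (a single matrix multiply) for attention or the FFN, and the helper names are assumptions for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln(x, sublayer, gamma, beta):
    # Pre-LN: normalize the sublayer input; the skip path carries x unchanged.
    return x + sublayer(layer_norm(x, gamma, beta))

def post_ln(x, sublayer, gamma, beta):
    # Post-LN (original paper): add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
w = rng.normal(size=(d_model, d_model)) * 0.1

def sublayer(h):
    """Hypothetical stand-in for attention or the FFN."""
    return h @ w

gamma, beta = np.ones(d_model), np.zeros(d_model)
x = rng.normal(size=(3, d_model))
y_pre = pre_ln(x, sublayer, gamma, beta)
y_post = post_ln(x, sublayer, gamma, beta)
```

In the Pre-LN form the input `x` reaches the output through an unnormalized identity path, which is one common explanation for why it tends to train more stably in deep stacks.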

Decoder-Only vs. Encoder-Decoder

Decoder-Only (Autoregressive): Most LLMs

  • Processes tokens left-to-right
  • Each token only attends to itself and previous tokens (causal masking)
  • Used by: GPT family, LLaMA, Claude, Mistral, Gemma
  • Best for: text generation, instruction following
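Causal masking is usually implemented by setting the attention scores for future positions to negative infinity before the softmax. The following is a minimal sketch; the function name and the all-zero toy scores are assumptions chosen so the mask's effect is easy to see.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over scores with future positions (column > row) masked out."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # diagonal is never masked, so max is finite
    exp = np.exp(masked)                                  # exp(-inf) == 0 for masked entries
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # equal raw scores for every pair of tokens
weights = causal_attention_weights(scores)
# Row i spreads its weight evenly over positions 0..i and assigns 0 to later positions.
```

An encoder-only model would skip the mask entirely, so every row could place weight on every column.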

Encoder-Only (Bidirectional)

  • Each token attends to all tokens (both directions)
  • Used by: BERT, RoBERTa, DeBERTa
  • Best for: classification, embedding, understanding tasks

Encoder-Decoder (Seq2Seq)

  • Encoder processes input bidirectionally; decoder generates output autoregressively
  • Used by: T5, BART, mT5, Flan-T5
  • Best for: translation, summarization with separate input/output sequences

The Attention Mechanism (Key Formula)

```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:
Q = Query matrix (what am I looking for?)
K = Key matrix (what do I contain?)
V = Value matrix (what information do I provide?)
d_k = dimension of key vectors (scaling factor)
```
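The formula translates almost line for line into NumPy. The sketch below is a single-head version applied to a tiny hand-made example; the specific Q, K, and V values are invented for illustration.

```python
import numpy as np

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # relevance of each key to each query
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1
    return weights @ v, weights

# One query, three keys. Keys 0 and 2 point the same way as the query; key 1 is orthogonal.
q = np.array([[1.0, 0.0]])
k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
v = np.array([[10.0], [20.0], [30.0]])
out, weights = attention(q, k, v)
```

Here keys 0 and 2 receive equal weight, each larger than the weight on key 1, so the output is a weighted average of the values that leans toward the matching keys.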

Scaling Properties

  • Transformers scale gracefully: more layers and wider dimensions, trained on more data with more compute, yield more capability
  • Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, the "Chinchilla" paper) show that loss improves predictably as parameters, data, and compute grow
  • This predictability enabled the LLM revolution: labs can estimate how a larger model will perform before training it
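As a back-of-the-envelope illustration of this kind of planning, the sketch below uses two rules of thumb from the scaling-law literature, neither of which is stated in this article: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla results suggest roughly 20 training tokens per parameter. Both are approximations, and the function names are made up for this example.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count (assumed ratio of ~20 tokens per parameter)."""
    return tokens_per_param * n_params

n_params = 70e9                              # a 70B-parameter model
n_tokens = chinchilla_tokens(n_params)       # 20 * 70e9 = 1.4e12 tokens
flops = training_flops(n_params, n_tokens)   # 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs
```

Estimates like this give only the order of magnitude of a training run; real budgets depend on hardware utilization and the exact architecture.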

Computational Complexity

  • Self-attention: O(n²d) where n = sequence length, d = model dimension
  • This quadratic scaling with sequence length is why long context windows are expensive
  • Sub-quadratic alternatives, such as linear attention, RWKV, and state-space models like Mamba, attempt to address this
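A quick calculation shows what the quadratic term means in practice. The sketch counts the entries of one attention score matrix and the approximate multiply-adds for one head-agnostic attention pass; the model dimension and sequence lengths are illustrative assumptions, and constant factors from real implementations are ignored.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one n x n attention score matrix."""
    return seq_len * seq_len

def attention_multiply_adds(seq_len: int, d_model: int) -> int:
    """Approximate multiply-adds: n^2 * d for Q K^T plus n^2 * d for the weighted sum over V."""
    return 2 * seq_len * seq_len * d_model

d_model = 4096   # illustrative model dimension
for n in (1_024, 8_192, 65_536):
    print(n, attention_score_entries(n), attention_multiply_adds(n, d_model))
```

Each 8× increase in sequence length multiplies both quantities by 64, whereas a cost that grew linearly with n would only rise by 8×.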

Key Architecture Variants

| Model | Architecture | Innovation |
|-------|-------------|-----------|
| GPT-2/3/4 | Decoder-only | Scaled original Transformer |
| LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU FFN |
| Mistral | Decoder-only | Grouped Query Attention, Sliding Window Attention |
| T5 | Encoder-Decoder | Unified text-to-text framework |
| BERT | Encoder-only | Bidirectional pretraining (MLM) |

Related Concepts

  • Attention, Embeddings, Positional Encoding, Parameters, Pre-training, Inference, Scaling Laws
