Definition
The Transformer is the neural network architecture that underlies nearly all modern LLMs. Introduced in the paper "Attention Is All You Need" (Vaswani et al., Google Brain, 2017), it replaced recurrent networks (RNNs/LSTMs) as the dominant sequence model by processing all tokens in parallel with self-attention, which made training at massive scale practical.
Why Transformers Replaced RNNs
| Aspect | RNN/LSTM | Transformer |
|--------|----------|-------------|
| Processing | Sequential (token by token) | Parallel (all tokens at once) |
| Long-range dependencies | Poor (vanishing gradient) | Excellent (direct attention) |
| Training speed | Slow (can't parallelize across time steps) | Fast (GPU-friendly) |
| Scaling | Difficult | Scales to billions of params |
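The "Processing" and "Long-range dependencies" rows can be illustrated with a toy comparison. This is a minimal sketch, not a real model: the sizes, the random weights `w_x` and `w_h`, and the use of the raw token vectors as queries, keys, and values (no learned projections) are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # sequence length and vector size (illustrative)
x = rng.normal(size=(n, d))      # stand-in token vectors

# Recurrent update: step t needs the hidden state from step t-1,
# so the n steps have to run one after another.
w_x = rng.normal(size=(d, d))
w_h = rng.normal(size=(d, d))
h = np.zeros(d)
rnn_states = []
for t in range(n):
    h = np.tanh(x[t] @ w_x + h @ w_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)            # shape (n, d)

# Self-attention: one matrix product scores every token against every other,
# so all n positions are computed together and any two tokens interact directly.
scores = x @ x.T / np.sqrt(d)                # shape (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x                       # shape (n, d)
```

In the recurrent loop, information from the first token reaches the last only through every intermediate hidden state; in the attention version, `weights[0, -1]` links the two tokens in a single step.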
Architecture Overview
```
Input Text
    ↓
[Tokenizer] → Token IDs
    ↓
[Embedding Layer] → Token Vectors + Positional Encoding
    ↓
┌─────────────────────────────────┐
│      Transformer Block × N      │
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
│  ┌───────────────────────────┐  │
│  │ Feed-Forward Network      │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
    ↓
[LM Head] → Logits over vocabulary
    ↓
[Softmax] → Probability distribution
    ↓
Next Token
```
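The pipeline in the diagram can be sketched as a small NumPy forward pass. This is an illustrative sketch under several assumptions, not a reference implementation: weights are random and untrained, positions use a learned-style lookup table (`pos_emb`), normalization is applied after each residual as drawn in the diagram and has no learned gain or bias, the feed-forward layer uses ReLU, a causal mask is applied as in decoder-only models, and the next token is picked greedily. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, p, n_heads):
    n, d = x.shape
    d_head = d // n_heads

    def split(w):
        # Project, then split into heads: (n_heads, n, d_head).
        return (x @ w).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(p["wq"]), split(p["wk"]), split(p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide future tokens
    out = softmax(scores) @ v                             # (n_heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d)            # merge heads back to (n, d)
    return out @ p["wo"]

def feed_forward(x, p):
    return np.maximum(0.0, x @ p["w1"]) @ p["w2"]         # linear, ReLU, linear

def transformer_block(x, p, n_heads):
    x = layer_norm(x + causal_self_attention(x, p, n_heads))  # attention + residual + LN
    x = layer_norm(x + feed_forward(x, p))                    # FFN + residual + LN
    return x

def init_params(rng, vocab, d_model, n_layers, max_len):
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)
    blocks = [
        {"wq": w(d_model, d_model), "wk": w(d_model, d_model),
         "wv": w(d_model, d_model), "wo": w(d_model, d_model),
         "w1": w(d_model, 4 * d_model), "w2": w(4 * d_model, d_model)}
        for _ in range(n_layers)
    ]
    return {"tok_emb": w(vocab, d_model), "pos_emb": w(max_len, d_model),
            "blocks": blocks, "lm_head": w(d_model, vocab)}

def next_token_probs(token_ids, params, n_heads):
    n = len(token_ids)
    x = params["tok_emb"][token_ids] + params["pos_emb"][:n]  # embeddings + positions
    for p in params["blocks"]:
        x = transformer_block(x, p, n_heads)
    logits = x @ params["lm_head"]                            # (n, vocab)
    return softmax(logits)                                    # one distribution per position

rng = np.random.default_rng(0)
params = init_params(rng, vocab=50, d_model=16, n_layers=2, max_len=32)
probs = next_token_probs(np.array([3, 14, 15, 9]), params, n_heads=4)
next_token = int(probs[-1].argmax())   # greedy choice from the last position
```

Each row of `probs` is a distribution over the 50-entry toy vocabulary; only the last row is needed to choose the next token. Because of the causal mask, changing the final input token leaves the earlier rows unchanged.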
Core Components
1. Input Embedding + Positional Encoding
- Token IDs → dense embedding vectors (learned lookup)
- Positional encodings are added so the model knows token order; self-attention on its own is order-agnostic
- Modern models such as LLaMA and Mistral use RoPE (Rotary Position Embedding), which rotates queries and keys inside attention instead of adding a position vector to the embeddings
2. Multi-Head Self-Attention
- The defining innovation of Transformers
- Computes a relevance score between every pair of tokens
- Multiple "heads" allow attending to different relationship types simultaneously
- See: Attention spec
3. Feed-Forward Network (FFN / MLP)
- Applied independently to each token after attention
- Two linear layers with a nonlinear activation (ReLU or GELU) in between
- Hidden width is typically 4× the model dimension
- Believed to hold much of the model's factual knowledge and concept-specific transformations
4. Residual Connections
- Each sublayer (attention and FFN) has a skip connection: `output = sublayer(x) + x`
- Allows gradients to flow directly through the network depth
- Essential for training very deep (100+ layer) models
5. Layer Normalization
- Normalizes activations before (Pre-LN) or after (Post-LN) each sublayer
- Stabilizes training; Pre-LN (used by most modern LLMs) is more stable at scale
Decoder-Only vs. Encoder-Decoder
Decoder-Only (Autoregressive): Most LLMs
- Generates tokens left to right
- Each token attends only to itself and previous tokens (causal masking)
- Used by: GPT family, LLaMA, Claude, Mistral, Gemma
- Best for: text generation, instruction following
Encoder-Only (Bidirectional)
- Each token attends to all tokens (both directions)
- Used by: BERT, RoBERTa, DeBERTa
- Best for: classification, embedding, understanding tasks
Encoder-Decoder (Seq2Seq)
- Encoder processes input bidirectionally; decoder generates output autoregressively
- Used by: T5, BART, mT5, Flan-T5
- Best for: translation, summarization with separate input/output sequences
The Attention Mechanism (Key Formula)
```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where:
  Q = Query matrix (what am I looking for?)
  K = Key matrix (what do I contain?)
  V = Value matrix (what information do I provide?)
  d_k = dimension of key vectors (scaling factor)
```
Scaling Properties
- Transformers scale gracefully: more layers and wider dimensions generally yield more capability
- Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, "Chinchilla") show that loss improves predictably with model size, data, and compute
- This predictability enabled the LLM revolution: labs can estimate how a larger model will perform before training it
Computational Complexity
- Self-attention: O(n²d) where n = sequence length, d = model dimension
- This quadratic scaling with sequence length is why long context windows are expensive
- Sub-quadratic alternatives, such as linear attention, RWKV, and state-space models like Mamba, attempt to address this
Related Specs
- Attention, Embeddings, Positional Encoding, Parameters, Pre-training, Inference, Scaling Laws
Key Architecture Variants
| Model | Architecture | Innovation |
|-------|-------------|-----------|
| GPT-2/3/4 | Decoder-only | Scaled up the Transformer decoder (model size, data, compute) |
| LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU FFN |
| Mistral | Decoder-only | Grouped Query Attention, Sliding Window Attention |
| T5 | Encoder-Decoder | Unified text-to-text framework |
| BERT | Encoder-only | Bidirectional pretraining (MLM) |
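Two of the LLaMA-row innovations, RMSNorm and the SwiGLU feed-forward layer, are compact enough to sketch directly. The code below follows the way these components are commonly defined and is not taken from any model's source code; the function names, weight names (`w_gate`, `w_up`, `w_down`), and toy dimensions are assumptions for illustration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale each token vector by its root-mean-square.
    # Unlike LayerNorm, the mean is not subtracted and there is no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated feed-forward layer: an activated "gate" projection multiplies
    # a second linear projection elementwise before the output projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                     # toy sizes
x = rng.normal(size=(5, d_model))             # 5 token vectors
normed = rms_norm(x, gain=np.ones(d_model))
y = swiglu_ffn(
    normed,
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_hidden, d_model)),
)
```

Compared with the two-layer ReLU/GELU feed-forward network, the gated form uses three weight matrices instead of two, while RMSNorm drops the mean subtraction and bias of standard layer normalization.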