Transformer — FDE@ProdAI Blog

Definition

The Transformer is the neural network architecture behind nearly all modern LLMs. Introduced in the paper "Attention Is All You Need" (Vaswani et al., Google, 2017), it displaced recurrent networks (RNNs/LSTMs) for most sequence tasks by processing all tokens in parallel with self-attention, which made training at very large scale practical.

Why Transformers Replaced RNNs

| Aspect | RNN/LSTM | Transformer |
|--------|----------|-------------|
| Processing | Sequential (token by token) | Parallel (all tokens at once) |
| Long-range dependencies | Poor (vanishing gradient) | Excellent (direct attention) |
| Training speed | Slow (can't parallelize) | Fast (GPU-friendly) |
| Scaling | Difficult | Scales to billions of params |

Architecture Overview

Input Text
↓
[Tokenizer] → Token IDs
↓
[Embedding Layer] → Token Vectors + Positional Encoding
↓
┌─────────────────────────────────┐
│ Transformer Block × N           │
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
│  ┌───────────────────────────┐  │
│  │ Feed-Forward Network      │  │
│  │ + Residual Connection     │  │
│  │ + Layer Normalization     │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
↓
[LM Head] → Logits over vocabulary
↓
[Softmax] → Probability distribution
↓
Next Token
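
The pipeline above can be sketched end to end in a few lines of NumPy. This is a toy illustration under stated assumptions, not a production implementation: the tokenizer is skipped (token IDs are supplied directly), weights are random, attention uses a single head with a causal mask, positions are a learned-style lookup table, and all sizes and names (`forward`, `blocks`, `lm_head`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_blocks, max_len = 50, 16, 2, 8  # toy sizes (assumptions)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init(*shape):
    return rng.normal(size=shape) * 0.1  # random stand-in for trained weights

embed = init(vocab_size, d_model)        # embedding layer
positions = init(max_len, d_model)       # positional table (stand-in)
blocks = []
for _ in range(n_blocks):
    blocks.append({
        "wq": init(d_model, d_model), "wk": init(d_model, d_model),
        "wv": init(d_model, d_model),
        "w1": init(d_model, 4 * d_model), "w2": init(4 * d_model, d_model),
    })
lm_head = init(d_model, vocab_size)

def forward(token_ids):
    n = len(token_ids)
    x = embed[token_ids] + positions[:n]              # token vectors + positions
    mask = np.tril(np.ones((n, n), dtype=bool))       # causal mask
    for p in blocks:
        q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
        scores = np.where(mask, q @ k.T / np.sqrt(d_model), -np.inf)
        x = layer_norm(x + softmax(scores) @ v)       # attention + residual + norm
        ffn = np.maximum(0.0, x @ p["w1"]) @ p["w2"]  # two linear layers with ReLU
        x = layer_norm(x + ffn)                       # FFN + residual + norm
    logits = x[-1] @ lm_head                          # logits for the last position
    return softmax(logits)                            # distribution over vocabulary

probs = forward(np.array([3, 14, 15, 9]))
next_token = int(np.argmax(probs))                    # greedy choice of next token
```

Because the weights are random, the predicted token is meaningless; the point is only the order of operations and the tensor shapes at each stage.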

Core Components

1. Input Embedding + Positional Encoding

Token IDs → dense embedding vectors (learned lookup)
Positional encodings added so the model knows token order
Modern: RoPE (Rotary Position Embedding) used in LLaMA, Mistral
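
As a rough sketch of this step, the snippet below looks up embeddings for a few token IDs and adds the fixed sinusoidal encodings from the original paper. The table sizes, the random `embedding_table`, and the function name `sinusoidal_positions` are illustrative assumptions. RoPE works differently (it rotates queries and keys inside attention rather than adding a vector to the embeddings) and is not shown here.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings, shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16                               # toy sizes (assumptions)
embedding_table = rng.normal(size=(vocab_size, d_model))    # learned in a real model

token_ids = np.array([5, 42, 7])
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)
print(x.shape)  # one d_model-sized vector per token
```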

2. Multi-Head Self-Attention

The defining innovation of Transformers
Computes relevance between every pair of tokens
Multiple "heads" allow attending to different relationship types simultaneously
See: Attention spec
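
To make the "multiple heads" idea concrete, here is a minimal NumPy sketch, assuming random projection matrices and no masking. It splits the model dimension into `num_heads` slices, runs scaled dot-product attention in each slice independently, then concatenates the results and applies an output projection. The function and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ v                                  # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2                    # toy sizes (assumptions)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each head gets its own (seq_len × seq_len) attention map, which is what lets different heads specialize in different relationships.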

3. Feed-Forward Network (FFN / MLP)

Applied independently to each token after attention
Two linear layers with a nonlinear activation (ReLU or GELU) in between
Hidden width is typically 4× the model dimension
Believed to store much of the model's factual knowledge and concept-specific transformations
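
A minimal sketch of such a block, assuming random weights, the tanh approximation of GELU, and a hidden width of 4 × d_model. Gated variants such as SwiGLU use a different structure and are not shown.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """x: (seq_len, d_model). Expand to d_ff, apply GELU, project back to d_model."""
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model = 8
d_ff = 4 * d_model                        # the usual 4x expansion
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))         # three token vectors
y = feed_forward(x, w1, b1, w2, b2)

# Each row is transformed independently: feeding one token alone gives the same row.
y_single = feed_forward(x[1:2], w1, b1, w2, b2)
```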

4. Residual Connections

Each sublayer (attention and FFN) is wrapped in a skip connection: output = x + sublayer(x)
Allows gradients to flow directly through the network depth
Essential for training very deep (100+ layer) models

5. Layer Normalization

Normalizes activations before (Pre-LN) or after (Post-LN) each sublayer
Stabilizes training; the original Transformer used Post-LN, while Pre-LN (used by most modern LLMs) is more stable at scale
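
The two placements differ only in where the normalization sits relative to the residual addition. The sketch below shows both, assuming a layer norm without the learned gain and bias and a single matrix multiply as a stand-in for the attention or FFN sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Post-LN: add the residual, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    return h @ w  # stand-in for attention or the FFN

x = rng.normal(size=(3, 8))
y_post = post_ln(x, sublayer)
y_pre = pre_ln(x, sublayer)
```

In the Pre-LN form, `x` passes to the output unchanged, so there is an identity path from the first layer to the last. That is one common explanation for why it trains more stably in deep stacks.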

Decoder-Only vs. Encoder-Only vs. Encoder-Decoder

Decoder-Only (Autoregressive) — Most LLMs

Generates tokens left to right, one at a time
Each token only attends to itself and previous tokens (causal masking)
Used by: GPT family, LLaMA, Claude, Mistral, Gemma
Best for: text generation, instruction following
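
Causal masking can be sketched as a lower-triangular boolean matrix applied to the attention scores before the softmax. The helper names below are hypothetical, and the all-zero scores are chosen only so the effect of the mask is easy to see.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may look at positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)              # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)                                    # exp(-inf) becomes 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                                 # equal raw scores everywhere
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With equal raw scores, row i spreads its weight uniformly over positions 0..i and assigns exactly zero to every later position. An encoder-only model would skip the mask and let every row attend to all positions.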

Encoder-Only (Bidirectional)

Each token attends to all tokens (both directions)
Used by: BERT, RoBERTa, DeBERTa
Best for: classification, embedding, understanding tasks

Encoder-Decoder (Seq2Seq)

Encoder processes input bidirectionally; decoder generates output autoregressively
Used by: T5, BART, mT5, Flan-T5
Best for: translation, summarization with separate input/output sequences

The Attention Mechanism (Key Formula)

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:
Q = Query matrix (what am I looking for?)
K = Key matrix (what do I contain?)
V = Value matrix (what information do I provide?)
d_k = dimension of key vectors (scaling factor)
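
The formula translates almost line for line into NumPy. This is a single-head sketch with hand-picked toy matrices (an assumption for illustration), with no masking and no learned projections.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention. q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # QK^T / sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights                           # weighted sum of values

q = np.array([[1.0, 0.0], [0.0, 1.0]])    # two queries
k = np.array([[1.0, 0.0], [0.0, 1.0]])    # two keys
v = np.array([[10.0, 0.0], [0.0, 10.0]])  # two values
out, weights = attention(q, k, v)
print(weights)
print(out)
```

Each query puts more weight on the key it matches, so each output row is pulled toward the corresponding value row. Dividing by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near one-hot outputs.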

Scaling Properties

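Transformers scale predictably: adding parameters (more layers, wider dimensions), training data, and compute lowers loss and generally improves capability
Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, "Chinchilla") show that loss falls as a smooth power law in model size, data, and compute
This predictability enabled the LLM revolution: labs can estimate a larger model's loss before committing to train it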
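
A back-of-the-envelope version of this kind of planning, using two widely quoted approximations that should be treated as assumptions rather than exact results: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # rough compute-optimal heuristic (assumption)

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Solve C = 6 * N * (20 * N) for N, then derive D = 20 * N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# A 70B-parameter model trained on 1.4T tokens (the Chinchilla configuration).
budget = training_flops(70e9, 1.4e12)
n_params, n_tokens = compute_optimal_split(budget)
print(f"budget ≈ {budget:.2e} FLOPs")
print(f"suggested size ≈ {n_params:.2e} params, {n_tokens:.2e} tokens")
```

Real scaling-law fits include constants and exponents estimated from many training runs. This sketch only shows how a compute budget can be turned into a rough model size and token count.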

Computational Complexity

Self-attention: O(n²d) time, where n = sequence length, d = model dimension
This quadratic scaling with sequence length is why long context windows are expensive
Sub-quadratic alternatives (linear attention, RWKV, state space models such as Mamba) attempt to address this
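
The quadratic term is easy to see with a rough cost count. The sketch below counts only the two matrix products inside attention (QK^T and weights × V) at about 2 · n² · d FLOPs each. It ignores the Q/K/V/output projections and the FFN, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(seq_len: int, d_model: int) -> dict:
    """Approximate per-layer attention cost, excluding the linear projections."""
    score_flops = 2 * seq_len * seq_len * d_model  # Q @ K^T
    mix_flops = 2 * seq_len * seq_len * d_model    # softmax(scores) @ V
    return {
        "flops": score_flops + mix_flops,
        "score_entries": seq_len * seq_len,        # size of one n x n attention map
    }

for n in (1024, 2048, 4096):
    cost = attention_cost(n, d_model=4096)
    print(n, cost["flops"], cost["score_entries"])
```

Doubling the sequence length quadruples both the FLOPs and the number of attention scores, while doubling d_model only doubles the FLOPs.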

Key Architecture Variants

| Model | Architecture | Innovation |
|-------|--------------|------------|
| GPT-2/3/4 | Decoder-only | Scaled original Transformer |
| LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU FFN |
| Mistral | Decoder-only | Grouped Query Attention, Sliding Window Attention |
| T5 | Encoder-Decoder | Unified text-to-text framework |
| BERT | Encoder-only | Bidirectional pretraining (MLM) |

Related Concepts

Attention, Embeddings, Positional Encoding, Parameters, Pre-training, Inference, Scaling Laws