Definition
Mixture of Experts (MoE) is a neural network architecture in which only a subset of the model's parameters is activated for each input token. Instead of one monolithic feed-forward network (FFN), an MoE layer replaces the FFN with multiple "expert" FFNs plus a routing mechanism that selects which experts handle each token. This allows very large model capacity while per-token compute scales with the active parameters rather than the total.
The Core Idea
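A 140B-parameter dense model activates all 140B parameters for every token.
A 140B-parameter MoE model that routes each token to 2 of 8 experts activates only about 35–40B parameters per token (for example, Mixtral 8×22B has ~141B total and ~39B active parameters). The active count is more than 2/8 of the expert weights alone because attention and embedding weights are shared by every token.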
```
Dense FFN:                   MoE FFN:
Input → [FFN] → Output       Input → [Router] → Expert 2 ┐
                                              → Expert 5 ┘→ Combine → Output
```
Same (or larger) capacity, much less compute per token.
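To make the total-versus-active distinction concrete, the following sketch counts parameters for a Mixtral-8×7B-style decoder. The configuration constants are assumptions taken from the publicly released Mixtral 8×7B configuration, the function name `count_params` is illustrative, and normalization weights are ignored, so treat the result as an estimate rather than an official figure.

```python
# Rough parameter count for a Mixtral-8x7B-style decoder.
# All constants are assumptions based on the public Mixtral 8x7B config;
# normalization weights are ignored.
HIDDEN = 4096        # model (residual stream) width
FFN_HIDDEN = 14336   # inner width of each expert FFN
LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
KV_DIM = 1024        # 8 key/value heads x 128 dims (grouped-query attention)


def count_params(n_experts_used: int) -> int:
    """Parameters involved when `n_experts_used` experts run in every layer."""
    expert = 3 * HIDDEN * FFN_HIDDEN                        # gate, up, down projections
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q and o, plus k and v
    router = HIDDEN * N_EXPERTS                             # one linear layer per block
    per_layer = n_experts_used * expert + attention + router
    embeddings = 2 * VOCAB * HIDDEN                         # input embedding + output head
    return LAYERS * per_layer + embeddings


total_params = count_params(N_EXPERTS)  # every expert must be stored
active_params = count_params(TOP_K)     # only TOP_K experts run per token
fp16_gb = total_params * 2 / 1e9        # 2 bytes per parameter in fp16

print(f"total  ~ {total_params / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active_params / 1e9:.1f}B")  # active ~ 12.9B
print(f"fp16 weights ~ {fp16_gb:.0f} GB")      # fp16 weights ~ 93 GB
```

Only the expert term changes between the two calls, which is why compute per token drops sharply while the memory needed to hold the weights does not.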
Architecture
Standard MoE Layer (replaces FFN in each Transformer block)
```
Token embedding
      ↓
[Router / Gating Network]
      ↓  selects top-K experts
[Expert 1] [Expert 2] ... [Expert N]
      (only K activate for this token)
      ↓  weighted combination
Output embedding
```
Router / Gating Mechanism
- A small linear layer that maps token embedding → softmax over N experts
- Top-K selection: typically K=2 (each token uses 2 experts)
- The selected experts' outputs are weighted by the router's softmax scores
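The routing steps listed above can be sketched in plain Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the helper names (`make_expert`, `moe_forward`), the tiny dimensions, and the random weights are all hypothetical, and each expert is a simple two-layer ReLU FFN.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(dim, hidden, rng):
    """Hypothetical toy expert: a two-layer ReLU FFN with random weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(w_in, x)]
        return matvec(w_out, h)

    return expert


def moe_forward(token, w_router, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    gates = softmax(matvec(w_router, token))  # one score per expert
    top_k = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    output = [0.0] * len(token)
    for i in top_k:  # only k experts are evaluated for this token
        y = experts[i](token)
        output = [o + gates[i] * v for o, v in zip(output, y)]
    return output, top_k, gates


rng = random.Random(0)
DIM, HIDDEN, N_EXPERTS = 4, 8, 8
w_router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

output, chosen, gates = moe_forward(token, w_router, experts, k=2)
print("chosen experts:", chosen)
```

The sketch weights each expert's output by its raw softmax score, as the list describes. Some implementations instead renormalize the K selected scores so that they sum to 1.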
In compact pseudocode, the routing step is:
```
gates = softmax(W_router × token_embedding)
top_k_indices = argsort(gates)[-K:]
output = Σ gates[i] × Expert_i(token) for i in top_k_indices
# some implementations renormalize the K selected gates so they sum to 1
```

MoE Parameters
| Parameter | Typical Value | Effect |
|-----------|--------------|--------|
| N (total experts) | 8, 16, 64, 128 | Total model capacity |
| K (active experts per token) | 1, 2 (up to 8 in fine-grained designs) | Compute per token |
| Expert size | Often the same as a dense FFN | Individual expert capacity |
| Router type | Top-K softmax | Routing strategy |

Famous MoE Models
| Model | Experts | Active per Token | Total Params | Active Params |
|-------|---------|--------|-------------|--------------|
| GPT-4 (unconfirmed estimates) | ~16 | ~2 | ~1.8T | ~220B |
| Mixtral 8×7B | 8 | 2 | ~46B | ~12.9B |
| Mixtral 8×22B | 8 | 2 | ~141B | ~39B |
| DeepSeek-V2 | 160 | 6 | 236B | 21B |
| DeepSeek-V3 | 256 | 8 | 671B | 37B |
| Grok-1 | 8 | 2 | 314B | ~85B |
| LLaMA-MoE (various) | Varies | Varies | Varies | Varies |

Advantages
Compute Efficiency
- A ~46B MoE model (Mixtral 8×7B) runs at roughly the per-token compute cost of a ~13B dense model
- Higher capacity-to-compute ratio than dense models
- Each token activates only 2 of 8 experts, so 75% of the expert FFN parameters go unused for that token
Quality
- At the same active parameter count, MoE models typically outperform dense models
- Specialists may emerge: different experts handle different types of knowledge
Throughput at Scale
- For the same quality target, MoE requires fewer FLOPs per token
- Better inference throughput when all experts fit in memory

Disadvantages
Memory Requirements
All expert weights must be loaded into GPU memory, even if only 2 of 8 are used for a given token:
- Mixtral 8×7B: ~46B params → ~92 GB in fp16 → requires 2× A100 80GB
- A 13B dense model would require only ~26 GB
- Memory ≠ compute: MoE models are compute-efficient but memory-heavy
Load Balancing Challenge
If all tokens route to the same 1–2 experts ("expert collapse"):
- Some experts become overloaded
- Others become unused (dead experts)
- Fix: an auxiliary load-balancing loss added during training to encourage uniform routing
Training Instability
MoE training is harder than dense training:
- The router can learn degenerate patterns
- Requires careful initialization and loss balancing
- Communication overhead in distributed training
Communication Overhead
In distributed training and inference, different experts may live on different GPUs:
- Each token must be "sent" to the GPU hosting its selected expert
- All-to-all communication overhead at scale

Expert Specialization
Research suggests that MoE experts can develop specialization:
- Some experts activate more for code, others for natural language
- Some activate for specific languages
- Specialization is emergent, not programmed

Fine-tuning MoE Models
Standard LoRA works well for MoE:
- Apply LoRA adapters to attention layers
- Optionally apply them to expert FFN layers
- Router weights are typically frozen during fine-tuning

MoE vs. Dense: When to Choose
| Use Case | Prefer MoE | Prefer Dense |
|----------|------------|-------------|
| Quality per FLOP | MoE wins | - |
| Memory-constrained | - | Dense wins |
| Fast single-GPU inference | - | Dense wins |
| Large-scale serving | MoE wins | - |
| Fine-tuning ease | - | Slight edge to dense |

Related concepts: Transformer, Parameters, Scaling Laws, Inference, Fine-Tuning, Quantization