Definition
Scaling laws are empirical relationships that describe how LLM performance (measured by loss) improves predictably and smoothly as a function of three resources: model size (parameters), training data (tokens), and compute (FLOPs). They allow researchers to forecast model capability before training, and to optimally allocate a compute budget.
Why Scaling Laws Matter
Before scaling laws, building better models was trial-and-error. Scaling laws revealed:
- Larger models trained on more data are predictably better: loss falls as a power law in scale
- You can extrapolate small-model training runs to predict large-model performance
- There are optimal ratios between model size and data for a given compute budget
- This predictability enabled the "just scale it" approach that produced GPT-3, GPT-4, and beyond
- Loss scales as a power law in N (parameters), D (data tokens), and C (compute)
- Loss improves smoothly and predictably with scale, with no abrupt phase changes in the loss itself (downstream task metrics can behave differently)
- Larger models are more sample-efficient: they reach a given loss with fewer training tokens
- Implication (later revised by Chinchilla): given a fixed compute budget, spend most of it on a larger model, even if that means training on less data
- Rule of thumb: ~20 tokens per parameter for compute-optimal training
- GPT-3 (175B params, 300B tokens, about 1.7 tokens per parameter) was significantly undertrained by this standard
- Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params) using the same compute budget
- Inference-optimal alternative: train a smaller model on far more tokens than the compute-optimal amount
- Result: the model is "overtrained" for its size, but cheaper to run at inference
- Examples: LLaMA 7B (trained on 1–2T tokens, far more than the ~140B tokens the 20:1 rule suggests), Mistral 7B
- Inference compute matters too — a smaller overtrained model may outperform a compute-optimal larger model while being cheaper to serve
- Phase transitions: a capability is absent at small scale, then appears sharply at a threshold
- Examples: in-context learning, chain-of-thought reasoning, multi-step arithmetic
- Debated: some argue these are measurement artifacts of benchmark thresholds
- Some tasks improve smoothly with scale (MMLU)
- Others show emergent threshold behavior (GSM8K, BIG-Bench Hard)
- Quality of training data can shift the scaling curve significantly
- Quality > Quantity: high-quality data (books, Wikipedia) is worth more per token than web crawl
- Diversity: more diverse training data tends to yield broader capabilities
- Repetition hurts: training on the same data for more than roughly 1–3 epochs tends to degrade performance
- Data mixture matters: the proportion of code, math, multilingual data in training shapes capabilities
- Optimal model size: ~29B parameters (from C ≈ 6ND with D ≈ 20N at a 10^23 FLOP budget)
- Optimal training tokens: ~580B tokens
- Related terms: Pre-training, Parameters, Compute, Chinchilla, Emergent Abilities, Model Selection, Loss Function
The Two Landmark Papers
1. Kaplan et al. (OpenAI, 2020) — "Scaling Laws for Neural Language Models"
Key findings:
```
L(N) ∝ N^(-0.076)   (model size scaling)
L(D) ∝ D^(-0.095)   (data size scaling)
L(C) ∝ C^(-0.050)   (compute scaling)
```
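To make these exponents concrete, here is a minimal sketch of what they imply when one resource grows tenfold. It assumes a pure power law L ∝ X^(-α) with no irreducible-loss term and no bottleneck from the other resources; the helper names `loss_multiplier` and `summarize` are illustrative, not from the paper.

```python
# Exponents as quoted above from Kaplan et al. (2020).
KAPLAN_EXPONENTS = {"parameters": 0.076, "tokens": 0.095, "compute": 0.050}


def loss_multiplier(scale_factor: float, alpha: float) -> float:
    """Factor by which loss changes when a resource grows by `scale_factor`.

    Assumes a pure power law L ∝ X^(-alpha); this is a simplification.
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor ** (-alpha)


def summarize(scale_factor: float = 10.0) -> dict:
    """Loss multiplier for each resource when it alone is scaled up."""
    return {
        name: loss_multiplier(scale_factor, alpha)
        for name, alpha in KAPLAN_EXPONENTS.items()
    }


if __name__ == "__main__":
    for name, mult in summarize().items():
        print(f"10x {name}: loss multiplied by {mult:.3f}")
```

Under these assumptions, a 10× increase in parameters, tokens, or compute lowers loss by roughly 16%, 20%, and 11% respectively. The gains are steady but small per order of magnitude, which is why scaling is expensive.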
2. Chinchilla (Hoffmann et al., DeepMind, 2022) — "Training Compute-Optimal Large Language Models"
Corrected Kaplan's recommendation to grow model size much faster than data.
Key finding: for a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion.
Chinchilla formula:
```
N_optimal ∝ C^0.49
D_optimal ∝ C^0.51
Approximately: D_optimal ≈ 20 × N_optimal
```
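The 20:1 rule of thumb gives a quick way to check whether a model is under- or overtrained relative to the compute-optimal point. The sketch below applies it to the GPT-3 and Chinchilla figures quoted in this article; the function and constant names are illustrative assumptions.

```python
# Rule of thumb quoted in the text: about 20 training tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20.0

# (parameters, training tokens) as quoted in this article.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def relative_to_rule(n_params: float, n_tokens: float) -> float:
    """Ratio of the actual tokens/param to the 20:1 rule.

    Values below 1 suggest undertraining, values above 1 overtraining,
    both relative to the compute-optimal rule of thumb only.
    """
    return tokens_per_param(n_params, n_tokens) / CHINCHILLA_TOKENS_PER_PARAM


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
              f"{relative_to_rule(n, d):.2f}x the 20:1 rule")
```

By this measure GPT-3 saw under a tenth of the tokens the rule suggests for its size, while Chinchilla sits on the rule by construction.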
Practical Implications of Scaling Laws
For Model Training
| Decision | Guidance from Scaling Laws |
|----------|---------------------------|
| Model size | Larger is better, but must be matched with enough data |
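| Data quantity | ~20 tokens/param for compute-optimal training; frontier labs use 10–100× more when optimizing for inference cost |
| Training duration | For a fixed model, loss keeps falling as it sees more fresh tokens, so stopping early leaves performance unused |
| Compute budget | Scale model size and training tokens in roughly equal proportion as the budget grows |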
For Inference Efficiency
After Chinchilla, the field shifted to overtrained smaller models:
Emergent Abilities and Scaling Laws
Some capabilities don't follow smooth power laws — they appear abruptly:
Beyond Loss: Downstream Task Scaling
Scaling laws on perplexity (language modeling loss) correlate with downstream task performance — but not perfectly:
Data Scaling Laws
Compute Scaling (Chinchilla Formula in Practice)
For a training budget of C FLOPs:
```
Assuming C ≈ 6 × N × D and D ≈ 20 × N:
Optimal N (parameters) ≈ (C / 120)^0.5
Optimal D (tokens)     ≈ 20 × N ≈ (10 × C / 3)^0.5
```
Example: 10^23 FLOPs budget
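This example can be worked through with a short sketch. It assumes the common approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter rule of thumb; substituting D = 20N gives C ≈ 120·N², so N ≈ √(C/120). The function name `compute_optimal` and the constants are illustrative, not taken from the Chinchilla paper.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumption: D ≈ 20 * N (rule of thumb)


def compute_optimal(budget_flops: float,
                    tokens_per_param: float = TOKENS_PER_PARAM):
    """Return (parameters, tokens) that spend `budget_flops` at the given ratio."""
    if budget_flops <= 0:
        raise ValueError("budget_flops must be positive")
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal(1e23)
    print(f"{n / 1e9:.0f}B parameters, {d / 1e9:.0f}B tokens")
```

Under these assumptions a 10^23 FLOP budget corresponds to roughly 29B parameters trained on roughly 580B tokens. The fitted exponents in the Chinchilla paper differ slightly from an exact 0.5/0.5 split, so treat this as a back-of-the-envelope estimate.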
The "Compute-Optimal" vs. "Inference-Optimal" Distinction
| Approach | Model Size | Tokens | Result |
|----------|-----------|--------|--------|
| Compute-optimal | Large | Fewer | Best performance per training FLOP |
| Inference-optimal | Small | Many (overtrained) | Best performance per inference FLOP |
Frontier labs are shifting from compute-optimal to inference-optimal training as deployment costs dominate.
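A rough way to see why deployment costs change the calculus is to add inference FLOPs to training FLOPs. The sketch below assumes about 6·N·D FLOPs for training and about 2·N FLOPs per generated or processed token at inference; both are standard approximations rather than figures from this article. The two model configurations are hypothetical, and the sketch does not claim they reach the same quality.

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    """Approximate training cost, assuming C ≈ 6 * N * D."""
    return 6.0 * n_params * train_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate serving cost, assuming about 2 * N FLOPs per token."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, train_tokens: float,
                   served_tokens: float) -> float:
    """Training plus inference cost over the model's deployment."""
    return (training_flops(n_params, train_tokens)
            + inference_flops(n_params, served_tokens))


def breakeven_served_tokens(train_tokens: float) -> float:
    """Served tokens at which inference cost equals training cost.

    6 * N * D = 2 * N * T  =>  T = 3 * D, independent of model size.
    """
    return 3.0 * train_tokens


if __name__ == "__main__":
    served = 1e13  # hypothetical deployment volume
    for name, n, d in [("large, compute-optimal", 70e9, 1.4e12),
                       ("small, overtrained", 7e9, 2e12)]:
        print(f"{name}: {lifetime_flops(n, d, served):.2e} lifetime FLOPs")
```

Under these approximations, inference overtakes training once a model has served about three times as many tokens as it was trained on. After that point, every reduction in parameter count pays off on each token served.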
Scaling Laws Limitations
| Limitation | Notes |
|------------|-------|
| Architecture changes reset the curve | New architectures (MoE, Mamba) shift the power law |
| Data quality is not accounted for | Laws assume uniform quality corpora |
| Benchmark saturation | At some scale, benchmarks max out |
| Emergent abilities are discontinuous | Not everything follows a smooth power law |
| Reasoning ability is not predicted directly | Lower loss does not map one-to-one onto reasoning performance; GPT-4's quality jump is generally attributed to more than scale alone |