Scaling Laws — FDE@ProdAI Blog

Definition

Scaling laws are empirical relationships that describe how LLM performance (measured by loss) improves predictably and smoothly as a function of three resources: model size (parameters), training data (tokens), and compute (FLOPs). They allow researchers to forecast model capability before training, and to optimally allocate a compute budget.

Why Scaling Laws Matter

Before scaling laws, building better models was trial-and-error. Scaling laws revealed:

Bigger models with more data = predictably better, following a power law
You can extrapolate small-model training runs to predict large-model performance
There are optimal ratios between model size and data for a given compute budget
This predictability enabled the "just scale it" approach that produced GPT-3, GPT-4, and beyond
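The extrapolation idea can be sketched in a few lines. This is a minimal illustration, not a production fitting procedure: the "small-model" losses below are synthetic numbers generated from a known power law, and the helper names `fit_power_law` and `predict_loss` are hypothetical. Real fits usually also include an irreducible-loss term, which this sketch leaves out.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Evaluate the fitted power law loss = a * size^(-alpha)."""
    return a * size ** (-alpha)

# Synthetic results for four small models (assumed values, not real measurements).
small_sizes = [1e7, 3e7, 1e8, 3e8]
small_losses = [5.0 * n ** -0.08 for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(a, alpha, 1e10)  # extrapolate to a 10B-parameter model
print(f"fitted alpha = {alpha:.3f}, predicted loss at 10B params = {predicted:.3f}")
```

Because a power law is a straight line in log-log space, an ordinary linear fit on the logarithms is enough to recover the exponent.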

The Two Landmark Papers

1. Kaplan et al. (OpenAI, 2020) — "Scaling Laws for Neural Language Models"

Key findings:

Loss scales as a power law in N (parameters), D (data tokens), and C (compute)
Performance improves smoothly and predictably with scale — no abrupt phase changes (mostly)
Larger models are more sample-efficient — they learn more from each token
Implication: given a fixed compute budget, use the largest model possible even with less data

L(N) ∝ N^(-0.076) (model size scaling)

L(D) ∝ D^(-0.095) (data size scaling)

L(C) ∝ C^(-0.050) (compute scaling)
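To get a feel for how shallow these exponents are, the sketch below computes how much loss changes when one resource grows tenfold. It uses only the three exponents quoted above; the function names are hypothetical, and each law is treated in isolation (it assumes the other resources are not the bottleneck).

```python
# Exponents quoted above from Kaplan et al. (2020).
KAPLAN_EXPONENTS = {"parameters": 0.076, "data tokens": 0.095, "compute": 0.050}

def loss_ratio(scale_factor, exponent):
    """Factor by which loss is multiplied when a resource grows by scale_factor."""
    return scale_factor ** (-exponent)

def scale_needed(target_ratio, exponent):
    """Resource multiplier needed to multiply loss by target_ratio (below 1)."""
    return target_ratio ** (-1.0 / exponent)

for name, exponent in KAPLAN_EXPONENTS.items():
    print(f"10x {name}: loss multiplied by {loss_ratio(10, exponent):.3f}")

# How many times more parameters for a 10% lower loss?
print(f"params multiplier for 10% lower loss: {scale_needed(0.9, 0.076):.1f}")
```

With these exponents, a tenfold increase in parameters lowers loss by only about 16%, which is why meaningful gains need order-of-magnitude jumps in scale.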

2. Chinchilla (Hoffmann et al., DeepMind, 2022) — "Training Compute-Optimal Large Language Models"

Corrected Kaplan's recommendation:

Key finding: for a fixed compute budget, parameters and training tokens should be scaled in equal proportion, each growing roughly as the square root of compute.

Rule of thumb: ~20 tokens per parameter for compute-optimal training
GPT-3 (175B params, 300B tokens) was significantly undertrained
Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params) using roughly the same training compute

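Chinchilla scaling (exponents from the paper's fits; the constants of proportionality depend on the fitting approach and are omitted here):

N_optimal ∝ C^0.49

D_optimal ∝ C^0.51

Approximately: D_optimal ≈ 20 × N_optimal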
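The tokens-per-parameter comparison is easy to check with quick arithmetic. The sketch below assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 × N × D), which is not stated above. Gopher's 300B training tokens are also not given in the text; that figure comes from the Gopher paper.

```python
# (parameters, training tokens) for each model.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),  # token count is from the Gopher paper, not the text above
    "Chinchilla": (70e9, 1.4e12),
}

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6 * n_params * n_tokens

for name, (n_params, n_tokens) in MODELS.items():
    ratio = tokens_per_param(n_params, n_tokens)
    flops = train_flops(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

Under this approximation Gopher and Chinchilla land within about 20% of each other in training compute, while their tokens-per-parameter ratios differ by more than an order of magnitude.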

Practical Implications of Scaling Laws

For Model Training

| Decision | Guidance from Scaling Laws |
|----------|---------------------------|
| Model size | Larger is better, but must be matched with enough data |
| Data quantity | ~20 tokens/param is the compute-optimal point, not a ceiling; frontier labs often use 10–100× more |
| Training duration | Don't stop early: more steps = lower loss |
| Compute budget | Scale model size and data in roughly equal proportion |

For Inference Efficiency

After Chinchilla, the field shifted to overtrained smaller models:

Train a smaller model on far more tokens than compute-optimal
Result: the model is "overtrained" for its size, but more efficient at inference
Examples: LLaMA 7B (trained on 1–2T tokens, versus roughly 140B tokens at 20 tokens per parameter), Mistral 7B
Inference compute matters too — a smaller overtrained model may outperform a compute-optimal larger model while being cheaper to serve
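The serving-cost side of this trade-off can be sketched numerically. The code below takes the 7B model trained on 1T tokens and asks how large a compute-optimal model would be at the same training budget. It assumes C ≈ 6 × N × D for training and about 2 × N FLOPs per generated token at inference; both are common approximations rather than figures from the text. It compares cost only and says nothing about which model reaches the lower loss.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal_params(flops):
    """Solve C = 6 * N * (20 * N) for N."""
    return math.sqrt(flops / (6 * TOKENS_PER_PARAM))

def inference_flops_per_token(n_params):
    """Approximate forward-pass cost per token, assuming 2 * N FLOPs."""
    return 2 * n_params

small_n, small_d = 7e9, 1e12               # overtrained 7B model
budget = train_flops(small_n, small_d)     # same training compute for both models
optimal_n = compute_optimal_params(budget)

overtraining_factor = small_d / (TOKENS_PER_PARAM * small_n)
serving_cost_ratio = inference_flops_per_token(optimal_n) / inference_flops_per_token(small_n)

print(f"compute-optimal size at the same budget: {optimal_n / 1e9:.1f}B params")
print(f"7B model is trained on {overtraining_factor:.1f}x its compute-optimal token count")
print(f"compute-optimal model costs {serving_cost_ratio:.2f}x more per served token")
```

The per-token serving cost scales with parameter count, so the gap compounds with every token served after deployment.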

Emergent Abilities and Scaling Laws

Some capabilities don't follow smooth power laws — they appear abruptly:

Phase transitions: a capability is absent at small scale, then appears sharply at a threshold
Examples: in-context learning, chain-of-thought reasoning, multi-step arithmetic
Debated: some argue these are measurement artifacts of benchmark thresholds
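The measurement-artifact argument can be illustrated with a toy calculation. Suppose per-token accuracy improves smoothly with scale, but the benchmark only awards credit when every token of a multi-token answer is correct. The sketch below assumes independent per-token errors and a 10-token answer; both are simplifying assumptions, and the accuracy values are made up for illustration.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Probability that all tokens are correct, assuming independent errors."""
    return per_token_accuracy ** answer_length

# Smoothly improving per-token accuracy (illustrative values, not measurements).
per_token_accuracies = [0.50, 0.70, 0.90, 0.95, 0.99]

for p in per_token_accuracies:
    em = exact_match_accuracy(p, 10)
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens {em:.3f}")
```

The underlying skill improves gradually, yet the all-or-nothing metric stays near zero and then climbs steeply, which can look like a sudden emergence.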

Beyond Loss: Downstream Task Scaling

Scaling laws on perplexity (language modeling loss) correlate with downstream task performance — but not perfectly:

Some tasks improve smoothly with scale (MMLU)
Others show emergent threshold behavior (GSM8K, BIG-Bench Hard)
Quality of training data can shift the scaling curve significantly

Data Scaling Laws

Quality > Quantity: high-quality data (books, Wikipedia) is worth more per token than web crawl
Diversity: model capabilities track data diversity
Repetition has diminishing returns: a few passes over the same data cost little, but heavy repetition adds almost nothing and can degrade performance
Data mixture matters: the proportion of code, math, multilingual data in training shapes capabilities

Compute Scaling (Chinchilla Formula in Practice)

For a training budget of C FLOPs, assuming C ≈ 6 × N × D (the usual training-cost approximation) and D ≈ 20 × N:

Optimal N (parameters) ≈ (C / 120)^0.5

Optimal D (tokens) ≈ 20 × N ≈ (C × 10 / 3)^0.5

Example: 10^23 FLOPs budget

Optimal model size: ~29B parameters
Optimal training tokens: ~580B tokens
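A budget-to-model-size calculation under the 20-tokens-per-parameter rule can be sketched as follows. It assumes training compute C ≈ 6 × N × D, so substituting D = 20 × N gives C = 120 × N². The function name `chinchilla_optimal` and its parameters are hypothetical, and the result is a rule-of-thumb estimate rather than the paper's fitted values.

```python
import math

def chinchilla_optimal(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Return (parameters, tokens) satisfying C = 6 * N * D with D = 20 * N."""
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)
print(f"params: {n / 1e9:.1f}B, tokens: {d / 1e9:.0f}B")

# Sanity check: the allocation should spend exactly the budget.
print(f"6 * N * D = {6 * n * d:.3e} FLOPs")
```

Changing `tokens_per_param` shows how the split moves toward smaller models and more data as the training recipe shifts away from the compute-optimal point.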

The "Compute-Optimal" vs. "Inference-Optimal" Distinction

Frontier labs are shifting from compute-optimal to inference-optimal training as deployment costs dominate.

Scaling Laws Limitations

| Limitation | Notes |
|------------|-------|
| Architecture changes reset the curve | New architectures (MoE, Mamba) shift the power law |
| Data quality is not accounted for | Laws assume uniform quality corpora |
| Benchmark saturation | At some scale, benchmarks max out |
| Emergent abilities are discontinuous | Not everything follows a smooth power law |
| Don't predict reasoning ability directly | GPT-4 quality jump required more than just scale |

Related Concepts

Pre-training, Parameters, Compute, Chinchilla, Emergent Abilities, Model Selection, Loss Function