Definition
Alignment is the process of ensuring that an LLM's behavior is helpful, honest, and harmless — that it acts in accordance with human values and intentions rather than just optimizing for statistical text prediction. A misaligned model may be capable but unsafe, deceptive, or harmful.
The Alignment Problem
A model trained purely on next-token prediction learns to mimic text patterns — including harmful, biased, deceptive, and dangerous content present in training data. It has no inherent goal to be helpful or safe. Alignment training bridges this gap.
The Three H's (Anthropic's Framework)
| Property | Meaning |
|----------|---------|
| Helpful | Genuinely useful to users; answers questions, completes tasks |
| Honest | Calibrated uncertainty, doesn't hallucinate or deceive |
| Harmless | Avoids generating content that causes harm |
Alignment Techniques
1. Supervised Fine-Tuning (SFT)
- Train on human-written ideal responses
- Teaches helpful, clear, on-task behavior
- First step after pre-training
- Simplified alternative to RLHF (no separate reward model needed)
- Directly optimizes the LLM on (chosen, rejected) response pairs
- More stable training, simpler implementation
- Increasingly preferred over PPO-based RLHF
- Define a constitution: a set of principles (e.g., "be harmless", "be honest")
- Use AI to critique and revise its own outputs based on the constitution
- Reduces reliance on expensive human labelers
- Produces self-critiqued, principle-aligned outputs
- Replace human raters with a more capable AI model as the judge
- Scales feedback collection without human bottleneck
- Used in combination with human feedback for efficiency
- Aligned models refuse some valid (edge-case) requests
- Safety training may reduce creative/unexpected completions
- In practice, strong alignment + strong capability are increasingly compatible
- Express uncertainty: "I'm not sure, but..."
- Decline harmful requests with a clear reason
- Avoid confidently wrong answers
- Maintain consistent behavior across paraphrased prompts
- Don't change behavior based on perceived user identity
- System prompts add a layer of alignment at the application level
- Guardrails (external classifiers) catch misaligned outputs post-generation
- Red-teaming probes models for alignment failures before deployment
- Constitutional constraints can be enforced via the system prompt
- InstructGPT (OpenAI, 2022) — first large-scale RLHF paper
- Constitutional AI (Anthropic, 2022) — AI-assisted alignment
- DPO (Stanford, 2023) — preference optimization without reward model
- Llama 2 Chat (Meta, 2023) — public RLHF-aligned model
- RLHF, Fine-Tuning, Instruct Model, Guardrails, Hallucination, System Prompt, Harmlessness
2. Reinforcement Learning from Human Feedback (RLHF)
The dominant alignment method:
1. Collect human preference data (which response is better, A or B?)
2. Train a reward model to predict human preferences
3. Use PPO (Proximal Policy Optimization) to optimize the LLM toward higher reward
4. Result: model outputs that humans consistently prefer
3. Direct Preference Optimization (DPO)
4. Constitutional AI (CAI) — Anthropic's Approach
5. RLAIF (RL from AI Feedback)
Alignment vs. Capability
There is often a perceived alignment tax — aligning models slightly reduces raw benchmark performance:
Types of Misalignment
| Type | Description | Example |
|------|-------------|---------|
| Sycophancy | Tells users what they want to hear | Agrees with false premises |
| Hallucination | Confidently states false information | Fabricates citations |
| Instruction hacking | Finds loopholes in instructions | "Ignore previous instructions" attacks |
| Value misspecification | Optimizes the wrong objective | Reward hacking in RLHF |
| Deceptive alignment | Behaves well during training, poorly at deployment | Theoretical concern for future models |
Behavioral Alignment Signals
Aligned models typically:
Alignment in Practice (Developer Perspective)
Evaluation Benchmarks
| Benchmark | What It Measures |
|-----------|-----------------|
| TruthfulQA | Truthfulness / avoiding hallucination |
| BBQ | Bias in QA settings |
| HarmBench | Resistance to harmful prompts |
| WinoBias | Gender bias |
| MACHIAVELLI | Avoidance of power-seeking / deceptive behavior |