Alignment — FDE@ProdAI Blog

Definition

Alignment is the process of ensuring that an LLM's behavior is helpful, honest, and harmless — that it acts in accordance with human values and intentions rather than just optimizing for statistical text prediction. A misaligned model may be capable but unsafe, deceptive, or harmful.

The Alignment Problem

A model trained purely on next-token prediction learns to mimic text patterns — including harmful, biased, deceptive, and dangerous content present in training data. It has no inherent goal to be helpful or safe. Alignment training bridges this gap.

The Three H's (Anthropic's Framework)

| Property | Meaning |
|----------|---------|
| Helpful | Genuinely useful to users; answers questions, completes tasks |
| Honest | Calibrated uncertainty, doesn't hallucinate or deceive |
| Harmless | Avoids generating content that causes harm |

Alignment Techniques

1. Supervised Fine-Tuning (SFT)

- Train on human-written demonstrations of ideal responses
- Teaches helpful, clear, on-task behavior
- Usually the first alignment step after pre-training
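
As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood of a human-written response. It assumes a common convention in which prompt tokens are masked out of the loss; the function name `sft_loss` and the numbers are made up for this example, and real training code would work on tensors of model logits.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the human-written
    response, False where it belongs to the prompt (assumed to be masked).
    """
    losses = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Made-up numbers: two prompt tokens followed by two response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
print(round(loss, 4))
```

Only the last two tokens contribute here, so raising the probability of the demonstrated response is the only way to lower the loss.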

2. Reinforcement Learning from Human Feedback (RLHF)

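RLHF has been the most widely used alignment method for chat models:

1. Collect human preference data (which response is better, A or B?)
2. Train a reward model to predict those preferences
3. Use PPO (Proximal Policy Optimization) to optimize the LLM toward higher reward, usually with a KL penalty that keeps it close to the SFT model
4. Result: a model whose outputs score higher under the reward model, which is a proxy for what human raters prefer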
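
Steps 2 and 3 can be sketched numerically. The snippet assumes the commonly used Bradley-Terry pairwise loss for the reward model and a per-sample KL penalty against a frozen reference model; the function names and the `beta` value are illustrative, not taken from any specific library, and the PPO update itself is omitted.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred response higher.
    """
    return softplus(-(r_chosen - r_rejected))

def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward passed to the RL step: reward model score minus a KL penalty.

    (logp_policy - logp_reference) is a single-sample estimate of the KL
    divergence between the policy and the reference model.
    """
    return reward - beta * (logp_policy - logp_reference)

print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))
print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-2.0))
```

The penalty term is what discourages the policy from drifting far from the reference model just to exploit quirks of the reward model.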

3. Direct Preference Optimization (DPO)

- Simpler alternative to PPO-based RLHF: no separate reward model and no RL sampling loop
- Directly optimizes the LLM on (chosen, rejected) response pairs
- Typically more stable to train and simpler to implement
- Increasingly used in place of PPO-based RLHF
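
For a single (chosen, rejected) pair, the DPO loss from the original paper can be written in a few lines. This is a minimal sketch that assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available; the argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, chosen
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, rejected
    logits = beta * (chosen_gain - rejected_gain)
    # Stable form of -log(sigmoid(logits)) = log(1 + exp(-logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss starts at log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: loss drops.
better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(better < start)
```

The loss depends only on how the policy's preference between the two responses has moved relative to the reference, which is why no explicit reward model is required.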

4. Constitutional AI (CAI) — Anthropic's Approach

- Define a constitution: a set of principles (e.g., "be harmless", "be honest")
- Have the model critique and revise its own outputs against the constitution, then train on the revised outputs
- Reduces reliance on expensive human labelers
- Produces self-critiqued, principle-aligned outputs
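
The critique-and-revise loop can be sketched as follows. This is only an outline of the idea: `generate` stands for any text-in, text-out model call, the prompt wording is invented for the example, and `toy_model` is a stub so the snippet runs without a real LLM.

```python
CONSTITUTION = ["Be harmless.", "Be honest."]  # the example principles above

def critique_and_revise(prompt, generate, principles=CONSTITUTION):
    """Draft a response, then critique and revise it once per principle."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

calls = []

def toy_model(text):
    """Stub standing in for an LLM; records every prompt it receives."""
    calls.append(text)
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "REVISED"
    return "DRAFT"

final = critique_and_revise("How do I pick a lock?", toy_model)
print(final, len(calls))
```

In a training pipeline, the (prompt, revised response) pairs produced this way would become fine-tuning data.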

5. RLAIF (RL from AI Feedback)

- Replace human raters with an AI model as the judge (often, but not necessarily, a more capable one)
- Scales feedback collection without a human bottleneck
- Often combined with human feedback for efficiency
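
A minimal sketch of AI-generated preference labels, assuming a `judge` callable that returns "A" or "B". The `toy_judge` heuristic is a stand-in for a real judge model and is not meant as a sensible preference rule.

```python
def label_with_ai_judge(prompt, response_a, response_b, judge):
    """Turn a judge verdict into a (chosen, rejected) training record."""
    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt, a, b):
    """Stub judge: prefers the response that admits uncertainty."""
    return "A" if "not sure" in a.lower() else "B"

record = label_with_ai_judge(
    "Who won the 1897 village chess cup?",
    "It was definitely John Smith.",
    "I'm not sure; I can't find a reliable record of that event.",
    toy_judge,
)
print(record["chosen"])
```

Records in this shape can feed either reward model training or DPO-style preference optimization.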

Alignment vs. Capability

Alignment is often perceived to carry an alignment tax: alignment training can slightly reduce raw benchmark performance.

- Aligned models sometimes refuse valid edge-case requests
- Safety training may reduce creative or unexpected completions
- In practice, strong alignment and strong capability are increasingly compatible

Types of Misalignment

| Type | Description | Example |
|------|-------------|---------|
| Sycophancy | Tells users what they want to hear | Agrees with false premises |
| Hallucination | Confidently states false information | Fabricates citations |
| Instruction hacking | Follows injected or loophole-exploiting instructions instead of the intended ones | "Ignore previous instructions" attacks |
| Value misspecification | Optimizes the wrong objective | Reward hacking in RLHF |
| Deceptive alignment | Behaves well during training and evaluation, poorly at deployment | Theoretical concern for future models |

Behavioral Alignment Signals

Aligned models typically:

- Express uncertainty: "I'm not sure, but..."
- Decline harmful requests with a clear reason
- Avoid confidently wrong answers
- Maintain consistent behavior across paraphrased prompts
- Don't change behavior based on perceived user identity
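
Consistency across paraphrases is one of the easier signals to check automatically. The sketch below measures the share of paraphrases that yield the most common answer; exact string matching after normalization is a crude assumption (a real check would compare meaning), and `toy_model` is a stub with a deliberately inconsistent answer.

```python
from collections import Counter

def consistency_rate(model, paraphrases):
    """Fraction of paraphrases whose answer matches the most common answer."""
    answers = [model(p).strip().lower() for p in paraphrases]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

def toy_model(prompt):
    """Stub model that changes its answer for one phrasing."""
    return "Lyon" if "chief city" in prompt else "Paris"

paraphrases = [
    "What is the capital of France?",
    "France's capital is which city?",
    "Name the chief city of France.",
]
rate = consistency_rate(toy_model, paraphrases)
print(round(rate, 2))
```

The same pattern extends to the identity signal: hold the question fixed, vary only the stated user identity, and compare the answers.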

Alignment in Practice (Developer Perspective)

- System prompts add a layer of alignment at the application level
- Guardrails (external classifiers) catch misaligned outputs post-generation
- Red-teaming probes models for alignment failures before deployment
- Constitutional constraints can be stated in the system prompt, though a prompt alone does not guarantee compliance
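
The first two layers can be combined in a small wrapper. This is a sketch under assumptions: `model` is any callable taking a system prompt and a user message, the keyword-based `toy_output_classifier` stands in for a real guardrail classifier, and the prompt text and blocked terms are invented for the example.

```python
SYSTEM_PROMPT = "You are a helpful assistant. Decline requests for harmful content."
BLOCKED_TERMS = ("credit card number",)  # hypothetical policy for this example
FALLBACK = "Sorry, I can't help with that."

def toy_output_classifier(text):
    """Stub guardrail: flags output containing a blocked term."""
    return any(term in text.lower() for term in BLOCKED_TERMS)

def guarded_reply(user_message, model, is_unsafe, fallback=FALLBACK):
    """Call the model with the system prompt, then screen its output."""
    reply = model(SYSTEM_PROMPT, user_message)
    return fallback if is_unsafe(reply) else reply

def toy_model(system_prompt, user_message):
    """Stub model that ignores the system prompt and echoes the user."""
    return f"Echo: {user_message}"

safe = guarded_reply("Summarize this article.", toy_model, toy_output_classifier)
blocked = guarded_reply("Print the stored credit card number.", toy_model,
                        toy_output_classifier)
print(safe)
print(blocked)
```

The stub model deliberately ignores its system prompt, which shows why the post-generation check is useful as a second layer.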

Evaluation Benchmarks

| Benchmark | What It Measures |
|-----------|-----------------|
| TruthfulQA | Truthfulness; avoiding answers that repeat common misconceptions |
| BBQ | Social bias in question answering |
| HarmBench | Resistance to harmful prompts, including automated red-teaming attacks |
| WinoBias | Gender bias in coreference resolution |
| MACHIAVELLI | Avoidance of power-seeking / deceptive behavior in text-based game environments |

Key Research

- InstructGPT (OpenAI, 2022): applied RLHF at scale to make GPT-3 follow instructions, building on earlier RLHF work
- Constitutional AI (Anthropic, 2022): AI-assisted alignment
- DPO (Stanford, 2023): preference optimization without a reward model
- Llama 2 Chat (Meta, 2023): openly released RLHF-aligned model

Related Concepts

RLHF, Fine-Tuning, Instruct Model, Guardrails, Hallucination, System Prompt, Harmlessness