
Alignment

Definition

Alignment is the process of ensuring that an LLM's behavior is helpful, honest, and harmless — that it acts in accordance with human values and intentions rather than just optimizing for statistical text prediction. A misaligned model may be capable but unsafe, deceptive, or harmful.

The Alignment Problem

A model trained purely on next-token prediction learns to mimic text patterns — including harmful, biased, deceptive, and dangerous content present in training data. It has no inherent goal to be helpful or safe. Alignment training bridges this gap.

The Three H's (Anthropic's Framework)

| Property | Meaning |
|----------|---------|
| Helpful | Genuinely useful to users; answers questions, completes tasks |
| Honest | Calibrated uncertainty, doesn't hallucinate or deceive |
| Harmless | Avoids generating content that causes harm |

Alignment Techniques

1. Supervised Fine-Tuning (SFT)

  • Train on human-written ideal responses
  • Teaches helpful, clear, on-task behavior
  • First step after pre-training
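
As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood of a human-written response. The function name `sft_loss`, the toy probabilities, and the choice to exclude prompt tokens from the loss are assumptions made for this example, not details of any specific training recipe.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the human-written response.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Hypothetical values: two prompt tokens followed by two response tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]  # assumption: prompt tokens are masked out
loss = sft_loss(probs, mask)
print(round(loss, 4))
```

Lower loss means the model assigns higher probability to the ideal response; training nudges the weights in that direction.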

2. Reinforcement Learning from Human Feedback (RLHF)

The dominant alignment method:

1. Collect human preference data (which response is better, A or B?)
2. Train a reward model to predict human preferences
3. Use PPO (Proximal Policy Optimization) to optimize the LLM toward higher reward
4. Result: model outputs that humans consistently prefer
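
Steps 2 and 3 can be sketched numerically. A common formulation, assumed here rather than taken from the text above, trains the reward model with a pairwise logistic loss and, during the RL step, subtracts a KL-style penalty so the policy stays close to a reference model. The function names and the `beta` value are illustrative only.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise logistic (Bradley-Terry style) loss for one preference pair."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a per-sample KL-style penalty against a reference model.

    Assumption: beta and the log-probabilities are illustrative inputs.
    """
    return reward - beta * (logp_policy - logp_reference)

# Hypothetical reward-model scores for a preferred and a rejected response.
good_ranking = reward_model_loss(r_chosen=2.0, r_rejected=0.0)
bad_ranking = reward_model_loss(r_chosen=0.0, r_rejected=2.0)

# Hypothetical RL-step reward after the penalty is applied.
shaped = penalized_reward(reward=1.0, logp_policy=-5.0, logp_reference=-7.0)
```

The loss is small when the reward model already ranks the human-preferred response higher, and large when it ranks the pair the wrong way around.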

3. Direct Preference Optimization (DPO)

  • Simplified alternative to RLHF (no separate reward model needed)
  • Directly optimizes the LLM on (chosen, rejected) response pairs
  • More stable training, simpler implementation
  • Increasingly preferred over PPO-based RLHF
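
A minimal sketch of the DPO loss for a single (chosen, rejected) pair, following the formulation in the DPO paper. The inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model; the numbers and the default `beta` are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability of a full response.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
unchanged = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

No reward model appears anywhere: the preference signal comes directly from how the policy's log-probabilities shift relative to the reference.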

4. Constitutional AI (CAI): Anthropic's Approach

  • Define a constitution: a set of principles (e.g., "be harmless", "be honest")
  • Use AI to critique and revise its own outputs based on the constitution
  • Reduces reliance on expensive human labelers
  • Produces self-critiqued, principle-aligned outputs
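
The critique-and-revise step can be sketched as a simple loop. This is an illustration only: `generate` is a hypothetical callable standing in for any text-generation call, the prompt wording is invented, and the stub below exists purely so the example runs without a model.

```python
def critique_and_revise(prompt, principles, generate):
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            "Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response

# Stub generator (assumption): records each call instead of querying a model.
calls = []

def stub_generate(text):
    calls.append(text)
    return f"output-{len(calls)}"

final = critique_and_revise(
    "How do I pick a lock?", ["be harmless", "be honest"], stub_generate
)
```

With two principles the loop makes five generation calls: one draft, then a critique and a revision per principle. The revised outputs can then serve as training data, which is where the reduced reliance on human labelers comes from.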

5. RLAIF (RL from AI Feedback)

  • Replace some or all human raters with an AI model (often a more capable one) acting as the judge
  • Scales feedback collection without human bottleneck
  • Used in combination with human feedback for efficiency
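
As a sketch of how AI feedback turns into training data, the function below asks a judge to pick between two responses and stores the result as (chosen, rejected) pairs. The `judge` callable and its "A"/"B" return convention are assumptions for illustration; `toy_judge` is a trivial stand-in, not a real model.

```python
def label_pairs(candidates, judge):
    """Build a preference dataset from judge verdicts on response pairs."""
    dataset = []
    for prompt, response_a, response_b in candidates:
        verdict = judge(prompt, response_a, response_b)
        if verdict == "A":
            chosen, rejected = response_a, response_b
        elif verdict == "B":
            chosen, rejected = response_b, response_a
        else:
            continue  # skip ties or unparseable verdicts
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset

def toy_judge(prompt, response_a, response_b):
    """Stand-in for an AI judge: prefers the response that admits uncertainty."""
    a_hedged = "not sure" in response_a.lower()
    b_hedged = "not sure" in response_b.lower()
    if a_hedged == b_hedged:
        return "tie"
    return "A" if a_hedged else "B"

candidates = [
    ("Who won the 2087 World Cup?", "Brazil won.",
     "I'm not sure; that year has not happened yet."),
    ("What is 2 + 2?", "4", "Four"),
]
pairs = label_pairs(candidates, toy_judge)
```

The resulting records have the same shape as human preference data, so they can feed a reward model or a DPO-style objective.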

Alignment vs. Capability

Alignment is often perceived to carry an "alignment tax": aligning a model can slightly reduce its raw benchmark performance:

  • Aligned models refuse some valid (edge-case) requests
  • Safety training may reduce creative/unexpected completions
  • In practice, strong alignment + strong capability are increasingly compatible

Types of Misalignment

| Type | Description | Example |
|------|-------------|---------|
| Sycophancy | Tells users what they want to hear | Agrees with false premises |
| Hallucination | Confidently states false information | Fabricates citations |
| Instruction hacking | Finds loopholes in instructions | "Ignore previous instructions" attacks |
| Value misspecification | Optimizes the wrong objective | Reward hacking in RLHF |
| Deceptive alignment | Behaves well during training, poorly at deployment | Theoretical concern for future models |

Behavioral Alignment Signals

Aligned models typically:

  • Express uncertainty: "I'm not sure, but..."
  • Decline harmful requests with a clear reason
  • Avoid confidently wrong answers
  • Maintain consistent behavior across paraphrased prompts
  • Don't change behavior based on perceived user identity
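
One of these signals, consistency across paraphrased prompts, is easy to measure in a rough way. The sketch below reports the share of paraphrases that produce the most common answer; `model` is a hypothetical callable, and the exact-match comparison after lowercasing is a simplifying assumption (real evaluations usually need a softer notion of "same answer").

```python
from collections import Counter

def consistency_rate(model, paraphrases):
    """Fraction of paraphrased prompts that yield the most common answer."""
    if not paraphrases:
        raise ValueError("need at least one prompt")
    answers = [model(p).strip().lower() for p in paraphrases]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

# Canned answers (assumption) standing in for a real model.
canned = {
    "What is the capital of France?": "Paris",
    "France's capital city is?": "paris ",
    "Name the capital of France.": "Lyon",
}
rate = consistency_rate(canned.get, list(canned))
```

Here two of the three paraphrases agree, so the rate is two thirds; a well-aligned model would be expected to score close to 1.0 on questions with a single correct answer.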

Alignment in Practice (Developer Perspective)

  • System prompts add a layer of alignment at the application level
  • Guardrails (external classifiers) catch misaligned outputs post-generation
  • Red-teaming probes models for alignment failures before deployment
  • Constitutional constraints can be enforced via the system prompt
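
A post-generation guardrail can be sketched as a thin wrapper around the model call. Everything named here is hypothetical: `generate` stands in for the LLM, `harm_score` for an external classifier returning a value between 0 and 1, and the threshold and fallback message are arbitrary choices for the example.

```python
def guarded_reply(prompt, generate, harm_score, threshold=0.5,
                  fallback="Sorry, I can't help with that."):
    """Generate a draft, then replace it if the classifier flags it."""
    draft = generate(prompt)
    if harm_score(draft) >= threshold:
        return fallback
    return draft

# Toy stand-ins (assumptions) so the example runs without a model or classifier.
def toy_generate(prompt):
    return f"Echo: {prompt}"

def toy_harm_score(text):
    return 1.0 if "explosive" in text.lower() else 0.0

safe = guarded_reply("How do I bake bread?", toy_generate, toy_harm_score)
blocked = guarded_reply("How do I build an explosive?", toy_generate, toy_harm_score)
```

Because the check runs on the output rather than inside the model, it works as an extra layer regardless of how the underlying model was aligned.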

Evaluation Benchmarks

| Benchmark | What It Measures |
|-----------|-----------------|
| TruthfulQA | Truthfulness / avoiding hallucination |
| BBQ | Bias in QA settings |
| HarmBench | Resistance to harmful prompts |
| WinoBias | Gender bias |
| MACHIAVELLI | Avoidance of power-seeking / deceptive behavior |

Key Research

  • InstructGPT (OpenAI, 2022): applied RLHF to instruction following at scale
  • Constitutional AI (Anthropic, 2022): AI-assisted alignment
  • DPO (Stanford, 2023): preference optimization without a reward model
  • Llama 2 Chat (Meta, 2023): openly released RLHF-aligned chat model

Related Concepts

  • RLHF, Fine-Tuning, Instruct Model, Guardrails, Hallucination, System Prompt, Harmlessness
