LLM-as-Judge

Definition

LLM-as-Judge is an evaluation technique in which one language model assesses the quality of another language model's outputs, acting as an automated evaluator. It enables scalable, nuanced quality measurement for tasks where ground truth is subjective or hard to define.

Why LLM-as-Judge?

Human evaluation is the gold standard but doesn't scale:

  • Expensive: $5–$50 per evaluated response
  • Slow: days to weeks for large eval sets
  • Inconsistent: different human raters disagree
  • Not continuous: can't run after every code commit

Rule-based metrics (BLEU, ROUGE, exact match) measure surface overlap, not quality:

  • "Paris is the capital of France" ≠ "Paris is France's capital" (different strings, same answer)
  • Can't evaluate helpfulness, coherence, tone, or safety with rules

LLM-as-Judge is scalable (~$0.01 per eval), fast (seconds), and captures nuanced quality dimensions.
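
To make the limitation of string-based metrics concrete, here is a small sketch that scores the Paris example with exact match and a simplified unigram-overlap F1. The `token_f1` helper is an illustrative stand-in for overlap metrics in the ROUGE family, not the official ROUGE implementation.

```python
from collections import Counter


def exact_match(reference: str, candidate: str) -> bool:
    """Strict string equality after trimming and lowercasing."""
    return reference.strip().lower() == candidate.strip().lower()


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (simplified overlap metric)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Paris is the capital of France"
candidate = "Paris is France's capital"

em = exact_match(reference, candidate)
f1 = token_f1(reference, candidate)
print(em, round(f1, 2))  # prints: False 0.6
```

Both metrics penalize a paraphrase that a human reader would accept as fully correct, which is the gap a judge model is meant to close.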

Judge Prompt Patterns

Pointwise Scoring

Rate a single response on a scale:

```
System: "You are an expert evaluator."
User: """
Question: {question}
Response: {response}

Rate this response on:
- Accuracy (1-5): Is the information correct?
- Helpfulness (1-5): Does it address the question?
- Conciseness (1-5): Is it appropriately brief?

Explain your ratings, then provide scores as JSON:
{"accuracy": X, "helpfulness": X, "conciseness": X}
"""
```
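
The pointwise prompt asks for an explanation followed by a JSON object, so the caller has to pull the scores out of free text. The following is a minimal sketch of one way to do that, assuming the judge puts a flat JSON object last in its reply. The name `parse_pointwise_scores` and the sample reply are illustrative and not part of any library.

```python
import json
import re

EXPECTED_KEYS = ("accuracy", "helpfulness", "conciseness")


def parse_pointwise_scores(judge_reply: str) -> dict[str, int]:
    """Extract the last flat JSON object in the reply and validate 1-5 scores."""
    candidates = re.findall(r"\{[^{}]*\}", judge_reply)
    if not candidates:
        raise ValueError("no JSON object found in judge reply")
    scores = json.loads(candidates[-1])  # JSONDecodeError is a ValueError
    for key in EXPECTED_KEYS:
        value = scores.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {key!r}: {value!r}")
    return {key: scores[key] for key in EXPECTED_KEYS}


# Hypothetical judge output: explanation first, JSON last.
sample_reply = (
    "The answer is correct and on topic, but it repeats itself.\n"
    '{"accuracy": 5, "helpfulness": 4, "conciseness": 3}'
)
parsed = parse_pointwise_scores(sample_reply)
print(parsed)
```

Rejecting malformed or out-of-range replies, rather than silently coercing them, keeps bad judge outputs from leaking into aggregate scores.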

Pairwise Comparison (MT-Bench style)

Compare two responses and pick the better one:

```
User: """
Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better? Consider accuracy, helpfulness, and clarity.
Answer with "A", "B", or "tie" and briefly explain why.
"""
```

Reference-Based Scoring

Compare against a reference/gold answer:

```
User: """
Question: {question}
Reference Answer: {reference}
Model Response: {response}

Does the model response capture the key information in the reference answer?
Score: 1 (missing key info) to 5 (captures everything)
"""
```

Binary Pass/Fail

Simple yes/no quality check:

```
User: """
Task: Summarize the document in 3 bullet points.
Document: {document}
Response: {response}

Does the response contain exactly 3 bullet points?
Does each bullet point come from the document content?
Answer: pass/fail with explanation.
"""
```
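
The first question in this prompt (exactly 3 bullet points) is mechanical, so it can also be checked in code before or alongside the judge call, leaving the judge to decide whether each bullet is grounded in the document. A minimal sketch follows, assuming bullets start with `-`, `*`, `•`, or a number followed by `.` or `)`. The helper names are hypothetical.

```python
import re

# Matches lines that begin with a common bullet or numbered-list marker.
BULLET_PATTERN = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+\S")


def count_bullets(response: str) -> int:
    """Count lines that look like list items."""
    return sum(1 for line in response.splitlines() if BULLET_PATTERN.match(line))


def has_exactly_three_bullets(response: str) -> bool:
    """Deterministic version of the format check in the judge prompt."""
    return count_bullets(response) == 3


sample_summary = "- Revenue grew.\n- Costs fell.\n- Guidance was raised."
format_ok = has_exactly_three_bullets(sample_summary)
print(format_ok)
```

Handling the format check deterministically removes one source of judge error and keeps the judge prompt focused on the part that needs judgment.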

Choosing the Judge Model

  • The judge should be at least as capable as the model being evaluated
  • A weaker judge tends to produce unreliable scores
  • Common setup: GPT-4o or Claude Opus as judge for GPT-3.5 or smaller models
  • Using a model to judge its own outputs introduces self-preference bias
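
Judge Bias and Mitigation

Position Bias

Judges tend to favor a response because of where it appears in a pairwise comparison, often the first position:

  • Mitigation: run each pair twice with the A/B positions swapped; accept a winner only when both runs agree, otherwise record a tie (or average the scores)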
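
The swap-and-compare mitigation can be sketched as follows. The judge is modeled as a hypothetical callable that takes the question and the two responses in display order and returns "A", "B", or "tie". Both stand-in judges below are toy functions for illustration, not real model clients.

```python
from typing import Callable

# Hypothetical signature: judge(question, first, second) -> "A", "B", or "tie",
# where "A" always refers to the response shown first.
PairwiseJudge = Callable[[str, str, str], str]


def debiased_compare(
    judge: PairwiseJudge, question: str, resp_1: str, resp_2: str
) -> str:
    """Return "1", "2", or "tie" after judging both orderings."""
    first_pass = judge(question, resp_1, resp_2)   # resp_1 shown as A
    second_pass = judge(question, resp_2, resp_1)  # resp_2 shown as A

    # Map position labels back to the underlying responses.
    vote_1 = {"A": "1", "B": "2"}.get(first_pass, "tie")
    vote_2 = {"A": "2", "B": "1"}.get(second_pass, "tie")

    # Only a verdict that survives the swap counts as a win.
    return vote_1 if vote_1 == vote_2 else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "A"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge whose verdict depends on content, not position."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


biased_result = debiased_compare(always_first, "q", "short", "a longer answer")
stable_result = debiased_compare(prefers_longer, "q", "short", "a longer answer")
print(biased_result, stable_result)
```

A purely position-biased judge contradicts itself across the two orderings and collapses to a tie, while a judge that responds to content yields the same winner both times.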

Verbosity Bias

Judges tend to prefer longer responses, even when shorter ones are better:

  • Mitigation: explicitly instruct the judge to evaluate content quality, not length

Self-Enhancement Bias

A model tends to rate its own outputs higher than those of other models:

  • Mitigation: never use a model to evaluate its own outputs

Instruction-Following Bias

Some judges rate responses that superficially follow instructions higher:

  • Mitigation: include ground truth or reference answers in the judge prompt

Calibrating Your Judge

Before relying on LLM-as-Judge, validate it:

1. Create a small calibration set (50–100 examples) with human labels
2. Compare judge scores to human labels
3. Measure agreement (Cohen's kappa for categorical labels) or correlation (Pearson/Spearman for numeric scores)
4. If agreement or correlation is above 0.8, treat the judge as reliable
5. If it is below 0.6, the judge prompt needs improvement
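
As a sketch of the measurement step for categorical labels, Cohen's kappa can be computed directly from its definition: observed agreement corrected for the agreement expected by chance. The label lists below are made-up illustration data, not results from a real calibration run.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled at random with their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 2))  # prints: 0.5
```

Here the raters agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa is 0.5 and the judge prompt would need more work under the thresholds above. In practice, `sklearn.metrics.cohen_kappa_score` and `scipy.stats.spearmanr` cover the categorical and numeric cases respectively.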

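Multi-Judge Ensembling

Use multiple judges and aggregate to reduce single-judge variance:

```python
from statistics import mean

def ensemble_score(judges, question, response):
    # `judges` maps a model name ("gpt-4o", "claude-opus", "gemini-ultra")
    # to a callable that scores (question, response). The callables are
    # placeholders for whatever client code wraps each judge model.
    scores = [judge(question, response) for judge in judges.values()]
    return mean(scores)  # or take a majority vote for categorical verdicts
```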
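
For categorical verdicts such as "A", "B", or "tie", averaging does not apply and a majority vote is the natural aggregate. A minimal sketch follows, assuming each judge has already returned one verdict string. The tie-breaking rule (return a tie when the top two verdicts are level) is one reasonable choice among several.

```python
from collections import Counter


def majority_vote(verdicts: list[str], tie_label: str = "tie") -> str:
    """Return the most common verdict, or `tie_label` if the top two are level."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


clear_winner = majority_vote(["A", "A", "B"])
split_decision = majority_vote(["A", "B"])
print(clear_winner, split_decision)
```

Using an odd number of judges reduces how often the vote falls back to a tie.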

LLM-as-Judge in Eval Pipelines

```python
# RAGAS example: RAG faithfulness evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# test_dataset is assumed to be prepared elsewhere in the format RAGAS expects
results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
# Internally uses LLM-as-Judge for each metric
```

LLM-as-Judge for RLHF/DPO

A key application is RLAIF (RL from AI Feedback):

1. Generate multiple model responses for each prompt
2. Use a strong judge model to rank or score them
3. Use the rankings as preference data for DPO/RLHF

This scales feedback collection without a human bottleneck.

Used by: Anthropic (Constitutional AI) and many open-source RLHF pipelines.
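
The step that turns judge scores into preference data can be sketched as below. This assumes the judge has already assigned a numeric score to each candidate response. The `PreferencePair` type, the `min_margin` filter, and the sample scores are illustrative assumptions, with field names following the prompt/chosen/rejected convention common in preference datasets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    scored_responses: list[tuple[str, float]],
    min_margin: float = 1.0,
) -> Optional[PreferencePair]:
    """Pair the best- and worst-scored responses; skip near-ties."""
    if len(scored_responses) < 2:
        return None
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    # Assumed heuristic: a small score gap is treated as too noisy to train on.
    if best_score - worst_score < min_margin:
        return None
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


# Made-up judge scores for three sampled responses.
pair = build_preference_pair(
    "Explain what a mutex is.",
    [("Response one", 2.0), ("Response two", 4.5), ("Response three", 3.0)],
)
print(pair)
```

Dropping pairs with a small score gap trades dataset size for cleaner preference signal; the right margin depends on how noisy the judge is.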

Limitations

| Limitation | Mitigation |
|------------|-----------|
| Judge model has its own biases | Calibrate against human labels |
| Cost for large eval sets | Use smaller judge model for bulk, larger for validation |
| Can't evaluate factual accuracy reliably | Combine with RAG-based fact checking |
| Sycophancy: judge agrees with authoritative-sounding responses | Include adversarial examples in calibration |
| Sensitive to judge prompt wording | Test multiple phrasings |

When to Use LLM-as-Judge

| Good Use Cases | Poor Use Cases |
|---------------|----------------|
| Helpfulness / tone / clarity | Exact factual verification |
| Instruction following | Code correctness (run the code!) |
| Response coherence | Numerical accuracy |
| Safety screening | Legal compliance verification |
| Format adherence | Benchmark tasks with ground truth |

Related Concepts

  • Evals, RLHF, DPO, RLAIF, Alignment, Benchmarks, RAG, Hallucination
