Definition

LLM-as-Judge is an evaluation technique in which one language model assesses the quality of another model's outputs, acting as an automated evaluator. It enables scalable, nuanced quality measurement for tasks where ground truth is subjective or hard to define.

Why LLM-as-Judge?

Human evaluation is the gold standard but doesn't scale:

Expensive: $5–$50 per evaluated response
Slow: days to weeks for large eval sets
Inconsistent: different human raters disagree
Not continuous: can't run after every code commit

Rule-based metrics (BLEU, ROUGE, exact match) measure surface overlap, not quality:

Exact match treats "Paris is the capital of France" and "Paris is France's capital" as different answers, even though they mean the same thing
Rules cannot evaluate helpfulness, coherence, tone, or safety

LLM-as-Judge is cheap (on the order of $0.01 per evaluation, depending on the judge model and prompt length), fast (seconds per response), and able to capture nuanced quality dimensions.

Judge Prompt Patterns

Pointwise Scoring

Rate a single response on a scale:

System: "You are an expert evaluator."

User: """

Question: {question}

Response: {response}

Rate this response on:

Accuracy (1-5): Is the information correct?
Helpfulness (1-5): Does it address the question?
Conciseness (1-5): Is it appropriately brief?

Explain your ratings, then provide scores as JSON:

{"accuracy": X, "helpfulness": X, "conciseness": X}

"""
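
The pointwise pattern might be wired up in Python along the following lines. This is a minimal sketch: `build_pointwise_prompt` and `parse_scores` are hypothetical helper names, the call to the judge model itself is omitted because it depends on your provider's client, and the parser assumes the judge ends its reply with the flat JSON object that the prompt requests.

```python
import json
import re

POINTWISE_TEMPLATE = """Question: {question}

Response: {response}

Rate this response on:
Accuracy (1-5): Is the information correct?
Helpfulness (1-5): Does it address the question?
Conciseness (1-5): Is it appropriately brief?

Explain your ratings, then provide scores as JSON:
{{"accuracy": X, "helpfulness": X, "conciseness": X}}"""

EXPECTED_KEYS = ("accuracy", "helpfulness", "conciseness")


def build_pointwise_prompt(question: str, response: str) -> str:
    """Fill the judge prompt template with one question/response pair."""
    return POINTWISE_TEMPLATE.format(question=question, response=response)


def parse_scores(judge_output: str) -> dict:
    """Extract the last flat JSON object from the judge's reply and validate it."""
    candidates = re.findall(r"\{[^{}]*\}", judge_output)
    if not candidates:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(candidates[-1])
    for key in EXPECTED_KEYS:
        value = scores.get(key)
        # Reject missing keys, non-integers, and values outside the 1-5 scale.
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {key!r}")
    return {key: scores[key] for key in EXPECTED_KEYS}
```

Validating the parsed scores is worthwhile because a judge can drift from the requested format; failing loudly keeps malformed replies out of the aggregate numbers.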

Pairwise Comparison (MT-Bench style)

Compare two responses and pick the better one:

User: """

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider accuracy, helpfulness, and clarity.

Answer with "A", "B", or "tie" and briefly explain why.

"""

Reference-Based Scoring

Compare against a reference/gold answer:

User: """

Question: {question}

Reference Answer: {reference}

Model Response: {response}

Does the model response capture the key information in the reference answer?

Score: 1 (missing key info) to 5 (captures everything)

"""

Binary Pass/Fail

Simple yes/no quality check:

User: """

Task: Summarize the document in 3 bullet points.

Document: {document}

Response: {response}

Does the response contain exactly 3 bullet points?

Does each bullet point come from the document content?

Answer: pass/fail with explanation.

"""
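
The first check in this prompt is purely structural, so it may be cheaper to verify in code and keep the judge for the second, content-based check. A minimal sketch, assuming bullets are written with `-`, `*`, `•`, or numbered markers; `count_bullets` and `has_exactly_three_bullets` are hypothetical helpers, not part of any library.

```python
import re

# A bullet line: optional indent, a marker (-, *, •, "1." or "1)"), a space, then text.
BULLET_PATTERN = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+\S")


def count_bullets(response: str) -> int:
    """Count lines that start with a bullet or numbered-list marker."""
    return sum(1 for line in response.splitlines() if BULLET_PATTERN.match(line))


def has_exactly_three_bullets(response: str) -> bool:
    """Deterministic stand-in for the first pass/fail question."""
    return count_bullets(response) == 3
```

If this check fails, the response can be marked as a fail without calling the judge at all.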

Choosing the Judge Model

The judge should be at least as capable as the model being evaluated
A weaker judge produces unreliable scores
Common setup: GPT-4o or Claude Opus judging GPT-3.5 or smaller models
Using a model to judge its own outputs introduces self-preference bias

Judge Bias and Mitigation

Position Bias

Judges tend to prefer the first response in pairwise comparisons:

Mitigation: run each pair twice, swapping A/B positions; average results
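
The swap-and-average mitigation can be sketched as follows. The `judge` argument is an assumed callable that takes `(question, first_response, second_response)` and returns `"A"`, `"B"`, or `"tie"`; `debiased_preference` is a hypothetical name used here for illustration.

```python
from typing import Callable

# Assumed judge interface: returns "A" if it prefers the response shown first,
# "B" if it prefers the one shown second, or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, resp_1: str, resp_2: str) -> float:
    """Mean preference for resp_1 over both orderings, in [-1.0, 1.0]."""
    # Points awarded to resp_1 when it is shown first (as A) ...
    points_when_first = {"A": 1.0, "B": -1.0, "tie": 0.0}
    # ... and when it is shown second (as B).
    points_when_second = {"A": -1.0, "B": 1.0, "tie": 0.0}

    first_pass = judge(question, resp_1, resp_2)
    second_pass = judge(question, resp_2, resp_1)
    return (points_when_first[first_pass] + points_when_second[second_pass]) / 2
```

A judge that always picks whichever response is shown first scores 0.0 here, so a purely positional preference cancels out instead of being counted as a win.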

Verbosity Bias

Judges tend to prefer longer responses, even when shorter ones are better:

Mitigation: explicitly instruct the judge to evaluate content quality, not length

Self-Enhancement Bias

A model tends to rate its own outputs higher than those of other models:

Mitigation: never use a model to evaluate its own outputs

Instruction-Following Bias

Some judges give higher ratings to responses that follow the instructions only superficially:

Mitigation: include ground truth or reference answers in the judge prompt

Calibrating Your Judge

Before relying on LLM-as-Judge, validate it:

1. Create a small calibration set (50–100 examples) with human labels

2. Compare judge scores to human labels

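3. Measure agreement: Cohen's kappa for categorical labels, Pearson or Spearman correlation for numeric scores

4. As a rule of thumb, agreement above 0.8 means the judge is reliable

5. If agreement falls below 0.6, the judge prompt needs improvement; treat values in between with caution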
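
Steps 2 and 3 can be sketched for categorical labels with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The calibration labels below are made up for illustration, and `cohens_kappa` is written out by hand so the arithmetic is visible.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration data: human vs. judge pass/fail labels.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")  # kappa = 0.75
```

Here the raters agree on 7 of 8 items (0.875) against a chance level of 0.5, giving a kappa of 0.75. A real calibration set should be larger, as step 1 suggests.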

Multi-Judge Ensembling

Use multiple judges and aggregate to reduce single-judge variance:

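```python
from statistics import mean


def ensemble_score(judges, question, response):
    """Average the scores from several judge callables.

    `judges` maps a model name (e.g. "gpt-4o", "claude-opus", "gemini-ultra")
    to a function that takes (question, response) and returns a numeric score.
    """
    scores = [judge(question, response) for judge in judges.values()]
    return mean(scores)  # or majority vote
```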
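
For categorical verdicts such as "A"/"B" or pass/fail, the majority-vote alternative might look like this. It is a minimal sketch; `majority_vote` and its `tie_label` parameter are hypothetical names.

```python
from collections import Counter


def majority_vote(verdicts: list, tie_label: str = "tie") -> str:
    """Return the most common verdict, or `tie_label` if the top two are level."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    # If the two most frequent verdicts have equal counts, there is no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

Using an odd number of judges makes a level split less likely for two-way verdicts.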

LLM-as-Judge in Eval Pipelines

```python
# RAGAS example: RAG faithfulness evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
# Internally uses LLM-as-Judge for each metric
```

LLM-as-Judge for RLHF/DPO

A key application: RLAIF (RL from AI Feedback):

1. Generate multiple model responses for each prompt

2. Use a strong judge model to rank/score them

3. Use the rankings as preference data for DPO/RLHF

4. The result: feedback collection scales without a human-labeling bottleneck

Used by: Anthropic (Constitutional AI), many open-source RLHF pipelines.
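
Steps 2 and 3 can be sketched as follows. `build_preference_pairs` and the `min_margin` threshold are illustrative assumptions, and the judge scores are taken as already computed. The `prompt`/`chosen`/`rejected` field names follow a layout commonly used for DPO training data, but check what your training library expects.

```python
from itertools import combinations


def build_preference_pairs(prompt: str, scored: list, min_margin: float = 1.0) -> list:
    """Turn judge-scored responses for one prompt into chosen/rejected pairs.

    `scored` is a list of (response_text, judge_score) tuples.
    """
    pairs = []
    for (resp_x, score_x), (resp_y, score_y) in combinations(scored, 2):
        if abs(score_x - score_y) < min_margin:
            continue  # skip near-ties: the preference signal is too weak
        if score_x > score_y:
            chosen, rejected = resp_x, resp_y
        else:
            chosen, rejected = resp_y, resp_x
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Dropping near-ties is a design choice rather than a requirement: it trades dataset size for cleaner preference labels, which matters when the judge's scores are noisy.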

Limitations

| Limitation | Mitigation |
|------------|-----------|
| Judge model has its own biases | Calibrate against human labels |
| Cost for large eval sets | Use smaller judge model for bulk, larger for validation |
| Can't evaluate factual accuracy reliably | Combine with RAG-based fact checking |
| Sycophancy: judge agrees with authoritative-sounding responses | Include adversarial examples in calibration |
| Sensitive to judge prompt wording | Test multiple phrasings |

When to Use LLM-as-Judge

| Good Use Cases | Poor Use Cases |
|---------------|----------------|
| Helpfulness / tone / clarity | Exact factual verification |
| Instruction following | Code correctness (run the code!) |
| Response coherence | Numerical accuracy |
| Safety screening | Legal compliance verification |
| Format adherence | Benchmark tasks with ground truth |

Related Concepts

Evals, RLHF, DPO, RLAIF, Alignment, Benchmarks, RAG, Hallucination
