Definition
Top-K and Top-P (nucleus sampling) are token filtering strategies, applied after temperature scaling, that restrict which tokens can be sampled as the next output. They keep the model from selecting very low-probability, incoherent tokens while preserving diversity.
The Problem They Solve
After temperature scaling, the full vocabulary distribution may still include thousands of tokens with small but non-zero probabilities. Each of these is unlikely on its own, but together the long tail can hold enough probability mass that sampling from the full distribution occasionally produces random, incoherent words. Top-K and Top-P restrict sampling to the "reasonable" candidates.
Top-K Sampling
Keep the K tokens with the highest probability; discard all others.
```
1. Sort tokens by probability (descending)
2. Keep only the top K tokens
3. Set all other probabilities to 0
4. Re-normalize the remaining K probabilities to sum to 1
5. Sample from the truncated distribution
```
Example with K=3:
```
Before: Paris(0.50), France(0.25), Lyon(0.12), dog(0.08), car(0.03), ...
After: Paris(0.575), France(0.287), Lyon(0.138) [re-normalized: each kept probability divided by 0.87]
```
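The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular inference library implements it: the function name `top_k_filter` and the dictionary representation of the distribution are assumptions made for the example, and real implementations typically work on logit tensors. The elided tail of the example distribution is left out because it cannot affect the result for K=3.

```python
import random


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and re-normalize them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Steps 1-2: sort by probability (descending) and keep the top k.
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    # Steps 3-4: dropped tokens get probability 0 (they are simply absent);
    # the survivors are divided by their total so they sum to 1 again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Input from the K=3 example (tail omitted).
before = {"Paris": 0.50, "France": 0.25, "Lyon": 0.12, "dog": 0.08, "car": 0.03}
after = top_k_filter(before, k=3)
print({token: round(p, 3) for token, p in after.items()})

# Step 5: sample one token from the truncated distribution.
next_token = random.choices(list(after), weights=list(after.values()))[0]
print(next_token)
```

Representing dropped tokens by absence rather than by an explicit zero keeps the sketch short; the effect on sampling is the same.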
Problem with Top-K: K is fixed, but the natural distribution width varies:
- When the model is confident (one obvious answer), K=50 still lets in unlikely tokens
- When the model is uncertain (many valid options), K=50 may cut off valid choices
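Min-P sampling, covered in its own section later, is one response to this limitation:
- Relative threshold: adapts like Top-P, but is measured from the top token's probability down
- min_p = 0.05 is a common default
- Gaining adoption in open-source inference (llama.cpp, Ollama)
Related topics: Temperature, Logits and Softmax, Inference, Greedy Decoding, Sampling, Beam Search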
Top-P (Nucleus Sampling)
Keep the smallest set of top-ranked tokens whose cumulative probability is at least P.
```
1. Sort tokens by probability (descending)
2. Accumulate probabilities until sum ≥ P
3. Keep only those tokens
4. Discard the rest (zero them out)
5. Re-normalize and sample
```
Example with P=0.9:
```
Confident case:
Paris(0.85), France(0.06) → cumsum = 0.91 ≥ 0.9 → keep 2 tokens
[Tight nucleus: prevents unlikely tokens]
Uncertain case:
"she"(0.08), "it"(0.07), "he"(0.07), "they"(0.06)... → need 20 tokens to reach 0.9
[Wide nucleus: allows diversity]
```
Advantage: Top-P adapts dynamically to the model's confidence.
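A small sketch can make this adaptivity visible. The function name `top_p_filter` and both test distributions are assumptions for illustration: the confident case extends the Paris example with a few made-up tail tokens, and the uncertain case is a hypothetical flat distribution of 25 equally likely tokens.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # the nucleus is complete once the running sum reaches p
            break
    # Re-normalize the nucleus so it sums to 1 before sampling.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
# Hypothetical flat distribution: 25 tokens at 0.04 each.
uncertain = {f"tok{i}": 0.04 for i in range(25)}

print(len(top_p_filter(confident, p=0.9)))  # nucleus size, confident case
print(len(top_p_filter(uncertain, p=0.9)))  # nucleus size, uncertain case
```

With the same P=0.9, the confident distribution keeps 2 tokens and the flat one keeps 23, which is the behavior a fixed K cannot reproduce.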
Comparison
| Aspect | Top-K | Top-P |
|--------|-------|-------|
| Pool size | Fixed K tokens | Variable — depends on distribution |
| Adapts to confidence | No | Yes |
| Behavior when certain | May include bad tokens | Tight nucleus |
| Behavior when uncertain | May exclude valid tokens | Wide nucleus |
| Default recommendation | Less preferred | Preferred (more adaptive) |
| Typical value | K=40–100 | P=0.9–0.95 |
Using Both Together
Top-K and Top-P can be combined:
1. Apply Top-K first → restricts to at most K tokens
2. Apply Top-P second → further restricts to cumulative P
3. Sample from the intersection
This prevents extremely wide nuclei while maintaining adaptability.
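The combination might look like the sketch below. One detail the steps leave open is whether the Top-P cutoff is measured on the original probabilities or on the re-normalized Top-K survivors; this example assumes the latter, and other implementations may choose differently. The function name and both distributions are hypothetical.

```python
def top_k_top_p_filter(probs: dict[str, float], k: int, p: float) -> dict[str, float]:
    """Apply Top-K, then Top-P on the survivors, then re-normalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # 1. Top-K: at most k candidates survive.
    ranked = ranked[:k]
    total_k = sum(prob for _, prob in ranked)
    # 2. Top-P: smallest prefix of the survivors whose cumulative share reaches p
    #    (assumption: shares are measured after re-normalizing the Top-K set).
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob / total_k
        if cumulative >= p:
            break
    # 3. Re-normalize what is left so it can be sampled from.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical flat distribution: 50 tokens at 0.02 each.
flat = {f"tok{i}": 0.02 for i in range(50)}
confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}

print(len(top_k_top_p_filter(flat, k=50, p=0.95)))      # Top-K too loose to matter
print(len(top_k_top_p_filter(flat, k=10, p=0.95)))      # Top-K caps the nucleus
print(len(top_k_top_p_filter(confident, k=10, p=0.9)))  # Top-P tightens the pool
```

With these inputs the three calls keep 48, 10, and 2 tokens respectively: Top-K bounds the pool when the distribution is flat, and Top-P shrinks it when the model is confident.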
Parameter Interaction with Temperature
The full sampling pipeline:
```
logits → divide by T (temperature) → softmax → Top-K filter → Top-P filter → re-normalize → sample
```
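As a rough end-to-end illustration of this pipeline, the sketch below strings the stages together in plain Python. The function name, its parameters, and the example logits are all assumptions made for the example; it treats T=0 as greedy decoding (since dividing by zero is undefined) and measures Top-P on the re-normalized Top-K survivors.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Temperature -> softmax -> Top-K -> Top-P -> re-normalize -> sample."""
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if temperature <= 0:
        # T=0 cannot be divided by; treat it as greedy decoding.
        return max(logits, key=logits.get)
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(((tok, e / norm) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # Top-K filter: keep at most top_k candidates (None disables it).
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-P filter, measured on the re-normalized survivors.
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    # random.choices treats weights as relative, so it re-normalizes implicitly.
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights)[0]


# Hypothetical logits for four candidate tokens.
logits = {"Paris": 5.0, "France": 3.0, "Lyon": 2.0, "dog": 0.0}
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
print(sample_next_token(logits, temperature=1.5, top_k=2))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when comparing parameter settings.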
Combined recommendations:
| Use Case | Temperature | Top-P | Top-K |
|----------|------------|-------|-------|
| Factual/code | 0.0–0.2 | 1.0 | 1 (greedy) |
| General chat | 0.7–1.0 | 0.9 | 50 |
| Creative writing | 1.0–1.2 | 0.95 | 100 |
| Brainstorming | 1.2–1.5 | 1.0 | 100 |
Min-P Sampling (Newer Alternative)
Min-P filters out tokens whose probability is below min_p × p_max_token, where p_max_token is the probability of the single most likely token:
```
threshold = min_p × p_max_token
keep only tokens where p_i ≥ threshold
```
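The rule translates almost directly into code. The sketch below is illustrative only: `min_p_filter` is a hypothetical helper, the re-normalization step is an assumption carried over from the Top-K and Top-P procedures, and both distributions are made up.

```python
def min_p_filter(probs: dict[str, float], min_p: float = 0.05) -> dict[str, float]:
    """Drop tokens below min_p times the top token's probability, then re-normalize."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Confident case: threshold = 0.05 * 0.85 = 0.0425.
confident = {"Paris": 0.85, "France": 0.06, "Lyon": 0.04, "dog": 0.03, "car": 0.02}
# Hypothetical uncertain case: 20 plausible tokens plus 10 long shots.
# Threshold = 0.05 * 0.0495 = 0.002475.
uncertain = {f"tok{i}": 0.0495 for i in range(20)}
uncertain.update({f"rare{i}": 0.001 for i in range(10)})

print(sorted(min_p_filter(confident)))
print(len(min_p_filter(uncertain)))
```

Because the threshold scales with the top probability, the confident case keeps only Paris and France, while the uncertain case keeps all 20 plausible tokens and drops only the long shots.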
Greedy vs. Sampling vs. Beam Search
| Strategy | Description | Deterministic? |
|----------|-------------|----------------|
| Greedy (T=0) | Always pick highest-prob token | Yes |
| Top-K/Top-P sampling | Sample from filtered distribution | No |
| Beam search | Maintain B candidates, pick best sequence | Yes (for B=1, same as greedy) |
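To make the beam search row concrete, here is a toy search over a hand-written probability table. `TOY_MODEL` and its numbers are invented for the illustration; a real decoder would query a language model for next-token probabilities at each step.

```python
import math

# Hypothetical two-step "model": next-token probabilities for each prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def beam_search(model, beam_width: int, steps: int):
    """Return the best (sequence, log-probability) found with the given beam width."""
    beams = [((), 0.0)]  # each beam is (token sequence, cumulative log-probability)
    for _ in range(steps):
        # Extend every surviving beam by every possible next token.
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        # Keep only the beam_width highest-scoring sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_logp = beam_search(TOY_MODEL, beam_width=1, steps=2)
beam_seq, beam_logp = beam_search(TOY_MODEL, beam_width=2, steps=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

With a beam width of 1 the search follows the greedy path ("A", then "x", sequence probability 0.30). A width of 2 keeps "B" alive long enough to find ("B", "z") at 0.36. Both runs are deterministic, unlike Top-K/Top-P sampling.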
API Defaults (2024)
| Platform | Default Temperature | Default Top-P | Default Top-K |
|----------|-------------------|---------------|---------------|
| OpenAI | 1.0 | 1.0 (disabled) | Not exposed |
| Anthropic | 1.0 | Not set by default | Not set by default |
| AWS Bedrock | Model-dependent | Model-dependent | Model-dependent |
| Ollama | 0.8 | 0.9 | 40 |
Note: using temperature=1 and top_p=1 together, with no Top-K limit set, means sampling from the model's full, unmodified distribution (no restriction).