Definition
Inference is the process of generating output tokens from a trained LLM. It is the "prediction" phase: the frozen, trained model weights are used to produce a response to an input prompt. Inference covers every use of the model after training in which the weights are no longer updated.
The Autoregressive Generation Loop
LLMs generate text one token at a time, feeding each generated token back as input for the next prediction:
```
Step 1: Prompt → [Model] → token_1
Step 2: Prompt + token_1 → [Model] → token_2
Step 3: Prompt + token_1 + token_2 → [Model] → token_3
...
Until: [EOS token] or [max_tokens limit reached]
```
Each step is one forward pass through the full model.
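The loop can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `toy_model`, the `EOS` id, and the countdown rule are all hypothetical stand-ins chosen so the example runs without any dependencies, and the next token is picked greedily for simplicity.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def toy_model(tokens: List[int]) -> List[float]:
    """Stand-in for one forward pass: returns one score per vocabulary entry.

    Hypothetical rule: the highest score goes to (last token - 1), so the
    sequence counts down until it reaches EOS (0).
    """
    vocab_size = 10
    target = max(tokens[-1] - 1, 0)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(prompt: List[int],
             model: Callable[[List[int]], List[float]],
             max_tokens: int) -> List[int]:
    """Autoregressive loop: predict a token, append it, repeat."""
    tokens = list(prompt)           # copy so the caller's prompt is untouched
    generated: List[int] = []
    for _ in range(max_tokens):     # stop condition 1: max_tokens reached
        scores = model(tokens)      # one forward pass per generated token
        next_token = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_token)
        tokens.append(next_token)   # feed the new token back as input
        if next_token == EOS:       # stop condition 2: EOS token produced
            break
    return generated


print(generate([5, 3], toy_model, max_tokens=10))  # [2, 1, 0]
```

With `max_tokens=2` the same call stops early and returns `[2, 1]`, which shows the length limit taking effect before EOS is reached.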
Two Phases of Inference
Prefill Phase
- Processes the entire input prompt in a single forward pass (parallel across tokens)
- Computes key-value (KV) pairs for all input tokens
- Stores them in the KV cache for reuse in the decode phase
- Compute-bound (high GPU utilization)
Decode Phase (Autoregressive)
- Generates one token per step
- Each step attends to the cached KV entries from prefill plus those of all previously generated tokens
- Memory-bandwidth-bound (the GPU rereads the model weights and the KV cache at every step)
- Sequential: each token depends on the previous one, so the steps of a single sequence cannot run in parallel
Sampling Strategies
The model outputs a probability distribution over the vocabulary. Sampling determines how the next token is chosen:
Greedy Decoding
- Always pick the highest-probability token
- Deterministic in principle (same input → same output)
- Fast, but often produces repetitive, bland text
Temperature Sampling
- Divide logits by temperature T before softmax
- T < 1.0: sharpens the distribution → more focused/deterministic
- T > 1.0: flattens the distribution → more random/creative
- T = 0: treated as greedy decoding (the limit as T → 0, since dividing by zero is undefined)
Top-K Sampling
- Restrict sampling to the K most probable tokens
- Discard the rest, renormalize, then sample
- K = 1 is equivalent to greedy; K = 50 is a common setting that balances variety and focus
Top-P (Nucleus) Sampling
- Restrict sampling to the smallest set of tokens whose cumulative probability is ≥ P
- Acts as a dynamic K that adapts to confidence (when the model is confident, fewer tokens qualify)
- P = 0.9: common default; P = 1.0: full distribution
Beam Search
- Maintain B "beams" (candidate sequences) simultaneously
- At each step, expand all beams and keep the top B continuations
- Final answer: the highest-scoring beam
- More thorough than greedy, but slower and can produce generic text
Min-P Sampling
- Filter out tokens with probability below min_p × (probability of the top token)
- A relatively recent method that is gaining adoption
Key Inference Parameters
| Parameter | Effect | Typical Range |
|-----------|--------|--------------|
| temperature | Randomness of sampling | 0.0–2.0 (common defaults: 0.7–1.0) |
| top_p | Nucleus sampling threshold | 0.0–1.0 (common default: 0.9) |
| top_k | Top-K candidate pool | 1–100+ (common default: 50) |
| max_tokens | Maximum output length | 1–100K+ |
| stop | Token sequences that end generation | Custom strings |
| frequency_penalty | Reduce repetition | 0.0–2.0 |
| presence_penalty | Encourage topic diversity | 0.0–2.0 |
KV Cache
The Key-Value (KV) cache is critical for inference efficiency:
- During prefill, the K and V matrices of every attention layer are computed for all prompt tokens and stored
- During decode, each new token computes only its own K and V, then attends to the cache
- Without a KV cache, each decode step would reprocess all previous tokens, which is O(n²) attention work per step
- With a KV cache: O(n) per new token
- KV cache memory: `2 × layers × heads × head_dim × seq_len × batch_size × bytes` (the factor 2 covers keys and values; heads is the number of key/value heads; bytes is the size of one element, e.g. 2 for FP16)
Inference Hardware
| Hardware | Use Case |
|----------|---------|
| NVIDIA A100/H100 | Production, large models |
| NVIDIA RTX 4090 | Local inference, small-medium models |
| Apple M-series | Consumer local inference (via Metal) |
| AWS Inferentia | AWS-optimized inference chips |
| Google TPU | Google Cloud inference |
Inference Optimization Techniques
| Technique | Description | Benefit |
|-----------|-------------|---------|
| Quantization (INT8/INT4) | Reduce weight precision | 2–4× memory reduction |
| Speculative Decoding | A small draft model proposes tokens; the large model verifies them | 2–3× faster decoding |
| Continuous Batching | Dynamic batching of requests | High GPU utilization |
| Flash Attention | Memory-efficient attention computation | 2–4× faster attention |
| Tensor Parallelism | Split model across multiple GPUs | Enables larger models |
| vLLM / PagedAttention | Efficient KV cache memory management | High throughput |
Streaming
Rather than waiting for the full output, streaming sends tokens to the user as they are generated:
- Lower perceived latency (the first token arrives quickly)
- Better user experience for long responses
- Supported by most major APIs, typically over server-sent events (SSE) or WebSockets
Latency vs. Throughput
| Metric | Definition | Optimize For |
|--------|-----------|-------------|
| Time to First Token (TTFT) | Time from request to first token | Real-time chat |
| Tokens per Second (TPS) | Generation speed | Long documents |
| Time per Output Token (TPOT) | Average time between output tokens (inverse of TPS) | Batch jobs |
| End-to-End Latency | Total time for complete response | API benchmarking |
Inference Frameworks
- vLLM: high-throughput server built around PagedAttention
- TGI (Text Generation Inference): Hugging Face's production server
- Ollama: user-friendly local inference
- llama.cpp: CPU/GPU inference with GGUF quantized models
- TensorRT-LLM: NVIDIA's optimized inference framework
Related Terms
- Token, Context Window, Temperature, Sampling, KV Cache, Latency, Quantization, Streaming
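The metrics in the Latency vs. Throughput table can be derived from the arrival time of each streamed token. The sketch below is one way to do it, under stated assumptions: `latency_metrics` and `LatencyMetrics` are hypothetical names, timestamps are in seconds, and TPOT is averaged over the tokens after the first (a common convention, but not the only one), so TPS here reflects decode speed only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyMetrics:
    ttft: float        # time to first token, seconds
    tpot: float        # mean time per output token after the first, seconds
    tps: float         # decode-phase tokens per second (1 / tpot)
    end_to_end: float  # total time for the complete response, seconds


def latency_metrics(request_time: float, token_times: List[float]) -> LatencyMetrics:
    """Compute latency metrics from a request timestamp and token arrival times."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps to measure TPOT")
    ttft = token_times[0] - request_time
    end_to_end = token_times[-1] - request_time
    # Assumption: the first token belongs to prefill, so it is excluded here.
    decode_steps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / decode_steps
    tps = 1.0 / tpot
    return LatencyMetrics(ttft=ttft, tpot=tpot, tps=tps, end_to_end=end_to_end)


# Request sent at t=0.0; five tokens arrive, the first after 0.5 s.
print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
# LatencyMetrics(ttft=0.5, tpot=0.25, tps=4.0, end_to_end=1.5)
```

In this example TTFT (0.5 s) is twice TPOT (0.25 s). That matches the usual pattern: the first token pays for the whole prefill pass, while each later token costs one decode step.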