Inference — FDE@ProdAI Blog

Definition

Inference is the process of generating output tokens from a trained LLM. It is the "prediction" phase: the frozen, trained model weights are used to produce a response to an input prompt. No learning happens during inference; the weights are only read, never updated.

The Autoregressive Generation Loop

LLMs generate text one token at a time, feeding each generated token back as input for the next prediction:

Step 1: Prompt → [Model] → token_1

Step 2: Prompt + token_1 → [Model] → token_2

Step 3: Prompt + token_1 + token_2 → [Model] → token_3

...

Until: [EOS token] or [max_tokens limit reached]

Each step is one forward pass through the full model.
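The loop above can be sketched in a few lines of Python. This is only an illustration: `toy_model`, the `EOS` token id, and the function names are made up for this example, and a real model would return a probability distribution rather than a single token id.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_tokens: int) -> List[int]:
    """Run the autoregressive loop: one call to next_token per generated token."""
    tokens = list(prompt)
    generated: List[int] = []
    for _ in range(max_tokens):
        token = next_token(tokens)  # one forward pass over prompt + output so far
        if token == EOS:
            break                   # stop condition 1: EOS token
        generated.append(token)
        tokens.append(token)        # feed the new token back in as input
    return generated                # stop condition 2: max_tokens reached


def toy_model(tokens: List[int]) -> int:
    """Stand-in for a real model: counts down from the last token until EOS."""
    return max(tokens[-1] - 1, 0)


print(generate(toy_model, [5, 3], max_tokens=10))
```

Both stop conditions from the loop are present: the function returns early on `EOS`, and otherwise stops after `max_tokens` steps.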

Two Phases of Inference

Prefill Phase

Process the entire input prompt in one batch (parallelizable)
Compute key-value (KV) pairs for all input tokens
Store KV cache for reuse in decode phase
Compute-bound (GPU utilization high)

Decode Phase (Autoregressive)

Generate one token per step
Each step attends over the cached K/V from prefill plus the K/V of all previously generated tokens
Memory-bandwidth-bound (each step re-reads the model weights and the KV cache to produce a single token)
Sequential — cannot be fully parallelized
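To make the two phases concrete, here is a minimal single-head attention sketch in plain Python. It rests on simplifying assumptions: `project` uses fixed scale factors as a stand-in for the learned key and value weight matrices, and the raw token vector doubles as the query. None of these names come from a real library.

```python
import math
from typing import List, Tuple

Vector = List[float]


def project(x: Vector, scale: float) -> Vector:
    """Toy stand-in for a learned projection (W_k or W_v): scales the input."""
    return [scale * v for v in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]


def prefill(prompt: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Prefill: compute K and V for every prompt token at once; this is the cache."""
    keys = [project(x, 0.5) for x in prompt]
    values = [project(x, 2.0) for x in prompt]
    return keys, values


def decode_step(x: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Decode: compute K/V only for the new token, append to the cache, attend."""
    keys.append(project(x, 0.5))
    values.append(project(x, 2.0))
    return attend(x, keys, values)


prompt = [[1.0, 0.0], [0.0, 1.0]]
new_token = [1.0, 1.0]
keys, values = prefill(prompt)
cached_output = decode_step(new_token, keys, values)
```

The decode step touches only one new token's projections, yet it yields the same attention output as recomputing K and V for the whole sequence from scratch.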

Sampling Strategies

The model outputs a probability distribution over the vocabulary. Sampling determines how the next token is chosen:

Greedy Decoding

Always pick the highest-probability token
Deterministic in principle (same input → same output; in practice, floating-point and batching effects can still cause small differences)
Fast, but often produces repetitive, bland text

Temperature Sampling

Divide logits by temperature T before softmax
T < 1.0: sharpens distribution → more focused/deterministic
T > 1.0: flattens distribution → more random/creative
T = 0: dividing by zero is undefined, so implementations treat it as greedy decoding (the limit as T → 0)
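A minimal sketch of temperature scaling, using only the standard library. The function names are illustrative, and special-casing T = 0 as an argmax is an assumption about how an implementation might handle it, not a description of any particular library.

```python
import math
import random
from typing import List, Optional


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id; T = 0 is special-cased as greedy (argmax)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    generator = rng if rng is not None else random.Random()
    return generator.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: more mass on token 0
print(softmax_with_temperature(logits, 2.0))  # flatter: mass spread out
```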

Top-K Sampling

Restrict sampling to the top K most probable tokens
Discard the rest, renormalize, then sample
K = 1: greedy; K = 50: balanced variety
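As a rough sketch, the filter-and-renormalize step might look like this. `top_k_filter` is a hypothetical helper, and it works on probabilities for readability; real implementations usually mask logits instead.

```python
from typing import Dict, List


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most probable token ids and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))  # only token ids 0 and 1 survive
```

Sampling then proceeds over the surviving tokens only; with `k=1` a single token survives, which is greedy decoding.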

Top-P (Nucleus) Sampling

Restrict to the smallest set of tokens whose cumulative probability ≥ P
Dynamic K — adapts to confidence (when model is confident, fewer tokens qualify)
P = 0.9: common default; P = 1.0: full distribution
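The "dynamic K" behavior is easy to see in a small sketch. `top_p_filter` is an illustrative helper rather than a library function, and it operates on probabilities for clarity.

```python
from typing import Dict, List


def top_p_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


confident = [0.95, 0.03, 0.01, 0.01]
uncertain = [0.4, 0.3, 0.25, 0.05]
print(len(top_p_filter(confident, 0.9)))  # few tokens qualify
print(len(top_p_filter(uncertain, 0.9)))  # more tokens qualify
```

With the same P, the confident distribution keeps a single token while the uncertain one keeps three.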

Beam Search

Maintain B "beams" (candidate sequences) simultaneously
At each step, expand all beams and keep top B continuations
Final answer: highest-scoring beam
More thorough than greedy, but slower and can produce generic text
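Here is a small sketch of the procedure on a hand-built toy model. `toy_step` and its probability table are invented for illustration. The sketch scores beams by summed log-probability and omits EOS handling and length normalization, both of which practical implementations typically add.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(step: Callable[[Sequence], Dict[str, float]],
                beam_width: int,
                length: int) -> Tuple[Sequence, float]:
    """Return the best sequence of `length` tokens and its total log-probability."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(length):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            for token, prob in step(seq).items():  # expand every beam
                candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]            # keep the top B continuations
    return beams[0]                                # highest-scoring beam


def toy_step(seq: Sequence) -> Dict[str, float]:
    """Hypothetical next-token distribution, chosen so greedy is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table[seq]


print(beam_search(toy_step, beam_width=1, length=2)[0])  # greedy path
print(beam_search(toy_step, beam_width=2, length=2)[0])  # wider beam
```

With a beam width of 1 the search commits to "a" and ends at probability 0.33; with a width of 2 it keeps "b" alive and finds the 0.36 path.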

Min-P Sampling

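Filter out tokens with probability below min_p × (probability of the top token), then renormalize and sample
The cutoff scales with model confidence: a dominant top token raises the threshold, a flat distribution lowers it
Relatively new and gaining adoption, often as an alternative to top-p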
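A minimal sketch of the min-p filter follows. `min_p_filter` is an illustrative name, and the code works on probabilities directly for clarity.

```python
from typing import Dict, List


def min_p_filter(probs: List[float], min_p: float) -> Dict[int, float]:
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [i for i, prob in enumerate(probs) if prob >= threshold]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


peaked = [0.6, 0.3, 0.05, 0.05]
flat = [0.25, 0.25, 0.25, 0.25]
print(len(min_p_filter(peaked, 0.1)))  # threshold is 0.06, so the tail is cut
print(len(min_p_filter(flat, 0.1)))    # threshold is 0.025, so everything stays
```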

Key Inference Parameters

| Parameter | Effect | Typical Range |
|-----------|--------|---------------|
| temperature | Randomness of sampling | 0.0–2.0 (default: 0.7–1.0) |
| top_p | Nucleus sampling threshold | 0.0–1.0 (default: 0.9) |
| top_k | Top-K candidate pool | 1–100+ (default: 50) |
| max_tokens | Maximum output length | 1–100K+ |
| stop | Token sequences that end generation | Custom strings |
| frequency_penalty | Reduce repetition | 0.0–2.0 |
| presence_penalty | Encourage topic diversity | 0.0–2.0 |
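The two penalty parameters are the least self-explanatory rows in the table, so here is a sketch of one common formulation: each token's logit is reduced in proportion to how often it has already appeared (frequency) plus a flat amount if it has appeared at all (presence). Individual APIs may differ in the details, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter
from typing import List


def apply_penalties(logits: List[float],
                    generated: List[int],
                    frequency_penalty: float,
                    presence_penalty: float) -> List[float]:
    """Lower the logits of token ids that already appear in the output."""
    counts = Counter(generated)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # frequency term grows with each repeat; presence term is applied once
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


logits = [1.0, 1.0, 1.0]
print(apply_penalties(logits, generated=[0, 0, 2],
                      frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 was generated twice and is penalized most, token 2 once, and token 1 is untouched.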

KV Cache

The Key-Value (KV) cache is critical for inference efficiency:

During prefill, all K and V matrices for each attention layer are computed and stored
During decode, each new token only needs to compute its own K, V and attend to the cache
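Without KV cache, each decode step would recompute K and V for all previous tokens: O(n²) attention work per step
With KV cache: O(n) attention work per new token, since only one query attends over n cached keys
KV cache memory: 2 (K and V) × layers × KV heads × head_dim × seq_len × batch_size × bytes per element (KV heads can be fewer than attention heads in models that use grouped-query or multi-query attention)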
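Plugging numbers into the memory formula shows why the cache matters. The configuration below is hypothetical (32 layers, head dimension 128, 4096-token sequence, 2 bytes per element for FP16), and the parameter name `kv_heads` is this sketch's label for the formula's head count, meaning the number of heads whose keys and values are actually stored.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_element: int = 2) -> int:
    """KV cache size in bytes; the leading 2 accounts for storing both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_element


GIB = 1024 ** 3

# Hypothetical model, one sequence of 4096 tokens, FP16 cache
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(full / GIB)  # 2.0 GiB

# Same model if only 8 heads' worth of K/V were stored
shared = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)
print(shared / GIB)  # 0.5 GiB
```

The size grows linearly with both sequence length and batch size, so a batch of 16 such sequences would need 32 GiB for the cache alone in the first configuration.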

Inference Hardware

| Hardware | Use Case |
|----------|----------|
| NVIDIA A100/H100 | Production, large models |
| NVIDIA RTX 4090 | Local inference, small-medium models |
| Apple M-series | Consumer local inference (via Metal) |
| AWS Inferentia | AWS-optimized inference chips |
| Google TPU | Google Cloud inference |

Inference Optimization Techniques

| Technique | Description | Typical Benefit |
|-----------|-------------|-----------------|
| Quantization (INT8/INT4) | Reduce weight precision | 2–4× memory reduction vs. FP16 |
| Speculative Decoding | Small draft model proposes tokens; large model verifies them in parallel | Up to 2–3× faster generation (workload-dependent) |
| Continuous Batching | Requests join and leave the running batch at each decode step | High GPU utilization |
| Flash Attention | Memory-efficient attention computation | 2–4× faster attention |
| Tensor Parallelism | Split each layer's weights across multiple GPUs | Enables larger models |
| vLLM / PagedAttention | Efficient KV cache memory management | High throughput |
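As an illustration of the quantization row, here is a sketch of symmetric per-tensor INT8 quantization, one of the simplest schemes. Production quantizers are usually more elaborate (per-channel or per-group scales, calibration data), and the function names here are illustrative.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
print(q)                     # small integers, 1 byte each instead of 2 (FP16)
print(dequantize(q, scale))  # close to the original weights
```

Each weight is recovered to within half a quantization step (`scale / 2`), which is the precision traded for the memory saving.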

Streaming

Rather than waiting for full output, streaming sends tokens to the user as they are generated:

Lower perceived latency (first token arrives quickly)
Better user experience for long responses
Supported by most major APIs, typically via Server-Sent Events (SSE) or WebSocket
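On the server side, streaming over SSE amounts to wrapping each token in a `data:` frame as soon as it is available. The sketch below assumes a JSON payload with a `token` field and a `[DONE]` sentinel at the end; both are conventions chosen for this example (some APIs use similar ones), not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per token, then an end marker."""
    for token in tokens:
        # JSON-encode so that newlines inside a token cannot break the framing
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (assumed convention)


for frame in sse_events(["Hel", "lo", "!"]):
    print(frame, end="")
```

Because `sse_events` is a generator, each frame can be flushed to the client before the next token has even been computed.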

Latency vs. Throughput

| Metric | Definition | Optimize For |
|--------|------------|--------------|
| Time to First Token (TTFT) | Time from request to first token | Real-time chat |
| Tokens per Second (TPS) | Generation speed | Long documents |
| Time per Output Token (TPOT) | Average time between consecutive output tokens (inverse of TPS) | Batch jobs |
| End-to-End Latency | Total time for complete response | API benchmarking |
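All four metrics can be derived from the request timestamp and the arrival time of each token. The helper below is a sketch under one assumption that benchmarks do not all share: TPOT is averaged over the decode phase only, so the first token's latency is excluded.

```python
from typing import Dict, List


def latency_metrics(request_time: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, TPS, and end-to-end latency from arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    ttft = token_times[0] - request_time
    # average gap between consecutive tokens, excluding the first (decode only)
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {
        "ttft": ttft,
        "tpot": tpot,
        "tps": 1.0 / tpot,
        "e2e": token_times[-1] - request_time,
    }


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

In this example the first token takes 0.5 s, later tokens arrive every 0.25 s (4 tokens per second), and the full response takes 1.5 s.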

Inference Serving Frameworks

vLLM — high-throughput server, PagedAttention
TGI (Text Generation Inference) — HuggingFace's production server
Ollama — local inference, user-friendly
llama.cpp — CPU/GPU inference, GGUF quantized models
TensorRT-LLM — NVIDIA's optimized inference framework

Related Concepts

Token, Context Window, Temperature, Sampling, KV Cache, Latency, Quantization, Streaming