Intermediate·5 min read

Inference

Definition

Inference is the process of generating output tokens from a trained LLM. It is the "prediction" phase: the frozen, trained model weights are used to produce a response to an input prompt. Unlike training, inference does not update the weights.

The Autoregressive Generation Loop

LLMs generate text one token at a time, feeding each generated token back as input for the next prediction:

```
Step 1: Prompt → [Model] → token_1
Step 2: Prompt + token_1 → [Model] → token_2
Step 3: Prompt + token_1 + token_2 → [Model] → token_3
...
Until: [EOS token] or [max_tokens limit reached]
```

Each step is one forward pass through the full model.
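The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `toy_model`, the `EOS` id, and the `generate` helper are all hypothetical names introduced here, and the "model" is a simple countdown function standing in for a forward pass.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_tokens: int = 16) -> List[int]:
    """Run the autoregressive loop and return only the newly generated tokens."""
    sequence = list(prompt)
    generated: List[int] = []
    for _ in range(max_tokens):
        token = next_token(sequence)  # one "forward pass" over the sequence so far
        if token == EOS:              # stop condition 1: EOS token
            break
        generated.append(token)
        sequence.append(token)        # feed the new token back as input
    return generated                  # stop condition 2: max_tokens reached


def toy_model(sequence: List[int]) -> int:
    """Stand-in for a real model: counts down from the last token until EOS."""
    return max(sequence[-1] - 1, EOS)


print(generate([5, 3], toy_model))  # [2, 1]
```

A real implementation would replace `toy_model` with a forward pass plus a sampling step, but the control flow stays the same.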

Two Phases of Inference

Prefill Phase

  • Process the entire input prompt in one batch (parallelizable)
  • Compute key-value (KV) pairs for all input tokens
  • Store KV cache for reuse in decode phase
  • Compute-bound (GPU utilization high)

Decode Phase (Autoregressive)

  • Generate one token per step
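  • Each step attends over the cached K/V from prefill plus the K/V of all previously generated tokens
  • Memory-bandwidth bound (each step re-reads the model weights and the KV cache to produce a single token)
  • Sequential: each token depends on the previous one, so the steps of one sequence cannot run in parallel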
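To make the two phases concrete, here is a toy single-head attention layer with a KV cache. It is a sketch under strong simplifying assumptions: the class `ToyAttentionLayer` is hypothetical, the query/key/value projections are the identity (so K = V = Q = the token vector), and there is no batching or multi-layer stack.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention over a list of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class ToyAttentionLayer:
    """One attention head with a KV cache (identity projections, for brevity)."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def prefill(self, prompt: List[Vector]) -> Vector:
        # Prefill: every prompt token is known, so all K/V entries are stored at once.
        self.keys.extend(prompt)
        self.values.extend(prompt)
        return attend(prompt[-1], self.keys, self.values)

    def decode_step(self, token: Vector) -> Vector:
        # Decode: only the new token's K/V are added; earlier entries are reused.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)


layer = ToyAttentionLayer()
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer.prefill(prompt)
out = layer.decode_step([0.5, 0.5])
print(len(layer.keys))  # 4
```

Each `decode_step` call grows the cache by exactly one entry and reads all of it, which is why decode cost is dominated by memory reads rather than arithmetic.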

Sampling Strategies

The model outputs a probability distribution over the vocabulary. Sampling determines how the next token is chosen:

Greedy Decoding

  • Always pick the highest-probability token
  • Deterministic (same input → same output every time)
  • Fast, but often produces repetitive, bland text
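As a minimal sketch, greedy decoding is just an argmax over the logits. The helper name `greedy_pick` and the example logits are made up for illustration.

```python
from typing import List


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the largest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [1.2, 3.4, 0.5, 3.1]
print(greedy_pick(logits))  # 1
```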

Temperature Sampling

  • Divide logits by temperature T before softmax
  • T < 1.0: sharpens distribution → more focused/deterministic
  • T > 1.0: flattens distribution → more random/creative
  • T → 0: approaches greedy decoding; since dividing by zero is undefined, implementations typically treat T = 0 as greedy
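The effect of temperature can be seen by computing the softmax at a few values of T. This is a small sketch; the function name `softmax_with_temperature` and the example logits are assumptions for illustration, and T = 0 is rejected here rather than mapped to greedy decoding.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for T = 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running this shows the top token's probability shrinking as T increases, which is the "flattening" described above.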

Top-K Sampling

  • Restrict sampling to the top K most probable tokens
  • Discard the rest, renormalize, then sample
  • K = 1: greedy; K = 50: balanced variety
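A minimal sketch of the restrict, renormalize, sample sequence follows. The helper names `top_k_filter` and `sample` are hypothetical, and the input is assumed to already be a probability distribution (real implementations often filter logits instead).

```python
import random
from typing import Dict, List


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most probable token ids and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def sample(dist: Dict[int, float], rng: random.Random) -> int:
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
kept = top_k_filter(probs, k=2)
print(sorted(kept))                   # [0, 1]
print(sample(kept, random.Random(0)))  # always one of the kept ids
```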

Top-P (Nucleus) Sampling

  • Restrict to the smallest set of tokens whose cumulative probability ≥ P
  • Dynamic K — adapts to confidence (when model is confident, fewer tokens qualify)
  • P = 0.9: common default; P = 1.0: full distribution
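The "dynamic K" behaviour is easy to see in code. The sketch below uses a hypothetical helper `top_p_filter` and two made-up distributions, one confident and one uncertain; it assumes the input is already normalized.

```python
from typing import Dict, List


def top_p_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of most-probable tokens whose cumulative mass is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the survivors


confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.4, 0.3, 0.2, 0.1]
print(len(top_p_filter(confident, 0.9)))  # 1
print(len(top_p_filter(uncertain, 0.8)))  # 3
```

One caveat worth knowing: because the cumulative sum is floating-point, a threshold that lands exactly on a partial sum (for example 0.7 + 0.2 versus P = 0.9) may keep one extra token.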

Beam Search

  • Maintain B "beams" (candidate sequences) simultaneously
  • At each step, expand all beams and keep top B continuations
  • Final answer: highest-scoring beam
  • More thorough than greedy, but slower and can produce generic text
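The expand-and-prune step can be sketched as follows. This is a simplified illustration: `beam_search` and `toy_step` are hypothetical, sequences are scored by summed log-probability, and there is no EOS handling or length normalization. The toy distribution is constructed so that the greedy path is not the best one.

```python
import math
from typing import Callable, Dict, List, Tuple

StepFn = Callable[[List[str]], Dict[str, float]]


def beam_search(step_fn: StepFn, beam_width: int, steps: int) -> Tuple[float, List[str]]:
    """Return (log-probability, sequence) of the best beam after a fixed number of steps."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Expand every beam with every possible next token.
            for token, prob in step_fn(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the top `beam_width` continuations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_step(seq: List[str]) -> Dict[str, float]:
    """Made-up next-token distribution that depends only on the last token."""
    if not seq:
        return {"A": 0.6, "B": 0.4}
    if seq[-1] == "A":
        return {"A": 0.55, "B": 0.45}
    return {"A": 0.9, "B": 0.1}


print(beam_search(toy_step, beam_width=1, steps=2)[1])  # ['A', 'A']
print(beam_search(toy_step, beam_width=2, steps=2)[1])  # ['B', 'A']
```

With a width of 1 the search commits to "A" first (probability 0.6) and ends at a total of 0.33, while a width of 2 keeps "B" alive long enough to find the 0.36 path.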

Min-P Sampling

  • Filter out tokens with probability below min_p × (probability of the top token), then renormalize and sample
  • Relatively new; gaining adoption in open-source inference tools
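A minimal sketch of the min-p filter, assuming a normalized input distribution; the helper name `min_p_filter` and the example values are illustrative only.

```python
from typing import Dict, List


def min_p_filter(probs: List[float], min_p: float) -> Dict[int, float]:
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the survivors


probs = [0.6, 0.25, 0.1, 0.05]
print(sorted(min_p_filter(probs, min_p=0.2)))  # [0, 1]
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and permissive when the distribution is flat.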

Key Inference Parameters

| Parameter | Effect | Typical Range |
|-----------|--------|---------------|
| temperature | Randomness of sampling | 0.0–2.0 (common default: 0.7–1.0) |
| top_p | Nucleus sampling threshold | 0.0–1.0 (common default: 0.9) |
| top_k | Top-K candidate pool | 1–100+ (common default: 50) |
| max_tokens | Maximum output length | 1–100K+ |
| stop | Token sequences that end generation | Custom strings |
| frequency_penalty | Reduce repetition | 0.0–2.0 |
| presence_penalty | Encourage topic diversity | 0.0–2.0 |
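The two penalties differ in how they scale. The sketch below follows one common formulation (the one described in OpenAI's API documentation): the frequency penalty is multiplied by how many times a token has already appeared, while the presence penalty is a flat deduction for any token that has appeared at all. Other inference stacks may define these differently, and the helper name `apply_penalties` is hypothetical.

```python
from collections import Counter
from typing import List


def apply_penalties(logits: List[float],
                    generated: List[int],
                    frequency_penalty: float = 0.0,
                    presence_penalty: float = 0.0) -> List[float]:
    """Lower the logits of tokens that already appear in the generated output."""
    counts = Counter(generated)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency term grows with each repeat; presence term is applied once.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


logits = [2.0, 2.0, 2.0]
generated = [0, 0, 1]  # token 0 appeared twice, token 1 once, token 2 never
print(apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.25))
# [0.75, 1.25, 2.0]
```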

KV Cache

The Key-Value (KV) cache is critical for inference efficiency:

  • During prefill, all K and V matrices for each attention layer are computed and stored
  • During decode, each new token only needs to compute its own K, V and attend to the cache
  • Without a KV cache, each decode step would rerun attention over all n previous tokens from scratch: O(n²) attention work per step
  • With a KV cache: O(n) attention work per new token
  • KV cache memory (bytes): 2 (K and V) × layers × KV heads × head_dim × seq_len × batch_size × bytes per value
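As a worked example of the memory formula, the sketch below plugs in a hypothetical model configuration (32 layers, 32 heads, head dimension 128, FP16 values). These numbers are illustrative and not taken from any specific model; `kv_heads` equals the number of attention heads in standard multi-head attention and is smaller in models that share K/V across heads.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache in bytes; the leading 2 accounts for K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration: 4096-token context, batch of 1, FP16 (2 bytes per value).
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
shared = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)
print(f"{full / 2**30:.1f} GiB")    # 2.0 GiB
print(f"{shared / 2**30:.1f} GiB")  # 0.5 GiB
```

The cache grows linearly with both sequence length and batch size, which is why long contexts and large batches compete for the same GPU memory.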

Inference Hardware

| Hardware | Use Case |
|----------|----------|
| NVIDIA A100/H100 | Production, large models |
| NVIDIA RTX 4090 | Local inference, small-medium models |
| Apple M-series | Consumer local inference (via Metal) |
| AWS Inferentia | AWS-optimized inference chips |
| Google TPU | Google Cloud inference |

Inference Optimization Techniques

| Technique | Description | Typical Benefit |
|-----------|-------------|-----------------|
| Quantization (INT8/INT4) | Reduce weight precision | 2–4× memory reduction (vs. FP16) |
| Speculative Decoding | A small draft model proposes tokens; the large model verifies them | 2–3× throughput |
| Continuous Batching | Dynamic batching of requests | High GPU utilization |
| Flash Attention | Memory-efficient attention computation | 2–4× faster attention |
| Tensor Parallelism | Split model across multiple GPUs | Enables larger models |
| vLLM / PagedAttention | Efficient KV cache memory management | High throughput |
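To illustrate the first row, here is a sketch of symmetric "absmax" INT8 quantization, one simple scheme among several. It is not the method of any particular library; the helper names and weight values are made up, and production quantizers usually work per channel or per group rather than over a whole tensor.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Each weight is stored in 1 byte instead of 2 (FP16), which is where the 2× memory reduction for INT8 comes from; the cost is the small rounding error measured above.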

Streaming

Rather than waiting for full output, streaming sends tokens to the user as they are generated:

  • Lower perceived latency (first token arrives quickly)
  • Better user experience for long responses
  • Supported by most major APIs, typically via server-sent events (SSE) or WebSocket
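In Python, streaming maps naturally onto generators. The sketch below is illustrative only: `generate_tokens` is a hypothetical stand-in for a model's decode loop, and `to_sse` wraps each token in a bare server-sent event frame. Real APIs usually send a JSON payload in the `data:` field rather than the raw token text, and tokens containing newlines would need extra handling.

```python
from typing import Iterable, Iterator


def generate_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a decode loop; yields tokens one at a time."""
    for token in ["Infer", "ence", " is", " fast", "."]:
        yield token


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a server-sent event frame as soon as it is available."""
    for token in tokens:
        yield f"data: {token}\n\n"


# The consumer can act on each frame immediately instead of waiting for the full text.
for frame in to_sse(generate_tokens()):
    print(frame, end="")
```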

Latency vs. Throughput

| Metric | Definition | Optimize For |
|--------|------------|--------------|
| Time to First Token (TTFT) | Time from request to first token | Real-time chat |
| Tokens per Second (TPS) | Generation speed | Long documents |
| Time per Output Token (TPOT) | Inverse of TPS | Batch jobs |
| End-to-End Latency | Total time for complete response | API benchmarking |
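These metrics can be computed from a request timestamp and the arrival time of each token. The sketch below uses one common convention, in which TPOT is averaged over the tokens after the first one; exact definitions vary between benchmarks, and the helper name and timestamps are made up for illustration.

```python
from typing import Dict, List


def latency_metrics(request_time: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, TPS, and end-to-end latency from timestamps in seconds."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_time
    end_to_end = token_times[-1] - request_time
    decode_tokens = len(token_times) - 1
    # TPOT here excludes the first token, whose latency is dominated by prefill.
    tpot = (token_times[-1] - token_times[0]) / decode_tokens if decode_tokens else 0.0
    tps = 1.0 / tpot if tpot > 0 else float("inf")
    return {"ttft": ttft, "tpot": tpot, "tps": tps, "end_to_end": end_to_end}


metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this example the first token takes 0.5 s (TTFT) while each later token takes about 0.1 s (TPOT), the typical pattern when prefill dominates the start of a response.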

Inference Serving Frameworks

  • vLLM — high-throughput server, PagedAttention
  • TGI (Text Generation Inference) — HuggingFace's production server
  • Ollama — local inference, user-friendly
  • llama.cpp — CPU/GPU inference, GGUF quantized models
  • TensorRT-LLM — NVIDIA's optimized inference framework

Related Concepts

  • Token, Context Window, Temperature, Sampling, KV Cache, Latency, Quantization, Streaming
