Latency

Definition

Latency is the time elapsed between submitting a prompt to an LLM and receiving the output. It encompasses network transmission, server-side queuing, model computation (prefill + decode), and response delivery. Latency is a critical quality-of-service metric for LLM applications.

Latency Components

```
Total Latency = Network (client→server)
              + Queue wait time
              + Prefill time (process input tokens)
              + Decode time (generate output tokens × N)
              + Network (server→client)
```
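
As a rough illustration of the sum above, the sketch below adds the components in milliseconds. The class name, field names, and sample numbers are assumptions made for illustration, not measurements of any real service.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Hypothetical per-request latency components, all in milliseconds."""
    network_up_ms: float        # client -> server
    queue_ms: float             # waiting for a free serving slot
    prefill_ms: float           # processing the input tokens
    decode_ms_per_token: float  # time to generate one output token
    output_tokens: int          # N in the formula
    network_down_ms: float      # server -> client

    def decode_ms(self) -> float:
        return self.decode_ms_per_token * self.output_tokens

    def total_ms(self) -> float:
        return (self.network_up_ms + self.queue_ms + self.prefill_ms
                + self.decode_ms() + self.network_down_ms)


# Illustrative numbers only.
example = LatencyBreakdown(
    network_up_ms=40, queue_ms=60, prefill_ms=250,
    decode_ms_per_token=20, output_tokens=200, network_down_ms=40,
)
print(example.decode_ms())  # 4000
print(example.total_ms())   # 4390
```

With these made-up values, decode time dominates the total, which is typical once the output runs to a few hundred tokens.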

Key Latency Metrics

Time to First Token (TTFT)

  • Time from sending request to receiving the first output token
  • Dominated by: network RTT + queue + prefill computation
  • Critical for perceived responsiveness in chat UIs
  • Streaming enables users to see output while generation continues

Time per Output Token (TPOT) / Tokens per Second (TPS)

  • How fast the model generates each subsequent token
  • TPOT = milliseconds per token; TPS = tokens per second (inverse)
  • Dominated by: memory bandwidth (reading KV cache + weights)
  • Typical: 20–150 TPS for hosted APIs
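
Because TPOT and TPS are inverses of each other, converting between them is a one-liner. The helper names below are hypothetical and exist only to make the unit handling explicit.

```python
def tpot_ms_to_tps(tpot_ms: float) -> float:
    """Convert milliseconds per output token to tokens per second."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


def tps_to_tpot_ms(tps: float) -> float:
    """Convert tokens per second to milliseconds per output token."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


print(tpot_ms_to_tps(20.0))  # 50.0
print(tps_to_tpot_ms(100.0))  # 10.0
```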

End-to-End Latency

  • Total time from request to complete response
  • ≈ TTFT + (output_tokens × TPOT); strictly (output_tokens − 1) × TPOT, since the first token is already counted in TTFT
  • Relevant for batch jobs and non-streaming use cases
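
A quick way to sanity-check a latency budget is to plug numbers into the end-to-end formula. This is a minimal sketch with a hypothetical function name and illustrative inputs; it uses the simple TTFT + output_tokens × TPOT form.

```python
def estimate_e2e_seconds(ttft_s: float, output_tokens: int, tpot_ms: float) -> float:
    """Estimate end-to-end latency as TTFT + output_tokens * TPOT."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_s + output_tokens * tpot_ms / 1000.0


# 0.5 s TTFT, 400 output tokens at 25 ms per token (illustrative values).
print(estimate_e2e_seconds(0.5, 400, 25.0))  # 10.5
```

Here the 10 seconds of decoding dwarf the half-second TTFT, which is why output length matters so much for non-streaming use cases.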

Time Between Tokens (TBT)

  • Interval between consecutive streamed tokens; its variance shows how consistent generation speed is
  • High variance → choppy streaming experience
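
To make the point about variance concrete, the sketch below computes inter-token gaps from a list of token arrival timestamps. The function names and both sample timelines are invented for illustration; the two streams have the same average speed but very different smoothness.

```python
import statistics


def inter_token_gaps_ms(arrival_times_s: list[float]) -> list[float]:
    """Gaps between consecutive token arrival times, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(arrival_times_s, arrival_times_s[1:])]


def tbt_summary(arrival_times_s: list[float]) -> dict:
    """Mean, population standard deviation, and max of the inter-token gaps."""
    gaps = inter_token_gaps_ms(arrival_times_s)
    if len(gaps) < 2:
        raise ValueError("need at least three token timestamps")
    return {
        "mean_ms": statistics.mean(gaps),
        "stdev_ms": statistics.pstdev(gaps),
        "max_ms": max(gaps),
    }


# Two made-up streams: both deliver 5 tokens in 80 ms overall.
steady = [0.00, 0.02, 0.04, 0.06, 0.08]
choppy = [0.00, 0.005, 0.01, 0.075, 0.08]

print(tbt_summary(steady))
print(tbt_summary(choppy))  # same mean gap, much larger stdev and max gap
```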

Typical Latency Ranges (2024)

| Model / Service | TTFT | TPS |
|-----------------|------|-----|
| GPT-4o | ~0.5–1 s | 40–80 |
| Claude 3.5 Sonnet | ~0.5–1.5 s | 50–100 |
| GPT-3.5 Turbo | ~0.3–0.5 s | 80–150 |
| Local LLaMA 3 8B (GPU) | ~0.1–0.3 s | 50–200 |
| Local LLaMA 3 70B (GPU) | ~0.3–0.8 s | 15–50 |

Factors Affecting Latency

Input Side

| Factor | Effect on Latency |
|--------|-------------------|
| Prompt length (tokens) | ↑ Prefill time roughly linearly; attention cost grows faster at long contexts |
| Context window usage | ↑ Prefill time |
| Prompt cache miss (no cached KV states) | ↑ TTFT significantly |

Model Side

| Factor | Effect on Latency |
|--------|-------------------|
| Model size (parameters) | ↑ Both TTFT and TPOT |
| Number of layers | ↑ TPOT |
| Attention head count | ↑ Memory bandwidth requirements |
| Quantization (fewer bits per weight) | ↓ TPOT significantly |

Output Side

| Factor | Effect on Latency |
|--------|-------------------|
| Output length (tokens) | ↑ Total latency linearly |
| max_tokens setting | Doesn't affect speed, only when to stop |

Infrastructure Side

| Factor | Effect on Latency |
|--------|-------------------|
| GPU memory bandwidth | ↓ TPOT when higher |
| GPU count (tensor parallel) | ↓ TPOT |
| Server load / queue depth | ↑ TTFT when high |
| Geographic region | ↑ Network RTT when far |
| Prompt caching hit | ↓ TTFT significantly |

Latency vs. Throughput Trade-off

| Optimization | Latency Effect | Throughput Effect |
|--------------|----------------|-------------------|
| Small batch size | ↓ (good for latency) | ↓ (bad for throughput) |
| Large batch size | ↑ (bad for latency) | ↑ (good for throughput) |
| Speculative decoding | ↓ Latency | ≈ Same or slightly better |
| Quantization (INT8/INT4) | ↓ Latency | ↑ Throughput |

Latency Reduction Techniques

Speculative Decoding

  • A small "draft" model generates K tokens quickly
  • The large "target" model verifies all K tokens in one forward pass
  • Accepted tokens are kept; rejected tokens fall back to the target's output
  • Net effect: typically 2–3× faster decoding, with output quality matching the target model
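
The draft-then-verify loop above can be sketched with toy "models" that map a token prefix to the next token. This is a simplified greedy variant under stated assumptions: the draft is accepted only on an exact match with the target, whereas sampling-based systems use a probabilistic accept/reject rule, and a real target model scores all K positions in one batched forward pass rather than one call per position. All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # 3. First mismatch: fall back to the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All k drafts accepted: the verification pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new_tokens]


# Toy models: the target counts upward modulo 10; the draft agrees
# except that it guesses wrong whenever the last token is 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=4))  # [2, 3, 4, 5]
```

In the example the first three drafted tokens are accepted and the fourth is replaced by the target's own choice, so the output is identical to what the target alone would have generated.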

Prompt Caching

  • Cache the KV states for repeated system prompts / documents
  • Cache hit → prefill cost for the cached portion drops by up to ~90%
  • Supported by: Claude (Anthropic), GPT-4o (OpenAI prompt caching)
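
Conceptually, a prompt cache lets the server skip prefill for a prefix it has already processed. The toy class below models only that bookkeeping by counting how many tokens need processing; it is not how any provider implements or exposes caching, and the class name and stored placeholder value are invented.

```python
class PrefixKVCache:
    """Toy stand-in for a prompt cache: maps a token prefix to a fake KV state."""

    def __init__(self) -> None:
        self._store: dict[tuple[int, ...], str] = {}

    def prefill(self, cached_prefix: list[int], suffix: list[int]) -> int:
        """Return how many tokens had to be processed for this request."""
        key = tuple(cached_prefix)
        if key in self._store:
            # Cache hit: the prefix state is reused, only the suffix is processed.
            return len(suffix)
        # Cache miss: process everything and remember the prefix state.
        self._store[key] = f"kv-state-for-{len(key)}-tokens"  # placeholder value
        return len(cached_prefix) + len(suffix)


cache = PrefixKVCache()
system_prompt = list(range(900))  # pretend 900-token shared system prompt
question_a = list(range(100))     # pretend 100-token user message
question_b = list(range(50))      # pretend 50-token user message

print(cache.prefill(system_prompt, question_a))  # 1000 (miss: full prefill)
print(cache.prefill(system_prompt, question_b))  # 50 (hit: suffix only)
```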

Quantization

  • INT8 or INT4 weights → smaller memory footprint → faster memory reads
  • Modest quality loss, significant speed gain
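
A back-of-the-envelope estimate shows why fewer bytes per weight translates into faster decoding. The sketch assumes decoding is purely memory-bandwidth bound at batch size 1, with every weight read once per generated token, and ignores KV cache reads, compute, and kernel overhead. The 8B parameter count and 1000 GB/s bandwidth are hypothetical inputs.

```python
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def min_tpot_ms(params_billion: float, precision: str, bandwidth_gb_s: float) -> float:
    """Lower bound on TPOT if every weight is read once per generated token."""
    weight_gb = params_billion * BYTES_PER_WEIGHT[precision]
    return weight_gb / bandwidth_gb_s * 1000.0


for precision in ("fp16", "int8", "int4"):
    tpot = min_tpot_ms(8.0, precision, bandwidth_gb_s=1000.0)
    print(f"{precision}: >= {tpot:.1f} ms/token")
# fp16: >= 16.0 ms/token
# int8: >= 8.0 ms/token
# int4: >= 4.0 ms/token
```

Real speedups are usually smaller than this bound suggests, since dequantization and other overheads are not free.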

Flash Attention

  • Reorders attention computation to be I/O efficient
  • Reduces memory bandwidth requirements for attention
  • 2–4× faster attention computation
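
The reordering relies on an online-softmax trick: keys and values are processed in blocks while a running maximum and normalizer are carried along, so the full score matrix never has to be materialized. The pure-Python sketch below shows only that numerical idea for a single query vector. It does not reproduce the GPU kernel or its memory behaviour, it omits the usual 1/sqrt(d) scaling, and the function names are illustrative.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q.k) weighted sum of values, holding all scores at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, computed block by block with a running max and normalizer."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))  # matches up to rounding
```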

Streaming

  • Does not reduce total latency; it lowers perceived latency
  • The first tokens appear as soon as they are generated, so users read while the rest generates

Smaller Models

  • Use the smallest model that meets quality requirements
  • Example: GPT-3.5 vs. GPT-4, where the larger model is roughly 5× the cost and the smaller one roughly 3× faster

Latency SLAs (Service Level Agreements)

Common real-world targets by use case:

| Use Case | TTFT Target | TPOT Target |
|----------|-------------|-------------|
| Interactive chat | < 500 ms | < 30 ms |
| Autocomplete / copilot | < 200 ms | < 20 ms |
| Async document processing | < 5 s | Any |
| Background batch jobs | No TTFT target | Throughput-focused |
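
Targets like these are easy to encode as a simple check in a test suite or alerting rule. The dictionary keys and function name below are hypothetical; the thresholds are copied from the table, with `None` standing in for "no target".

```python
TARGETS_MS = {
    # use case: (max TTFT in ms, max TPOT in ms); None means no target
    "interactive_chat": (500, 30),
    "autocomplete": (200, 20),
    "async_document_processing": (5000, None),
    "background_batch": (None, None),
}


def meets_targets(use_case: str, ttft_ms: float, tpot_ms: float) -> bool:
    """True if the measured TTFT and TPOT are both under the use case's targets."""
    max_ttft, max_tpot = TARGETS_MS[use_case]
    ttft_ok = max_ttft is None or ttft_ms < max_ttft
    tpot_ok = max_tpot is None or tpot_ms < max_tpot
    return ttft_ok and tpot_ok


print(meets_targets("interactive_chat", ttft_ms=420, tpot_ms=25))  # True
print(meets_targets("autocomplete", ttft_ms=420, tpot_ms=25))      # False
```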

Measuring and Monitoring Latency

  • LLM observability platforms: LangSmith, Helicone, Braintrust
  • Custom logging: record a timestamp at request, first chunk, and last chunk
  • Report p50 / p95 / p99 percentiles rather than averages, which hide tail latency
  • Test under realistic load (latency degrades under high concurrency)
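
A minimal sketch of the custom-logging approach: time any iterable of streamed chunks, then summarize many requests with percentiles. It assumes the iterable issues the request lazily on first iteration (otherwise take the request timestamp before making the call), and it uses a simple nearest-rank percentile. The function names and sample TTFT values are made up.

```python
import math
import time
from typing import Callable, Iterable, List


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream and record request, first-chunk, and last-chunk times."""
    t_request = clock()
    t_first = None
    t_last = None
    n_chunks = 0
    for _ in chunks:
        now = clock()
        if t_first is None:
            t_first = now
        t_last = now
        n_chunks += 1
    if t_first is None:
        raise ValueError("stream produced no chunks")
    return {"ttft_s": t_first - t_request,
            "e2e_s": t_last - t_request,
            "chunks": n_chunks}


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100) of a non-empty sample list."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[rank - 1]


# Ten made-up TTFT samples in ms, one of them a slow outlier.
ttfts_ms = [310, 290, 330, 300, 2900, 320, 305, 315, 295, 325]
print(sum(ttfts_ms) / len(ttfts_ms))  # 569.0 (the mean is skewed by the outlier)
print(percentile(ttfts_ms, 50))       # 310
print(percentile(ttfts_ms, 95))       # 2900
```

The mean of 569 ms describes none of the actual requests, while p50 and p95 together show both the typical case and the tail.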

Related Concepts

  • Inference, Token, Context Window, Streaming, KV Cache, Throughput, Quantization, Prompt Caching
