Latency — FDE@ProdAI Blog

Definition

Latency is the time elapsed between submitting a prompt to an LLM and receiving the output. It encompasses network transmission, server-side queuing, model computation (prefill + decode), and response delivery. Latency is a critical quality-of-service metric for LLM applications.

Latency Components

Total Latency = Network (client→server)
              + Queue wait time
              + Prefill time (process input tokens)
              + Decode time (per-token decode time × N output tokens)
              + Network (server→client)
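The sum above can be written out as a small sketch. The class name, field names, and sample numbers are illustrative assumptions, not measurements from any real service.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Per-request latency components in milliseconds (hypothetical names)."""
    network_up_ms: float         # client -> server
    queue_ms: float              # waiting for a free slot on the server
    prefill_ms: float            # processing all input tokens
    decode_ms_per_token: float   # time to generate one output token
    output_tokens: int           # N in the formula above
    network_down_ms: float       # server -> client

    def decode_ms(self) -> float:
        # Decode cost scales with the number of generated tokens.
        return self.decode_ms_per_token * self.output_tokens

    def total_ms(self) -> float:
        return (self.network_up_ms + self.queue_ms + self.prefill_ms
                + self.decode_ms() + self.network_down_ms)


# Illustrative numbers only.
example = LatencyBreakdown(
    network_up_ms=40, queue_ms=100, prefill_ms=250,
    decode_ms_per_token=20, output_tokens=200, network_down_ms=40,
)
print(example.decode_ms(), example.total_ms())
```

With these made-up numbers, decode accounts for most of the total, which is typical when the output is long relative to the prompt.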

Key Latency Metrics

Time to First Token (TTFT)

Time from sending request to receiving the first output token
Dominated by: network RTT + queue + prefill computation
Critical for perceived responsiveness in chat UIs
Streaming enables users to see output while generation continues

Time per Output Token (TPOT) / Tokens per Second (TPS)

How fast the model generates each subsequent token
TPOT is measured in milliseconds per token; TPS (tokens per second) is its inverse: TPS = 1000 / TPOT
Dominated by: memory bandwidth (reading KV cache + weights)
Typical: 20–150 TPS for hosted APIs

End-to-End Latency

Total time from request to complete response
≈ TTFT + (output_tokens × TPOT); approximate, because the first token is already counted in TTFT
Relevant for batch jobs and non-streaming use cases
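As a worked example of the TPOT/TPS relationship and the end-to-end formula above, here is a minimal sketch. The function names and the sample request are assumptions chosen for illustration.

```python
def tps_from_tpot(tpot_ms: float) -> float:
    """Convert milliseconds per token into tokens per second."""
    return 1000.0 / tpot_ms


def estimate_end_to_end_ms(ttft_ms: float, output_tokens: int, tpot_ms: float) -> float:
    """End-to-end estimate using the formula above: TTFT + output_tokens * TPOT."""
    return ttft_ms + output_tokens * tpot_ms


# Hypothetical request: 500 ms TTFT, 300 output tokens at 20 ms per token.
print(tps_from_tpot(20.0))                        # 50.0
print(estimate_end_to_end_ms(500.0, 300, 20.0))   # 6500.0
```

In this example TTFT is under 10% of the total, so for long outputs TPOT matters far more to end-to-end latency than TTFT does.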

Time Between Tokens (TBT)

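Gap between consecutive output tokens as seen by the client (also called inter-token latency)
Its variance measures how consistent generation speed is across tokens
High variance → choppy streaming experience, even when the average speed is fine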
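One way to quantify "choppy" is to compute the gaps between token arrival times and look at their spread. The sketch below assumes you already log a timestamp per received token; the function names and sample timestamps are hypothetical.

```python
import statistics


def inter_token_gaps_ms(arrival_times_s):
    """Gaps between consecutive token arrival timestamps, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(arrival_times_s, arrival_times_s[1:])]


def tbt_summary(arrival_times_s):
    gaps = inter_token_gaps_ms(arrival_times_s)
    return {
        "mean_ms": statistics.mean(gaps),
        "stdev_ms": statistics.pstdev(gaps),  # spread: higher means choppier
        "max_ms": max(gaps),                  # worst stall the user saw
    }


# Hypothetical timestamps (seconds): one smooth stream and one with a stall.
smooth = [0.00, 0.02, 0.04, 0.06, 0.08]
choppy = [0.00, 0.01, 0.02, 0.07, 0.08]
print(tbt_summary(smooth))
print(tbt_summary(choppy))
```

Both streams have the same mean gap, so an average-only dashboard would not distinguish them; the standard deviation and the maximum gap do.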

Typical Latency Ranges (2024)

| Model / Service | TTFT | TPS |
|----------------|------|-----|
| GPT-4o | ~0.5–1s | 40–80 |
| Claude 3.5 Sonnet | ~0.5–1.5s | 50–100 |
| GPT-3.5 Turbo | ~0.3–0.5s | 80–150 |
| Local LLaMA 3 8B (GPU) | ~0.1–0.3s | 50–200 |
| Local LLaMA 3 70B (GPU) | ~0.3–0.8s | 15–50 |

Factors Affecting Latency

Input Side

| Factor | Effect on Latency |
|--------|------------------|
| Prompt length (tokens) | ↑ Prefill time, roughly linearly (attention cost grows faster on very long prompts) |
| Context window usage | ↑ Prefill time |
| Prompt cache miss (no cached KV states) | ↑ TTFT significantly |

Model Side

| Factor | Effect on Latency |
|--------|------------------|
| Model size (parameters) | ↑ Both TTFT and TPOT |
| Number of layers | ↑ TPOT |
| Attention head count | ↑ Memory bandwidth requirements |
| Quantization (lower precision, e.g. INT8/INT4) | ↓ TPOT significantly |

Output Side

| Factor | Effect on Latency |
|--------|------------------|
| Output length (tokens) | ↑ Total latency linearly |
| max_tokens setting | Does not change generation speed; only caps where generation stops |

Infrastructure Side

| Factor | Effect on Latency |
|--------|------------------|
| GPU memory bandwidth | ↓ TPOT when higher |
| GPU count (tensor parallel) | ↓ TPOT |
| Server load / queue depth | ↑ TTFT when high |
| Geographic region | ↑ Network RTT when far |
| Prompt caching hit | ↓ TTFT significantly |

Latency vs. Throughput Trade-off

| Optimization | Latency Effect | Throughput Effect |
|-------------|---------------|-----------------|
| Small batch size | ↓ (good for latency) | ↓ (bad for throughput) |
| Large batch size | ↑ (bad for latency) | ↑ (good for throughput) |
| Speculative decoding | ↓ Latency | ≈ Same or slightly better |
| Quantization (INT8/INT4) | ↓ Latency | ↑ Throughput |

Latency Reduction Techniques

Speculative Decoding

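A small "draft" model generates K tokens quickly
The large "target" model verifies all K tokens in one forward pass
Draft tokens are accepted up to the first disagreement; from there the target's own token is used and the remaining draft tokens are discarded
Net effect: typically 2–3× faster decoding with no change in output quality; the gain depends on how often draft tokens are accepted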
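The draft-then-verify loop above can be sketched with toy models. This is the greedy-matching variant only (production systems that sample use a probabilistic accept/reject rule instead), and `target_model`, `draft_model`, and every other name here are made-up stand-ins, not a real library API.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real system does this in ONE
    #    batched forward pass; this toy calls the target once per position.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All k drafts accepted: the same target pass also yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(context, draft, target, k, max_new_tokens):
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[len(context):][:max_new_tokens]


def greedy(context, target, max_new_tokens):
    """Baseline: plain one-token-at-a-time decoding with the target model."""
    out = list(context)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out[len(context):]


# Toy "models" (hypothetical): deterministic functions of the context.
def target_model(ctx):
    return (ctx[-1] * 3 + 1) % 10


def draft_model(ctx):
    # Agrees with the target except when the context length is a multiple of 4.
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 10


print(generate([1], draft_model, target_model, k=4, max_new_tokens=12))
print(greedy([1], target_model, max_new_tokens=12))
```

Both calls print the same sequence: every emitted token is one the target would have chosen anyway, which is why the speedup does not cost quality.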

Prompt Caching

Cache the KV states for repeated system prompts / documents
Cache hit → prefill cost for the cached portion drops by roughly 90%; only the uncached suffix is processed in full
Supported by: Claude (Anthropic), GPT-4o (OpenAI prompt caching)
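To make the mechanism concrete, here is a toy prefix cache that only tracks how many prompt tokens still need prefill. It is a sketch under simplifying assumptions: real providers decide cache granularity and expiry themselves, and `PrefixCache` and its methods are hypothetical names, not any vendor's API.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prompt cache: remembers which token prefixes were already prefilled."""

    def __init__(self) -> None:
        # Maps a token prefix to a placeholder standing in for its KV state.
        self._store: Dict[Tuple[int, ...], str] = {}

    def tokens_to_prefill(self, prompt: List[int]) -> int:
        """Return how many prompt tokens still need prefill computation."""
        # Look for the longest cached prefix of this prompt.
        for end in range(len(prompt), 0, -1):
            if tuple(prompt[:end]) in self._store:
                return len(prompt) - end
        return len(prompt)

    def store(self, prefix: List[int]) -> None:
        self._store[tuple(prefix)] = "kv-state-placeholder"


system_prompt = list(range(1000))   # stand-in for a long shared system prompt
cache = PrefixCache()

first = system_prompt + [5001, 5002]
print(cache.tokens_to_prefill(first))    # 1002: nothing cached yet
cache.store(system_prompt)

second = system_prompt + [6001, 6002, 6003]
print(cache.tokens_to_prefill(second))   # 3: only the new suffix
```

The saving scales with the share of the prompt that is a shared prefix, so caching helps most with long, stable system prompts or documents followed by short user turns.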

Quantization

INT8 or INT4 weights → smaller memory footprint → faster memory reads
Modest quality loss, significant speed gain
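A back-of-the-envelope estimate shows why smaller weights decode faster. The sketch assumes single-sequence decoding that is purely memory-bandwidth-bound, with every weight read once per token; it ignores KV cache reads, batching, and kernel overheads. The 8B model size and 1000 GB/s bandwidth are hypothetical inputs.

```python
def decode_floor_ms(params_billion: float, bytes_per_param: float,
                    bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time if every weight is read once per token."""
    weight_gb = params_billion * bytes_per_param   # 1e9 params * bytes/param = GB
    return weight_gb / bandwidth_gb_s * 1000.0


# Hypothetical setup: 8B-parameter model, GPU with 1000 GB/s memory bandwidth.
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    floor = decode_floor_ms(8, bytes_per_param, 1000)
    print(f"{name}: >= {floor:.1f} ms/token, <= {1000 / floor:.0f} tokens/s")
```

Under these assumptions, halving the bytes per parameter halves the floor on TPOT. Real speedups are usually smaller because dequantization and other overheads are not free.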

Flash Attention

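Computes attention in tiles so the full attention matrix is never written out to GPU main memory (I/O-aware)
Reduces memory traffic for attention rather than the amount of arithmetic
Roughly 2–4× faster attention computation, with exact (not approximate) results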
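The core trick that makes tiling possible is an "online" softmax that can be updated block by block. The pure-Python sketch below shows only that recurrence for a single query vector. It is not the actual FlashAttention kernel, which also tiles queries and runs fused on the GPU; all function names are illustrative.

```python
import math
from typing import List, Sequence


def naive_attention(q: Sequence[float], keys, values) -> List[float]:
    """Reference: materialise all scores, softmax, then weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size: int = 2) -> List[float]:
    """Same result, but keys/values are consumed block by block (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])          # unnormalised output accumulator

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]


# Small made-up inputs: one query, five key/value pairs.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

The two functions agree up to floating-point rounding, but the tiled version never holds more than one block of scores at a time, which is what lets a GPU kernel keep its working set in fast on-chip memory.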

Streaming

Does not reduce total generation time; it lowers perceived latency
The first tokens appear as soon as they are generated (after TTFT), so users read while the rest is produced

Smaller Models

Use the smallest model that meets quality requirements
GPT-3.5 vs. GPT-4: GPT-3.5 is ~5× cheaper and ~3× faster

Latency SLAs (Service Level Agreements)

Common real-world targets by use case:

| Use Case | TTFT Target | TPOT Target |
|----------|-------------|-------------|
| Interactive chat | < 500ms | < 30ms |
| Autocomplete / copilot | < 200ms | < 20ms |
| Async document processing | < 5s | Any |
| Background batch jobs | No TTFT target | Throughput-focused |
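Targets like these are easy to encode as a check in a monitoring job. The sketch below copies the thresholds from the table; the dictionary keys and the `meets_sla` helper are hypothetical names.

```python
# use case: (TTFT target in ms, TPOT target in ms); None means no target
SLA_TARGETS_MS = {
    "interactive_chat": (500, 30),
    "autocomplete": (200, 20),
    "async_document_processing": (5000, None),
    "background_batch": (None, None),
}


def meets_sla(use_case: str, ttft_ms: float, tpot_ms: float) -> bool:
    """True if the measured TTFT and TPOT are both under the use case's targets."""
    ttft_target, tpot_target = SLA_TARGETS_MS[use_case]
    ttft_ok = ttft_target is None or ttft_ms < ttft_target
    tpot_ok = tpot_target is None or tpot_ms < tpot_target
    return ttft_ok and tpot_ok


# The same hypothetical measurement judged against two use cases.
print(meets_sla("interactive_chat", ttft_ms=350, tpot_ms=25))   # True
print(meets_sla("autocomplete", ttft_ms=350, tpot_ms=25))       # False
```

In practice you would feed this a percentile such as p95 rather than a single request's numbers.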

Measuring and Monitoring Latency

LangSmith, Helicone, Braintrust — LLM observability platforms
Custom logging: timestamp at request, first chunk, last chunk
p50 / p95 / p99 percentiles — don't rely on averages
Test under realistic load (latency degrades under high concurrency)
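The custom-logging and percentile points above can be sketched in a few lines. `fake_stream` is a stand-in that sleeps instead of calling a model, so no real API is involved; `measure` and `percentile` are hypothetical helper names, and the percentile uses the simple nearest-rank definition.

```python
import math
import time
from typing import Callable, Iterable, List


def measure(start_request: Callable[[], Iterable[str]]) -> dict:
    """Consume a stream and record timing at request, first chunk, and last chunk."""
    t0 = time.perf_counter()              # timestamp at request
    first = last = None
    chunks = 0
    for _chunk in start_request():
        now = time.perf_counter()
        if first is None:
            first = now                   # timestamp at first chunk
        last = now                        # timestamp at last chunk so far
        chunks += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    return {
        "ttft_ms": (first - t0) * 1000.0,
        "total_ms": (last - t0) * 1000.0,
        "chunks": chunks,
    }


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def fake_stream():
    """Stand-in for a streaming LLM response (hypothetical; sleeps simulate work)."""
    time.sleep(0.05)                      # simulated queue + prefill
    for word in ["Hello", ",", " world"]:
        yield word
        time.sleep(0.01)                  # simulated decode step


result = measure(fake_stream)
print(result)

# Percentiles over many requests; these TTFT samples are made up.
ttft_samples = [float(ms) for ms in range(1, 101)]
print(percentile(ttft_samples, 50), percentile(ttft_samples, 95), percentile(ttft_samples, 99))
```

To apply this to a real client, pass a function that issues the request and returns the chunk iterator, then collect `ttft_ms` across requests and report p50/p95/p99.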

Related Concepts

Inference, Token, Context Window, Streaming, KV Cache, Throughput, Quantization, Prompt Caching
