Definition
Streaming is the practice of delivering LLM output tokens to the user incrementally as they are generated, rather than waiting for the complete response. It is the UX pattern behind the characteristic "typing" appearance of responses in modern LLM chat interfaces.
Why Streaming Matters
Without streaming, a 500-token response at 50 TPS takes 10 seconds before the user sees anything. With streaming, the first token appears in ~0.5s and the user reads as generation continues.
```
Without streaming: [10 second wait] → entire response appears at once
With streaming:    [0.5s] → first token → token → token → token → ...
```
Psychologically, streaming feels dramatically faster because perceived latency depends on when the first token appears, even though total generation time is identical.
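The arithmetic behind the example can be written out as a small sketch. The function name is hypothetical, and the 0.5 s time to first token is the illustrative figure from the example, not a measurement.

```python
def time_to_first_visible_output(num_tokens: int,
                                 tokens_per_second: float,
                                 ttft_seconds: float,
                                 streaming: bool) -> float:
    """Seconds the user waits before anything appears on screen."""
    # Total generation time is the same with or without streaming.
    total_generation = num_tokens / tokens_per_second
    # With streaming the user sees output at the first token;
    # without it, only once the whole response has been generated.
    return ttft_seconds if streaming else total_generation


wait_without = time_to_first_visible_output(500, 50, 0.5, streaming=False)
wait_with = time_to_first_visible_output(500, 50, 0.5, streaming=True)
print(f"without streaming: {wait_without}s, with streaming: {wait_with}s")
```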
How Streaming Works
Server-Sent Events (SSE)
The standard HTTP-based streaming protocol for LLM APIs:
```
HTTP Response: Content-Type: text/event-stream

data: {"delta": {"text": "The "}}

data: {"delta": {"text": "capital "}}

data: {"delta": {"text": "of "}}

data: {"delta": {"text": "France "}}

data: {"delta": {"text": "is "}}

data: {"delta": {"text": "Paris."}}

data: [DONE]
```
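On the client side, consuming such a stream comes down to reading `data:` lines and decoding each payload. The following is a minimal sketch that assumes the simplified payload shape from the example (`{"delta": {"text": ...}}`) and a `[DONE]` sentinel. The function name `iter_sse_text` is hypothetical, and a full SSE parser would also handle multi-line `data:` fields, `event:` names, and reconnection.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each `data:` line."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel from the example
        event = json.loads(payload)
        yield event["delta"]["text"]


# Stand-in for lines read from an HTTP response body.
sample_lines = [
    'data: {"delta": {"text": "The "}}',
    "",
    'data: {"delta": {"text": "capital "}}',
    "",
    "data: [DONE]",
]
sample_text = "".join(iter_sse_text(sample_lines))
print(repr(sample_text))
```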
WebSockets
Bidirectional streaming for interactive applications (speech, real-time collaboration).
Streaming with Major APIs
OpenAI
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
) as stream:
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```
Anthropic Claude
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Count to 10"}],
) as stream:
    # text_stream yields only the text deltas of the response
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
AWS Bedrock (Converse Stream)
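```python
import boto3

# Uses your configured AWS credentials and region
bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse_stream(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "Count to 10"}]}],
)
for event in response["stream"]:
    if "contentBlockDelta" in event:
        # Tool-use deltas carry no "text" key, so fall back to an empty string
        print(event["contentBlockDelta"]["delta"].get("text", ""), end="", flush=True)
```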
Streaming Events Beyond Text
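Modern streaming APIs send structured events, not just text chunks. The event types listed here are those of Anthropic's Messages API; other providers use different names for similar events: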
| Event Type | Description |
|------------|-------------|
| message_start | Metadata about the response (model, id) |
| content_block_start | Beginning of a content block (text, tool_use, thinking) |
| content_block_delta | Incremental content update |
| content_block_stop | Block completed |
| message_delta | Token counts, stop reason |
| message_stop | Complete response finished |
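A client typically dispatches on the event type and folds the events into one result. The sketch below works on plain dictionaries loosely modeled on these event types. The exact field names (`index`, `delta`, `usage`) and the function name `consume_events` are assumptions for illustration, so check the provider's reference before relying on them.

```python
from typing import Any, Dict, Iterable


def consume_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold a sequence of streaming events into a single result."""
    result: Dict[str, Any] = {
        "model": None, "blocks": {}, "stop_reason": None,
        "output_tokens": None, "done": False,
    }
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            result["model"] = event["message"]["model"]
        elif kind == "content_block_start":
            result["blocks"][event["index"]] = ""  # open an empty block
        elif kind == "content_block_delta":
            result["blocks"][event["index"]] += event["delta"].get("text", "")
        elif kind == "message_delta":
            result["stop_reason"] = event["delta"].get("stop_reason")
            result["output_tokens"] = event.get("usage", {}).get("output_tokens")
        elif kind == "message_stop":
            result["done"] = True
        # content_block_stop needs no action in this sketch
    return result


sample_events = [
    {"type": "message_start", "message": {"id": "msg_1", "model": "example-model"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 2}},
    {"type": "message_stop"},
]
summary = consume_events(sample_events)
print(summary)
```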
Streaming with Tool Use
When the model calls a tool during streaming:
1. Stream: text tokens up to the tool call
2. Stream: the tool call parameters (JSON, incrementally)
3. Pause: the stream ends with a tool-use stop reason and the developer executes the tool
4. Resume: send the tool result back in a follow-up request and stream the rest of the response
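Because the tool call parameters arrive as partial JSON, the client usually buffers the fragments and parses them only once the tool-call block is complete. The sketch below illustrates steps 2 and 3 with no network calls. The `get_weather` tool, the `TOOLS` registry, and `run_streamed_tool_call` are hypothetical names invented for this example.

```python
import json
from typing import Callable, Dict, Iterable


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this example."""
    return f"Sunny in {city}"


TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def run_streamed_tool_call(name: str, json_fragments: Iterable[str]) -> str:
    """Buffer partial JSON, parse it once complete, then run the tool."""
    raw = "".join(json_fragments)   # individual fragments are not valid JSON
    arguments = json.loads(raw)     # parse only after the block has finished
    return TOOLS[name](**arguments)


# Fragments as they might arrive across several streaming events.
tool_result = run_streamed_tool_call("get_weather", ['{"ci', 'ty": "Par', 'is"}'])
print(tool_result)
```

The returned string is what would be sent back to the model as the tool result before streaming resumes.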
Building a Streaming UI
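```javascript
// React example; fetchStreamFromAPI is a placeholder for your own helper
// that returns an async iterable of { text } chunks.
import { useState } from "react";

function StreamingAnswer({ prompt }) {
  const [response, setResponse] = useState("");
  const streamResponse = async () => {
    setResponse(""); // clear any previous answer
    const stream = await fetchStreamFromAPI(prompt);
    for await (const chunk of stream) {
      setResponse((prev) => prev + chunk.text); // append each token
    }
  };
  return <button onClick={streamResponse}>{response || "Ask"}</button>;
}
```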
Streaming Performance Metrics
| Metric | What It Measures |
|--------|-----------------|
| Time to First Token (TTFT) | Latency before streaming starts |
| Tokens per second (TPS) | Streaming speed |
| Time Between Tokens (TBT) | Smoothness — high variance = choppy |
| Time to Last Token (TTLT) | Total response time |
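These metrics can all be derived from the arrival timestamp of each token. The sketch below assumes timestamps in seconds and computes TPS as tokens divided by total response time. Some teams instead measure TPS over the decode phase only, so treat that definition, like the function name, as an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List


def streaming_metrics(request_time: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TTLT, TPS, and TBT statistics from token arrival times."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens; high variance means choppy output.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    ttlt = token_times[-1] - request_time
    return {
        "ttft": token_times[0] - request_time,
        "ttlt": ttlt,
        "tps": len(token_times) / ttlt if ttlt > 0 else float("inf"),
        "tbt_mean": mean(gaps) if gaps else 0.0,
        "tbt_stdev": pstdev(gaps) if gaps else 0.0,
    }


# Five tokens: the first after 0.5 s, then one every 0.25 s.
metrics = streaming_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```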
Streaming in Production Systems
Backend Proxy Pattern
```
User ← SSE ← [Your Backend] ← SSE ← [LLM API]
```
Your backend can:
- Validate/filter output before streaming to user
- Apply guardrails mid-stream
- Log the complete response for observability
- Add rate limiting or auth
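One way to picture the backend's role is as a generator that sits between the upstream stream and the client. The sketch below relays chunks, checks the accumulated text against a guardrail, and logs the full output. `proxy_stream`, `is_allowed`, and the list used as a log are hypothetical stand-ins for whatever filter and observability tooling a real backend uses.

```python
from typing import Callable, Iterable, Iterator, List


def proxy_stream(upstream: Iterable[str],
                 is_allowed: Callable[[str], bool],
                 log: List[str]) -> Iterator[str]:
    """Relay chunks to the client, stopping if the guardrail rejects the text."""
    seen = ""
    try:
        for chunk in upstream:
            seen += chunk
            if not is_allowed(seen):
                # Guardrail tripped mid-stream: withhold this chunk and stop.
                yield "[response stopped by content filter]"
                break
            yield chunk
    finally:
        # Record everything received from upstream, even on early exit.
        log.append(seen)


audit_log: List[str] = []
relayed = list(proxy_stream(
    ["Hello ", "world ", "SECRET ", "more"],
    is_allowed=lambda text: "SECRET" not in text,
    log=audit_log,
))
print(relayed)
```

Checking the accumulated text rather than each chunk matters because a blocked phrase can be split across token boundaries.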
Mid-Stream Interruption
Users may want to stop generation early:
- Frontend sends abort signal
- Backend cancels the upstream API request
- API providers support request cancellation to stop billing
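A backend can model the abort signal as a flag that is checked between chunks. The sketch below uses a `threading.Event` for the signal and a plain generator in place of the upstream API stream. Both are assumptions for illustration, and with a real SDK, closing the stream or HTTP response plays the role of `close()` here.

```python
import threading
from typing import Iterable, Iterator, List


def stream_until_cancelled(upstream: Iterable[str],
                           cancelled: threading.Event) -> Iterator[str]:
    """Relay chunks until the abort signal is set, then close the upstream."""
    for chunk in upstream:
        if cancelled.is_set():
            # Drop the chunk just received and release the upstream request.
            close = getattr(upstream, "close", None)
            if callable(close):
                close()
            return
        yield chunk


upstream_closed: List[bool] = []


def fake_upstream() -> Iterator[str]:
    """Stand-in for an LLM API stream that notices when it is closed."""
    try:
        for token in ["a", "b", "c", "d"]:
            yield token
    finally:
        upstream_closed.append(True)


abort = threading.Event()
received: List[str] = []
for piece in stream_until_cancelled(fake_upstream(), abort):
    received.append(piece)
    if len(received) == 2:
        abort.set()  # the user clicked "stop"
print(received, upstream_closed)
```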
Streaming vs. Batch for Different Use Cases
| Use Case | Recommendation |
|----------|---------------|
| Interactive chat | Stream always |
| Document generation (user watching) | Stream |
| Background processing (no user) | Batch (simpler code) |
| Evaluation/testing pipelines | Batch |
| Voice synthesis (TTS) | Stream (chunk audio as text arrives) |
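For the voice case, a common approach is to buffer streamed tokens until a sentence boundary and hand each complete sentence to the TTS engine. The sketch below shows only the buffering step. The function name and the simple punctuation-based boundary rule are assumptions, and a production system would need smarter segmentation for abbreviations, numbers, and so on.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (a deliberately simple rule).
_SENTENCE_END = re.compile(r"([.!?])\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences suitable for speech synthesis."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end(1)  # position just after the punctuation mark
            yield buffer[:end].strip()
            buffer = buffer[end:].lstrip()
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


streamed = ["Hello", " there", ".", " How", " are", " you", "?", " Fine", "."]
spoken_chunks = list(sentences_from_tokens(streamed))
print(spoken_chunks)
```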
Streaming with Extended Thinking
Reasoning models stream differently:
1. Stream thinking tokens (may be hidden from the user, shown in dev tools)
2. Stream final response tokens
3. Thinking blocks complete before text blocks start
Claude's streaming API sends separate events for thinking vs. text content blocks.
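A client can route the two kinds of content to different places, for example a collapsible "thinking" panel and the main answer. The sketch below works on plain dictionaries that assume delta types named `thinking_delta` and `text_delta`. The sample events and the function name are illustrative, so verify the field names against the provider's streaming reference.

```python
from typing import Any, Dict, Iterable, Tuple


def split_thinking_and_text(events: Iterable[Dict[str, Any]]) -> Tuple[str, str]:
    """Separate streamed thinking content from the final answer text."""
    thinking, answer = [], []
    for event in events:
        if event["type"] != "content_block_delta":
            continue  # block start/stop events carry no content here
        delta = event["delta"]
        if delta["type"] == "thinking_delta":
            thinking.append(delta["thinking"])
        elif delta["type"] == "text_delta":
            answer.append(delta["text"])
    return "".join(thinking), "".join(answer)


reasoning_events = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "thinking"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "2 + 2 is 4. "}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1, "content_block": {"type": "text"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "The answer is 4."}},
    {"type": "content_block_stop", "index": 1},
]
thinking_text, answer_text = split_thinking_and_text(reasoning_events)
print(answer_text)
```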