Definition
The context window is the maximum number of tokens an LLM can process in a single forward pass — the total span of text the model can "see" at once, including the system prompt, conversation history, injected documents, and the output it is generating.
What Counts Toward the Context Window
```
[System Prompt tokens] + [Conversation History tokens] + [Injected Document tokens] + [User Input tokens] + [Output tokens]
= Total tokens consumed from context window
```
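Once the context window is full, nothing more fits: a request that exceeds the limit is either rejected by the API or must be trimmed by the application, and the model cannot see any content that was trimmed away.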
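The budget arithmetic above can be sketched in a few lines. The function name, the 128K window, and the example counts are illustrative assumptions rather than values from any particular API; in practice the counts would come from the model's own tokenizer.

```python
def remaining_output_budget(context_window, system_tokens, history_tokens,
                            document_tokens, user_tokens):
    """Return how many tokens are left for the model's output."""
    used = system_tokens + history_tokens + document_tokens + user_tokens
    # A negative remainder means the prompt alone overflows the window.
    return context_window - used


# Hypothetical request against a 128K-token window.
budget = remaining_output_budget(
    context_window=128_000,
    system_tokens=1_200,
    history_tokens=30_000,
    document_tokens=90_000,
    user_tokens=800,
)
print(budget)  # 6000
```

A result near zero or below it is the signal to trim history or inject fewer documents before sending the request.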
Context Window Sizes (2024–2025)
| Model | Context Window |
|-------|---------------|
| GPT-3.5 Turbo | 16K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Claude 3 Opus | 200K tokens |
| Gemini 1.5 Pro | 1M tokens (2M in later releases) |
| Gemini 1.5 Flash | 1M tokens |
| LLaMA 3.1 | 128K tokens |
| Mistral Large 2 | 128K tokens |
Tokens → Real-World Equivalent
| Token Count | Approximate Content |
|-------------|-------------------|
| 1K tokens | ~750 words / ~3 pages |
| 8K tokens | ~6,000 words / ~24 pages |
| 32K tokens | ~24,000 words / ~100 pages |
| 100K tokens | ~75,000 words / ~300 pages |
| 200K tokens | ~150,000 words / ~600 pages (one very long novel) |
| 1M tokens | ~750,000 words / ~3,000 pages (several long novels) |
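The conversions in this table follow from two rules of thumb, which the sketch below makes explicit. Both constants are assumptions: roughly 0.75 English words per token and 250 words per page. Real ratios vary with the tokenizer, the language, and the page layout.

```python
WORDS_PER_TOKEN = 0.75  # assumed rule of thumb for English text
WORDS_PER_PAGE = 250    # assumed page density


def tokens_to_words(tokens):
    """Estimate the word count for a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_pages(tokens):
    """Estimate the page count for a given number of tokens."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


for tokens in (1_000, 8_000, 32_000, 100_000):
    print(tokens, tokens_to_words(tokens), tokens_to_pages(tokens))
```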
How the Transformer Processes Context
The self-attention mechanism computes relationships between every pair of tokens in the context:
- Attention complexity: O(n²) where n = context length
- Every token can attend to every other token in the window (in decoder-only LLMs, to every earlier token, because of the causal mask)
- This is why larger context windows are computationally expensive
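To make the quadratic growth concrete, here is a minimal single-head attention sketch in plain Python. It assumes a decoder-style causal mask and omits the learned projections, multiple heads, and batching of a real transformer; the function name and the toy vectors are illustrative only.

```python
import math


def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v are lists of n vectors (lists of floats).
    Returns the n output vectors and the number of scores computed.
    """
    n, d = len(q), len(q[0])
    outputs, scores_computed = [], 0
    for i in range(n):
        # Token i attends to itself and every earlier token: i + 1 scores.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        scores_computed += len(scores)
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * v[j][c] for j, w in enumerate(weights))
                        for c in range(len(v[0]))])
    return outputs, scores_computed


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_4, count_4 = causal_self_attention(x, x, x)
out_8, count_8 = causal_self_attention(x + x, x + x, x + x)
print(count_4, count_8)  # 10 36
```

With the mask, n tokens need n(n+1)/2 scores, which is still O(n²): doubling the context from 4 to 8 tokens raises the count from 10 to 36.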
Strategies for Managing Context Limits
Sliding Window
- Keep only the most recent N tokens of conversation history
- The oldest messages are dropped when the context fills
- Simple, but loses early conversation context
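A sliding window over conversation history might look like the following sketch. The `count_tokens` callable is a placeholder for a real tokenizer; the word-count stand-in used here is only an assumption for the example.

```python
def sliding_window(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose total token count fits max_tokens."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# Crude stand-in for a tokenizer: one token per whitespace-separated word.
def word_count(message):
    return len(message.split())


history = ["my name is Ada", "hello Ada", "what is RAG",
           "RAG retrieves relevant chunks"]
recent = sliding_window(history, max_tokens=9, count_tokens=word_count)
print(recent)
# ['hello Ada', 'what is RAG', 'RAG retrieves relevant chunks']
```

The first message, which introduced the user's name, falls outside the window. That is the loss of early context this strategy trades for simplicity.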
Summarization
- Periodically summarize the conversation so far
- Replace the raw history with a compressed summary
- Retains key facts with fewer tokens
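One way to sketch summary-based compaction is shown below. The `summarize` callable is hypothetical: a real system would call an LLM there, and `fake_summarize` exists only so the example runs.

```python
def compact_history(messages, keep_recent, summarize):
    """Replace all but the last keep_recent messages with one summary message."""
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compress
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = summarize(older)
    return ["Summary of earlier conversation: " + summary] + recent


# Hypothetical summarizer; a real one would call a model.
def fake_summarize(messages):
    return f"{len(messages)} earlier messages condensed"


history = ["m1", "m2", "m3", "m4", "m5"]
compacted = compact_history(history, keep_recent=2, summarize=fake_summarize)
print(compacted)
# ['Summary of earlier conversation: 3 earlier messages condensed', 'm4', 'm5']
```

Keeping the most recent turns verbatim preserves exact wording where it matters most, while older turns survive only as the summary.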
Retrieval-Augmented Generation (RAG)
- Don't stuff full documents into the context
- Instead, retrieve only the relevant chunks at query time
- Keeps the context lean and focused
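The retrieval step can be illustrated with a deliberately naive scorer. Production systems typically rank chunks by embedding similarity; the word-overlap score below is a stand-in chosen so the sketch needs no external libraries, and the sample chunks are invented for the example.

```python
import re


def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks that share the most words with the query."""
    query_words = words(query)

    def overlap(chunk):
        return len(query_words & words(chunk))

    ranked = sorted(chunks, key=overlap, reverse=True)
    return [chunk for chunk in ranked[:top_k] if overlap(chunk) > 0]


chunks = [
    "A refund is issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
question = "How do I request a refund"
context = retrieve(question, chunks)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Only the two refund-related chunks reach the prompt; the unrelated one never consumes context tokens.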
Chunking
- Split long documents into overlapping chunks
- Retrieve and inject only the relevant chunks
- Standard approach for document Q&A
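Overlapping chunks can be produced with a fixed-size window that advances by less than its own length. This sketch splits on whitespace for simplicity; real pipelines often chunk by tokens, sentences, or document structure, and the sizes used here are arbitrary.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of chunk_size words.

    Each chunk repeats the last `overlap` words of the previous chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    all_words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(all_words), step):
        chunks.append(" ".join(all_words[start:start + chunk_size]))
        if start + chunk_size >= len(all_words):
            break  # this chunk already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
pieces = chunk_words(text, chunk_size=4, overlap=1)
print(pieces)
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.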
Cost Implications
- Input tokens (context) are typically cheaper than output tokens
- But very long contexts still add up significantly in cost
- Prompt caching (Claude, GPT-4o) amortizes the cost of repeated system prompts
Prompt Caching
- If the beginning of the prompt (system prompt + documents) is the same across requests, it can be cached
- Subsequent calls pay a reduced rate for the cached portion (roughly 10% of the input price on Claude, about 50% on GPT-4o)
- Critical for applications that inject large knowledge bases
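The savings can be estimated with simple arithmetic. The sketch below assumes cached tokens are billed at 10% of the normal input price, as stated above, and uses a hypothetical price of $3 per million input tokens. It ignores any surcharge for writing to the cache and the first, uncached call, so treat the result as an upper bound on the savings.

```python
def input_cost(total_tokens, cached_tokens, price_per_million, cached_rate=0.10):
    """Input cost in dollars when cached_tokens are billed at cached_rate."""
    fresh_tokens = total_tokens - cached_tokens
    billed = fresh_tokens + cached_tokens * cached_rate
    return billed * price_per_million / 1_000_000


# Hypothetical request: 100K-token prompt, of which 95K is a shared prefix.
without_cache = input_cost(100_000, 0, 3.00)
with_cache = input_cost(100_000, 95_000, 3.00)
print(round(without_cache, 4), round(with_cache, 4))
```

Under these assumptions the per-request input cost drops from about $0.30 to about $0.04, which is why caching matters most when a large, stable prefix is reused across many calls.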
Maximum vs. Effective Context Length
- Maximum context window: the hard limit
- Effective context length: how much context the model reliably uses
- For many models, effective < maximum (recall degrades at long contexts)
- "Needle in a haystack" evals test whether a model can find a fact buried in a long context
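A needle-in-a-haystack test can be sketched as follows. The `ask_model` callable is a placeholder for a real model call; `forgetful_model` is an invented stand-in that only "reads" the edges of its input, included so the example runs and shows what a depth-dependent failure looks like.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def needle_recall(ask_model, filler_sentences, needle, question, answer, depths):
    """Return {depth: bool} for whether the answer appears in the model's reply."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_sentences, needle, depth)
        reply = ask_model(context, question)
        results[depth] = answer.lower() in reply.lower()
    return results


# Stand-in model: it only "reads" the first and last 200 characters.
def forgetful_model(context, question):
    visible = context[:200] + context[-200:]
    return "7421" if "7421" in visible else "I don't know"


filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The vault code is 7421."
report = needle_recall(forgetful_model, filler, needle,
                       "What is the vault code?", "7421", depths=[0.0, 0.5, 1.0])
print(report)  # {0.0: True, 0.5: False, 1.0: True}
```

Real evaluations sweep both the depth and the total context length, then plot recall over that grid to estimate the effective context length.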
Related Terms
- Token, Prompt, RAG, Attention, Long Context, Summarization, Chunking, Prompt Caching
The Lost-in-the-Middle Problem
Research on long-context retrieval (Liu et al., 2023, "Lost in the Middle") shows that LLMs use information at the beginning and end of the context window more reliably than information in the middle, which is more likely to be overlooked:
```
[Beginning: strong recall] [Middle: weak recall] [End: strong recall]
```
Mitigation: place critical instructions at the start and end; use RAG to keep context focused.
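When several retrieved chunks are injected, one way to apply this mitigation is to reorder them so the strongest matches sit at the edges of the context. The sketch below assumes the input list is already sorted from most to least relevant; the function name is illustrative.

```python
def reorder_for_edges(chunks_by_relevance):
    """Put the most relevant chunks at the start and end, the weakest in the middle.

    chunks_by_relevance must be sorted from most to least relevant.
    """
    front, back = [], []
    for index, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5, ... go to the front; ranks 2, 4, 6, ... to the back.
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ordered = reorder_for_edges(["r1", "r2", "r3", "r4", "r5"])
print(ordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

The two best chunks (`r1`, `r2`) end up first and last, while the least relevant (`r5`) lands in the middle, where weak recall costs the least.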
Context Window vs. Memory
| Concept | Scope | Persistence |
|---------|-------|-------------|
| Context Window | Within a single request (chat history must be re-sent each turn) | Temporary: gone once the request completes |
| Fine-tuning (parametric memory) | Baked into weights | Persistent until the model is retrained |
| RAG (retrieval memory) | Retrieved at query time | Persistent in vector DB |
| External memory (DB) | Stored outside model | Persistent, explicit |