Context Window — FDE@ProdAI Blog

Definition

The context window is the maximum number of tokens an LLM can attend to at once: the total span of text the model can "see" when predicting the next token, including the system prompt, conversation history, injected documents, and the output generated so far.

What Counts Toward the Context Window

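[System Prompt tokens] + [Conversation History tokens] + [Injected Document tokens] + [User Input tokens] + [Output tokens]

= Total tokens consumed from context window

Once this total would exceed the window, something has to give: most APIs reject the request with an error, and applications that avoid the error by truncating history leave the model unable to see the dropped content.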
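
The sum above can be checked before each call. The following is a minimal sketch, assuming the token counts are already known; the `ContextBudget` class and every number in the example are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical helper: tracks how a context window is spent."""
    window: int         # model's maximum context window, in tokens
    system_prompt: int  # tokens in the system prompt
    history: int        # tokens in prior conversation turns
    documents: int      # tokens in injected documents
    user_input: int     # tokens in the current user message
    max_output: int     # tokens reserved for the response

    def used(self) -> int:
        return (self.system_prompt + self.history + self.documents
                + self.user_input + self.max_output)

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


# Illustrative numbers only.
budget = ContextBudget(window=128_000, system_prompt=1_500, history=90_000,
                       documents=20_000, user_input=2_000, max_output=4_096)
print(budget.used(), budget.remaining(), budget.fits())
```

Reserving the output tokens up front matters because the response shares the same window as the input.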

Context Window Sizes (2024–2025)

| Model | Context Window |
|-------|---------------|
| GPT-3.5 Turbo | 16K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Claude 3 Opus | 200K tokens |
| Gemini 1.5 Pro | up to 2M tokens |
| Gemini 1.5 Flash | 1M tokens |
| Llama 3.1 | 128K tokens |
| Mistral Large 2 | 128K tokens |

Tokens → Real-World Equivalent

| Token Count | Approximate Content (English text, ~250 words per page) |
|-------------|-------------------|
| 1K tokens | ~750 words / ~3 pages |
| 8K tokens | ~6,000 words / ~24 pages |
| 32K tokens | ~24,000 words / ~100 pages |
| 100K tokens | ~75,000 words / ~300 pages |
| 200K tokens | ~150,000 words / ~600 pages |
| 1M tokens | ~750,000 words / ~3,000 pages |

How the Transformer Processes Context

The self-attention mechanism computes relationships between every pair of tokens in the context:

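Attention complexity: O(n²) in time (and in memory for a naive implementation), where n = context length
Every token can attend to every other token (in decoder-only LLMs, to itself and every earlier token)
This is why larger context windows are computationally expensive: doubling the context roughly quadruples the attention cost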
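
To make the quadratic growth concrete, the sketch below counts the query-key score entries for a sequence of n tokens. The function name `attention_pairs` is hypothetical, and the count is per attention head per layer; real implementations add constant factors and optimizations that this ignores.

```python
def attention_pairs(n: int, causal: bool = False) -> int:
    """Number of query-key score entries for a sequence of n tokens."""
    if causal:
        # Each token attends to itself and to every earlier token.
        return n * (n + 1) // 2
    # Full (bidirectional) attention: every token against every token.
    return n * n


for n in (1_000, 8_000, 128_000):
    print(f"{n:>7,} tokens -> {attention_pairs(n):>18,} score entries")
```

Causal masking halves the count but leaves the growth quadratic, so doubling n still roughly quadruples the work.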

The Lost-in-the-Middle Problem

Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts," found that LLMs use information at the beginning and end of the context window more reliably than information in the middle, which is more likely to be overlooked:

[Beginning: strong recall] [Middle: weak recall] [End: strong recall]

Mitigation: place critical instructions at the start and end; use RAG to keep context focused.
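
One way to apply the mitigation to retrieved chunks is to reorder them so the strongest matches sit at the edges of the context. This is a minimal sketch, assuming the input list is already sorted from most to least relevant; the function name is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end, the least in the middle.

    Assumes chunks_by_relevance is sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... go to the front; ranks 2, 4... to the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


print(reorder_for_long_context(["A", "B", "C", "D", "E"]))
```

With five chunks ranked A (best) to E (worst), A lands first, B lands last, and E sits in the middle.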

Context Window vs. Memory

| Concept | Scope | Persistence |
|---------|-------|-------------|
| Context Window | Within one API call (a session persists only because history is resent) | Temporary: gone once the call ends unless the application resends it |
| Fine-tuning (parametric memory) | Baked into weights | Permanent until the model is retrained |
| RAG (retrieval memory) | Retrieved at query time | Persistent in vector DB |
| External memory (DB) | Stored outside model | Persistent, explicit |

Managing Context in Production

Sliding Window

Keep only the most recent N tokens of conversation history
Oldest messages are dropped when context fills
Simple but loses early conversation context
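
The sliding window can be sketched as a walk from the newest message backwards until the budget is spent. This is a sketch under stated assumptions: `approx_tokens` is a rough 4-characters-per-token heuristic standing in for a real tokenizer, and the message shape (`role` / `content` dicts) is assumed, not prescribed by any particular API.

```python
from typing import Callable

Message = dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def approx_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def sliding_window(messages: list[Message], max_tokens: int,
                   count_tokens: Callable[[str], int] = approx_tokens) -> list[Message]:
    """Keep the newest messages whose combined size fits in max_tokens."""
    kept: list[Message] = []
    total = 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break                        # this message and all older ones are dropped
        kept.append(msg)
        total += cost
    kept.reverse()                       # restore chronological order
    return kept
```

In practice the system prompt is usually kept outside this list so it is never dropped along with old turns.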

Summarization

Periodically summarize the conversation so far
Replace raw history with compressed summary
Retains key facts with fewer tokens
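
A rolling summary can be sketched as follows. The `summarize` argument is a hypothetical callable standing in for an LLM call that condenses older messages; nothing here is tied to a specific provider.

```python
from typing import Callable


def compact_history(messages: list[str], keep_recent: int,
                    summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last keep_recent messages with one summary message."""
    split = len(messages) - keep_recent
    if split <= 0:
        return list(messages)            # nothing old enough to summarize
    older, recent = messages[:split], messages[split:]
    summary = f"Summary of earlier conversation: {summarize(older)}"
    return [summary] + recent


# Stand-in summarizer for demonstration; a real one would call a model.
history = ["m1", "m2", "m3", "m4"]
print(compact_history(history, 2, lambda older: f"{len(older)} messages"))
```

Keeping the most recent turns verbatim preserves exact wording where it matters most, while the summary carries older facts in fewer tokens.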

RAG (Retrieval-Augmented Generation)

Don't stuff full documents into context
Instead, retrieve only the relevant chunks at query time
Keeps context lean and focused
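
The retrieval step can be illustrated with a toy ranker. This sketch scores chunks by bag-of-words cosine similarity, which is an assumption made to keep the example dependency-free; production systems typically use embedding vectors and a vector database instead. All function names are hypothetical.

```python
import math
from collections import Counter


def _vec(text: str) -> Counter:
    # Toy representation: lowercase word counts (stand-in for an embedding).
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = _vec(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vec(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject only the retrieved chunks, not the whole corpus."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Only the top k chunks reach the prompt, so context size stays bounded no matter how large the corpus grows.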

Chunking

Split long documents into overlapping chunks
Retrieve and inject only relevant chunks
Standard approach for document Q&A
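
Overlapping chunks can be produced with a fixed stride. In this minimal sketch, the list of "tokens" is whatever unit the caller supplies (whitespace-split words in the example, as a stand-in for real tokenizer output), and the function name and parameters are hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into chunks of chunk_size that share overlap tokens with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last chunk already reaches the end
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated tokens.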

Cost Implications

Input tokens (context) are typically cheaper per token than output tokens
But input is billed on every call, so a long context that is resent each turn adds up quickly
Prompt caching (Claude, GPT-4o) amortizes the cost of repeated system prompts

Prompt Caching (Extended Context Optimization)

If the beginning of the prompt (system prompt + documents) is identical across requests, that prefix can be cached
Subsequent calls pay a reduced rate for the cached portion: roughly 10% of the full input price on Claude (after a one-time surcharge to write the cache) and about 50% on GPT-4o
Critical for applications that inject large knowledge bases
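
The savings are easy to estimate. The sketch below assumes prices quoted per million tokens and a cached-read multiplier of 0.1 (the ~10% figure above); the prices in the example are placeholders, not any provider's actual rates, and the cache-write surcharge some providers apply is ignored.

```python
def request_cost(cached_tokens: int, fresh_tokens: int, output_tokens: int,
                 input_price: float, output_price: float,
                 cached_multiplier: float = 0.1) -> float:
    """Cost of one request in dollars; prices are per million tokens."""
    cached = cached_tokens * input_price * cached_multiplier  # discounted prefix
    fresh = fresh_tokens * input_price                        # uncached input
    out = output_tokens * output_price
    return (cached + fresh + out) / 1_000_000


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
no_cache = request_cost(0, 100_500, 500, input_price=3.0, output_price=15.0)
with_cache = request_cost(100_000, 500, 500, input_price=3.0, output_price=15.0)
print(f"without cache: ${no_cache:.3f}  with cache: ${with_cache:.3f}")
```

With a 100K-token shared prefix, almost all of the per-request cost sits in the prefix, which is why the cached rate dominates the total.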

Effective Context Length vs. Maximum

Maximum context window: the hard limit
Effective context length: how much context the model reliably uses
For many models, effective < maximum (degradation observed at long contexts)
"Needle in a haystack" evals test if models can find a fact buried in a long context
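
A needle-in-a-haystack test can be sketched in a few lines. Here `ask_model` is a hypothetical callable that wraps whatever model is under test, and the pass criterion (the expected string appears in the answer) is a simplifying assumption; real evals usually vary context length as well as depth.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert needle at a relative depth (0.0 = start, 1.0 = end) of the filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pos = round(depth * len(filler))
    return " ".join(filler[:pos] + [needle] + filler[pos:])


def run_needle_eval(ask_model: Callable[[str], str], filler: list[str],
                    needle: str, question: str, expected: str,
                    depths: list[float]) -> dict[float, bool]:
    """Return, for each depth, whether the model's answer contains the expected string."""
    results: dict[float, bool] = {}
    for depth in depths:
        prompt = f"{build_haystack(filler, needle, depth)}\n\nQuestion: {question}"
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results
```

Plotting the results across depths and context lengths shows where recall starts to degrade, which is one way to estimate the effective context length.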

Related Concepts

Token, Prompt, RAG, Attention, Long Context, Summarization, Chunking, Prompt Caching