Definition
Reasoning models are a class of LLMs that perform extended internal chain-of-thought before producing a final answer — trading increased inference compute and latency for significantly improved accuracy on complex reasoning tasks. Unlike prompting a model to "think step by step," reasoning models generate a hidden internal scratchpad (often thousands of tokens) as an integral part of their inference process.
The Core Paradigm Shift
Standard LLM: token budget fixed, answer generated directly
Reasoning Model: token budget is variable; model "thinks" as long as needed before answering
```
Standard:
  [User Question] → [1-pass generation] → [Answer]

Reasoning Model:
  [User Question] → [Internal thinking: 500–10,000 tokens of reasoning] → [Final Answer]
                     ↑ hidden from user / shown as collapsed block
```
Why Reasoning Models Work
1. More compute at inference = better answers — the "test-time compute scaling" insight
2. Each thinking token is real computation that conditions subsequent predictions
3. The model can explore, backtrack, verify, and self-correct within the thinking block
4. For hard problems, allocating more thinking tokens dramatically improves success rates
Key Reasoning Models (2024–2025)
| Model | Organization | Thinking Mechanism |
|-------|-------------|-------------------|
| o1 / o3 | OpenAI | Hidden chain-of-thought, RL-trained |
| o1-mini / o1-pro | OpenAI | Same mechanism; o1-mini is a smaller model, o1-pro spends more compute per query |
| Claude 3.7+ (extended thinking) | Anthropic | Visible thinking block |
| DeepSeek-R1 | DeepSeek | Open-weights reasoning model |
| Gemini 2.0 Flash Thinking | Google | Experimental thinking mode |
| QwQ-32B | Alibaba | Open-weights reasoning |
How They Are Trained
Reasoning models are not just prompted — they are trained differently:
1. Supervised warm-up: fine-tune on examples with explicit reasoning chains (CoT data)
2. Reinforcement learning: use outcome-based rewards (is the final answer correct?) rather than supervised imitation
   - The model is rewarded for correct answers, not for any specific reasoning format
   - RL allows the model to discover novel reasoning strategies
3. Process Reward Models (PRMs): in some pipelines, reward models that evaluate the quality of each reasoning step, not just the final answer
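The difference between outcome-based and process-based rewards can be sketched in a few lines. This is a toy illustration, not any lab's training code: `outcome_reward`, `process_reward`, and the step scores are hypothetical names and values, and exact string matching stands in for a real answer checker.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-based reward: 1.0 if the final answer matches, else 0.0.

    The reasoning text is deliberately ignored; only the result is scored.
    """
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(step_scores: list[float]) -> float:
    """Process-based reward: aggregate per-step quality scores.

    `step_scores` stands in for PRM output (one score in [0, 1] per
    reasoning step). Taking the minimum means a single weak step caps
    the reward; a mean or product would be other possible aggregations.
    """
    if not step_scores:
        return 0.0
    return min(step_scores)


# Hypothetical rollout: three reasoning steps, then a final answer.
steps = [0.9, 0.4, 0.8]  # made-up PRM scores, one per step
print(outcome_reward(" 42 ", "42"))  # 1.0
print(process_reward(steps))         # 0.4
```

The outcome reward says nothing about how the answer was reached, which is what leaves the model free to find its own reasoning strategies; the process reward gives denser feedback at the price of needing a step-level scorer.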
The RL training teaches the model:
- When to think more vs. move on
- How to backtrack from wrong paths
- Verification and self-correction strategies
- Breaking hard problems into sub-problems
The result is two distinct ways to scale capability:
- Training compute scaling: bigger models, more data (expensive, one-time)
- Test-time compute scaling: more thinking at inference (cheaper, per-query)
Thinking tokens behave as hidden context:
- The thinking/scratchpad tokens count against the context window
- They are typically not shown to end users (collapsed or hidden)
- In Claude's API, thinking blocks are returned as content blocks separate from the final response
- The model cannot "reopen" thinking after starting its final response
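Because thinking tokens occupy the context window, it helps to budget for them explicitly. The following is a minimal arithmetic sketch; `remaining_context` is a hypothetical helper, and the window size and token counts are made-up numbers rather than the limits of any particular model.

```python
def remaining_context(context_window: int, prompt_tokens: int,
                      thinking_tokens: int, answer_tokens: int) -> int:
    """Tokens left in the window after prompt, thinking, and answer.

    A negative result means the request would not fit.
    """
    used = prompt_tokens + thinking_tokens + answer_tokens
    return context_window - used


# All numbers below are hypothetical.
left = remaining_context(200_000, prompt_tokens=150_000,
                         thinking_tokens=40_000, answer_tokens=4_000)
print(left)  # 6000
```

A long prompt combined with a generous thinking budget can leave very little room for the answer, even though the user never sees the thinking tokens.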
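DeepSeek-R1, the open-weights model listed above, illustrates this training recipe:
- Trained with GRPO (Group Relative Policy Optimization), an RL variant that drops the separate value (critic) model
- Shows clear `<think>` reasoning blocks in output
- Performance competitive with o1 on math/code benchmarks
- Demonstrates RL-based reasoning training is reproducible at lower cost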
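The "group relative" part of GRPO can be illustrated with its advantage computation: several answers are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group instead of a learned value estimate. The sketch below covers only that step (no policy update, clipping, or KL term). The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    `eps` avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Answers that beat their group's average get a positive advantage and are reinforced; when all samples score the same, every advantage is zero and the group contributes no learning signal.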
Related topics: Chain of Thought, Inference, RLHF, Scaling Laws, Test-Time Compute, Benchmarks
Extended Thinking in Claude (Anthropic)
```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,  # total output cap; budget_tokens must be smaller
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # max tokens for thinking
    },
    messages=[{"role": "user", "content": "Solve this competition math problem..."}],
)

# Response contains:
# - thinking block (the scratchpad, may be shown or hidden)
# - text block (the final answer)
```
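To work with the two kinds of block separately, the response content can be split by type. This is a sketch under the assumption that each block exposes a `type` attribute plus `thinking` or `text`, matching the description above; `split_blocks` is a hypothetical helper, and the stand-in objects below replace a real API response so the example runs offline.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Separate thinking text from final-answer text in a list of blocks."""
    thinking, answer = [], []
    for block in content:
        if block.type == "thinking":
            thinking.append(block.thinking)
        elif block.type == "text":
            answer.append(block.text)
        # Any other block types are skipped in this sketch.
    return "\n".join(thinking), "\n".join(answer)


# Stand-in objects that mimic the assumed shape; with a real call,
# pass `response.content` instead.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Try small cases first..."),
    SimpleNamespace(type="text", text="The answer is 12."),
]
scratchpad, final = split_blocks(fake_content)
print(final)  # The answer is 12.
```

Keeping the scratchpad in its own variable makes it easy to log it for debugging while showing only the final answer to end users.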
Test-Time Compute Scaling
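A fundamental insight: you can trade compute for quality at inference time.
```
More thinking tokens → better answers (up to a point)
```
This creates a new axis of scaling beyond training: compute spent per query at inference, rather than once up front.
For hard problems (competition math, complex code), using 10× more thinking tokens can raise success rates substantially. A jump from 30% to 90% is an illustrative figure, not a guarantee: gains vary by model and task, and they flatten once the model has enough room to solve the problem.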
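One way to build intuition for why extra inference compute helps is an idealized model. It is a deliberate simplification and not how reasoning models work internally: assume `k` independent attempts, each correct with probability `p`, and a perfect verifier that recognizes a correct attempt. The function name and the value `p = 0.3` are chosen for illustration.

```python
def success_with_attempts(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


for k in (1, 2, 4, 7):
    print(k, round(success_with_attempts(0.3, k), 3))
# 1 0.3
# 2 0.51
# 4 0.76
# 7 0.918
```

Under these assumptions, moving from a 30% to a 90% success rate takes about seven times the compute, and each additional attempt adds less than the one before. Real thinking tokens are neither independent attempts nor backed by a perfect verifier, so actual curves differ, but the shape (steep early gains, then diminishing returns) matches the "up to a point" caveat.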
When Reasoning Models Shine
| Task | Benefit |
|------|---------|
| Competition mathematics (AMC/AIME) | High (long multi-step derivations) |
| Hard coding problems (LeetCode Hard) | High (algorithm design) |
| Multi-step logical reasoning | High |
| Scientific research problems | High |
| Complex planning | High |
| Simple Q&A / creative writing | Low / negative (wasted cost) |
| Fast chatbot responses | Not suitable (high latency) |
Latency and Cost Trade-offs
Reasoning models are expensive and slow. The figures below are rough, order-of-magnitude estimates:
| Model | Time to first token (TTFT) | Cost vs. standard model |
|-------|------|------------------|
| GPT-4o | ~1s | 1× |
| o1 | ~15–60s | ~5–10× |
| o3 | ~30–120s | ~50× |
| Claude + extended thinking | ~5–30s | ~2–5× |
Use reasoning models only when accuracy on hard tasks justifies the cost.
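That rule of thumb can be written as a simple expected-value check. The sketch below is one possible framing, not an established formula: `worth_using_reasoning` and all dollar figures are hypothetical, and latency is ignored entirely.

```python
def worth_using_reasoning(accuracy_gain: float, value_of_correct: float,
                          base_cost: float, cost_multiplier: float) -> bool:
    """True if the expected benefit exceeds the extra spend per query.

    accuracy_gain:    increase in success probability (0.4 = +40 points)
    value_of_correct: what one correct answer is worth, in dollars
    base_cost:        cost of one query on the standard model, in dollars
    cost_multiplier:  reasoning-model cost relative to the standard model
    """
    extra_cost = base_cost * (cost_multiplier - 1.0)
    expected_benefit = accuracy_gain * value_of_correct
    return expected_benefit > extra_cost


# Hypothetical numbers, in dollars per query.
print(worth_using_reasoning(0.40, 5.00, base_cost=0.01, cost_multiplier=10))  # True
print(worth_using_reasoning(0.01, 0.05, base_cost=0.01, cost_multiplier=10))  # False
```

The first call resembles a hard, high-value task where a large accuracy gain easily pays for a 10× cost; the second resembles simple Q&A, where the gain is negligible and the extra spend is wasted.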
Thinking Tokens Are Hidden Context
Reasoning vs. Chain-of-Thought Prompting
| Aspect | CoT Prompting | Reasoning Model |
|--------|---------------|-----------------|
| Mechanism | User adds "think step by step" to prompt | Model trained to think internally |
| Thinking visibility | Shown in output | Usually hidden |
| Quality | Moderate improvement | Large improvement on hard tasks |
| Cost | Normal token cost | 5–50× more tokens |
| Training required | No | Yes (RL training) |
| Backtracking | No (linear) | Yes (self-correction) |
DeepSeek-R1 (Open Weights)
DeepSeek-R1 is notable as an open-weights reasoning model (MIT license):