Definition

Reasoning models are a class of LLMs that perform extended internal chain-of-thought before producing a final answer — trading increased inference compute and latency for significantly improved accuracy on complex reasoning tasks. Unlike prompting a model to "think step by step," reasoning models generate a hidden internal scratchpad (often thousands of tokens) as an integral part of their inference process.

The Core Paradigm Shift

Standard LLM: generates the answer directly in a single pass, with no dedicated thinking phase

Reasoning model: the thinking budget is variable; the model "thinks" for as long as the problem requires (up to a configured limit) before answering

Standard:

[User Question] → [1-pass generation] → [Answer]

Reasoning Model:

[User Question] → [Internal thinking: 500–10,000 tokens of reasoning] → [Final Answer]

↑ hidden from user / shown as collapsed block

Why Reasoning Models Work

1. More compute at inference = better answers — the "test-time compute scaling" insight

2. Each thinking token is real computation that conditions subsequent predictions

3. The model can explore, backtrack, verify, and self-correct within the thinking block

4. For hard problems, allocating more thinking tokens dramatically improves success rates

Key Reasoning Models (2024–2025)

| Model | Organization | Thinking Mechanism |
|-------|-------------|-------------------|
| o1 / o3 | OpenAI | Hidden chain-of-thought, RL-trained |
| o1-mini / o1-pro | OpenAI | Same mechanism; o1-mini is a smaller model, o1-pro spends more compute per query |
| Claude 3.7 Sonnet and later (extended thinking) | Anthropic | Visible thinking block |
| DeepSeek-R1 | DeepSeek | Open-weights reasoning model |
| Gemini 2.0 Flash Thinking | Google | Experimental thinking mode |
| QwQ-32B | Alibaba | Open-weights reasoning |

How They Are Trained

Reasoning models are not just prompted — they are trained differently:

1. Supervised warm-up: fine-tune on examples with explicit reasoning chains (CoT data)

2. Reinforcement Learning: use outcome-based rewards (is the final answer correct?) rather than supervised imitation

- The model is rewarded for correct answers, not for any specific reasoning format

- RL allows the model to discover novel reasoning strategies

3. Process Reward Models (PRMs), used in some pipelines but not all: reward models that score the quality of each reasoning step, not just the final answer

The RL training teaches the model:

- When to think more vs. move on
- How to backtrack from wrong paths
- Verification and self-correction strategies
- Breaking hard problems into sub-problems
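
The outcome-based reward in step 2 can be sketched as a function that looks only at the final answer. This is a minimal illustration, not any lab's actual reward code: the `Answer:` marker convention and the names `extract_final_answer` and `outcome_reward` are assumptions made for the example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is hypothetical; real pipelines define their own.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Reward 1.0 for a correct final answer, 0.0 otherwise.

    The reasoning that precedes the answer is never inspected, so the
    model is free to use any reasoning format that reaches the right result.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


good = "Try x = 2: 4 + 3 = 7, not 9. Try x = 3: 6 + 3 = 9.\nAnswer: 3"
bad = "x is probably 2.\nAnswer: 2"
print(outcome_reward(good, "3"), outcome_reward(bad, "3"))
```

A process reward model would instead score each intermediate step, which requires step-level labels or a learned scorer and is not shown here.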

Extended Thinking in Claude (Anthropic)

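```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,  # must be larger than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # max tokens for thinking
    },
    messages=[{"role": "user", "content": "Solve this competition math problem..."}],
)
```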

Response contains:

- thinking block (the scratchpad, may be shown or hidden)

- text block (the final answer)
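
A short sketch of separating the two block types. It assumes each content block exposes a `type` attribute plus either `thinking` or `text`, which is the shape of the Anthropic Python SDK's content blocks; the helper name `split_blocks` is hypothetical, and the stand-in objects below only mimic that shape so the example runs without an API call.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Split response content into (thinking_text, answer_text)."""
    thinking_parts, answer_parts = [], []
    for block in content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            answer_parts.append(block.text)
        # Any other block type (for example redacted thinking) is skipped.
    return "\n".join(thinking_parts), "\n".join(answer_parts)


# Stand-in for response.content; with a real call, pass response.content.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Try small cases first..."),
    SimpleNamespace(type="text", text="The answer is 42."),
]
thinking, answer = split_blocks(fake_content)
print(answer)
```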

Test-Time Compute Scaling

A fundamental insight: you can trade compute for quality at inference time

More thinking tokens → better answers (up to a point)

This creates a new axis of scaling beyond training:

- Training compute scaling: bigger models, more data (expensive, one-time)
- Test-time compute scaling: more thinking at inference (cheaper, per-query)

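For hard problems (competition math, complex code), using 10× more thinking tokens can raise success rates dramatically; a jump from roughly 30% to 90% is an illustrative order of magnitude, and actual gains vary by model and benchmark.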
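
One simple way to see why extra inference compute helps is repeated sampling with a perfect verifier. This is a different mechanism from a single long chain of thought and is used here only as a toy model. If each independent attempt succeeds with probability p, at least one of k attempts succeeds with probability 1 - (1 - p)^k. The function name and the independence assumption are both illustrative.

```python
def success_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


# Success probability for a problem with a 30% single-attempt success rate.
for k in (1, 2, 4, 8, 16):
    print(f"k={k:2d}  success={success_at_k(0.3, k):.3f}")
```

The curve rises quickly and then flattens, which matches the "up to a point" caveat: each additional unit of compute buys less than the one before it.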

When Reasoning Models Shine

| Task | Benefit |
|------|---------|
| Competition mathematics (AMC/AIME) | High (multi-step derivations) |
| Hard coding problems (LeetCode Hard) | High (algorithm design) |
| Multi-step logical reasoning | High |
| Scientific research problems | High |
| Complex planning | High |
| Simple Q&A / creative writing | Low or negative (wasted cost) |
| Fast chatbot responses | Not suitable (high latency) |
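
The table can be turned into a simple routing rule that enables thinking only for task types where it pays off. The category names, the `TaskKind` enum, and the default budget are hypothetical choices for illustration, not part of any provider's API.

```python
from enum import Enum
from typing import Optional


class TaskKind(Enum):
    COMPETITION_MATH = "competition_math"
    HARD_CODING = "hard_coding"
    LOGIC = "multi_step_logic"
    SCIENCE = "scientific_research"
    PLANNING = "complex_planning"
    SIMPLE_QA = "simple_qa"
    CREATIVE = "creative_writing"
    CHAT = "fast_chat"


# Task kinds the table rates as high benefit.
HIGH_BENEFIT = {
    TaskKind.COMPETITION_MATH,
    TaskKind.HARD_CODING,
    TaskKind.LOGIC,
    TaskKind.SCIENCE,
    TaskKind.PLANNING,
}


def thinking_budget_for(kind: TaskKind, default_budget: int = 8000) -> Optional[int]:
    """Return a thinking budget in tokens, or None to skip thinking."""
    return default_budget if kind in HIGH_BENEFIT else None


print(thinking_budget_for(TaskKind.HARD_CODING))
print(thinking_budget_for(TaskKind.CHAT))
```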

Latency and Cost Trade-offs

Reasoning models are expensive and slow:

| Model | Time to first token (TTFT) | Cost vs. Standard |
|-------|------|------------------|
| GPT-4o | ~1s | 1× |
| o1 | ~15–60s | ~5–10× |
| o3 | ~30–120s | ~50× |
| Claude + extended thinking | ~5–30s | ~2–5× |

Use reasoning models only when accuracy on hard tasks justifies the cost.
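
A rough per-request cost estimate makes the trade-off concrete. The sketch assumes thinking tokens are billed at the output-token rate, which is how reasoning tokens are commonly priced; the prices in the example are placeholders rather than real list prices, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    input_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate request cost in dollars; thinking is billed as output."""
    output_tokens = thinking_tokens + answer_tokens
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
without_thinking = estimate_cost(1_000, 0, 500, 3.0, 15.0)
with_thinking = estimate_cost(1_000, 8_000, 500, 3.0, 15.0)
print(f"{without_thinking:.4f} vs {with_thinking:.4f}")
```

With these placeholder numbers, the same short answer costs roughly twelve times more once 8,000 thinking tokens are added, because output tokens dominate the bill.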

Thinking Tokens Are Hidden Context

- The thinking/scratchpad tokens count against the context window
- They are typically not shown to end users (collapsed or hidden)
- In Claude's API, thinking blocks are clearly separated from the final response
- The model cannot "reopen" thinking after starting its final response
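
Because thinking tokens share the context window with the prompt and the answer, it helps to check a request's budget up front. This sketch treats the window as a single shared limit, which is a simplification; the function name and the numbers in the example are hypothetical.

```python
def remaining_answer_tokens(
    context_window: int, prompt_tokens: int, thinking_budget: int
) -> int:
    """Tokens left for the final answer if the full thinking budget is used."""
    remaining = context_window - prompt_tokens - thinking_budget
    if remaining <= 0:
        raise ValueError("prompt plus thinking budget exhausts the context window")
    return remaining


# Hypothetical 200k-token window, 50k-token prompt, 10k-token thinking budget.
print(remaining_answer_tokens(200_000, 50_000, 10_000))
```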

Reasoning vs. Chain-of-Thought Prompting

| Aspect | CoT Prompting | Reasoning Model |
|--------|---------------|-----------------|
| Mechanism | User adds "think step by step" to prompt | Model trained to think internally |
| Thinking visibility | Shown in output | Usually hidden |
| Quality | Moderate improvement | Large improvement on hard tasks |
| Cost | Normal token cost | 5–50× more tokens |
| Training required | No | Yes (RL training) |
| Backtracking | No (linear) | Yes (self-correction) |

DeepSeek-R1 (Open Weights)

DeepSeek-R1 is notable as an open-weights reasoning model (MIT license):

- Trained with GRPO (Group Relative Policy Optimization), an RL method that replaces PPO's learned value model with rewards normalized within a group of sampled outputs
- Shows clear <think>...</think> reasoning blocks in output
- Performance competitive with o1 on math/code benchmarks
- Demonstrates RL-based reasoning training is reproducible at lower cost
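
The group-relative idea behind GRPO can be sketched in a few lines: sample several completions for the same prompt, score each one, and normalize every reward against the group's mean and standard deviation instead of consulting a learned value model. This shows only the advantage computation, not the full policy update (clipping, KL penalty), and the function name and `eps` term are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Completions that beat the group average receive a positive advantage and the rest a negative one; when all completions score the same, every advantage is zero and the group contributes no learning signal.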

Related Concepts

Chain of Thought, Inference, RLHF, Scaling Laws, Test-Time Compute, Benchmarks

Reasoning Models / Extended Thinking