
Reasoning Models / Extended Thinking

Definition

Reasoning models are a class of LLMs that perform extended internal chain-of-thought before producing a final answer — trading increased inference compute and latency for significantly improved accuracy on complex reasoning tasks. Unlike prompting a model to "think step by step," reasoning models generate a hidden internal scratchpad (often thousands of tokens) as an integral part of their inference process.

The Core Paradigm Shift

Standard LLM: generates the answer directly, with no separate reasoning phase

Reasoning Model: spends a variable number of tokens "thinking" (up to a configured budget) before it starts the answer

```
Standard:
[User Question] → [1-pass generation] → [Answer]

Reasoning Model:
[User Question] → [Internal thinking: 500–10,000 tokens of reasoning] → [Final Answer]
                   ↑ hidden from user / shown as collapsed block
```

Why Reasoning Models Work

1. More compute at inference = better answers — the "test-time compute scaling" insight

2. Each thinking token is real computation that conditions subsequent predictions

3. The model can explore, backtrack, verify, and self-correct within the thinking block

4. For hard problems, allocating more thinking tokens dramatically improves success rates

Key Reasoning Models (2024–2025)

| Model | Organization | Thinking Mechanism |
|-------|-------------|-------------------|
| o1 / o3 | OpenAI | Hidden chain-of-thought, RL-trained |
| o1-mini / o1-pro | OpenAI | Same approach; o1-mini is smaller and cheaper, o1-pro spends more compute per answer |
| Claude 3.7 Sonnet and later (extended thinking) | Anthropic | Visible thinking block |
| DeepSeek-R1 | DeepSeek | Open-weights reasoning model |
| Gemini 2.0 Flash Thinking | Google | Experimental thinking mode |
| QwQ-32B | Alibaba | Open-weights reasoning |

How They Are Trained

Reasoning models are not just prompted — they are trained differently:

1. Supervised warm-up: fine-tune on examples with explicit reasoning chains (CoT data)

2. Reinforcement Learning: use outcome-based rewards (is the final answer correct?) rather than supervised imitation

- The model is rewarded for correct answers, not for any specific reasoning format

- RL allows the model to discover novel reasoning strategies

3. Process Reward Models (PRMs): reward models that evaluate the quality of each reasoning step, not just the final answer
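
The difference between an outcome-based reward and a process-based reward can be sketched in a few lines. This is a toy illustration, not any lab's training code: `outcome_reward`, `process_reward`, and `toy_scorer` are hypothetical names, and a real PRM would be a learned model rather than a hand-written rule.

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-based reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process-based reward: mean of per-step quality scores in [0, 1]."""
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)


def toy_scorer(step: str) -> float:
    # Hypothetical stand-in for a learned PRM: favors steps containing an equation.
    return 1.0 if "=" in step else 0.0


steps = ["Let x = 3", "Then 2x = 6", "So the answer is six"]
print(outcome_reward(" 6 ", "6"))
print(process_reward(steps, toy_scorer))
```

The outcome reward ignores how the answer was reached, which is what leaves the model free to find its own reasoning strategies; the process reward gives denser feedback but depends on how good the step scorer is.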

The RL training teaches the model:

  • When to think more vs. move on
  • How to backtrack from wrong paths
  • Verification and self-correction strategies
  • Breaking hard problems into sub-problems

Extended Thinking in Claude (Anthropic)

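```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,  # must be larger than the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # max tokens for thinking
    },
    messages=[{"role": "user", "content": "Solve this competition math problem..."}],
)

# response.content contains:
# - a thinking block (the scratchpad, may be shown or hidden)
# - a text block (the final answer)
```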
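
To separate the scratchpad from the final answer, one option is to walk the content blocks by type. The sketch below assumes blocks expose `.type`, `.thinking`, and `.text` attributes, as the Anthropic Python SDK's thinking and text blocks do; `split_blocks` is a hypothetical helper, and mock objects stand in for a real `response.content` so the example runs without an API key.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Separate thinking text from answer text in a list of content blocks."""
    thinking, answer = [], []
    for block in content:
        if block.type == "thinking":
            thinking.append(block.thinking)
        elif block.type == "text":
            answer.append(block.text)
        # Other block types (for example redacted thinking) are skipped here.
    return "\n".join(thinking), "\n".join(answer)


# Mock stand-in for `response.content`; real blocks come from the SDK.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Try small cases first..."),
    SimpleNamespace(type="text", text="The answer is 42."),
]
scratchpad, final = split_blocks(fake_content)
print(final)
```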

Test-Time Compute Scaling

A fundamental insight: you can trade compute for quality at inference time.

```
More thinking tokens → better answers (up to a point)
```

This creates a new axis of scaling beyond training:

  • Training compute scaling: bigger models, more data (expensive, one-time)
  • Test-time compute scaling: more thinking at inference (cheaper, per-query)

For hard problems (competition math, complex code), using 10× more thinking tokens can raise the success rate substantially. An increase from 30% to 90% is an illustrative figure; actual gains vary by model and task.
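
One way to build intuition for these numbers is an idealized best-of-n model. This is a different and simpler mechanism than a single long thinking block: it assumes each independent attempt succeeds with probability `p` and that a perfect verifier picks a correct attempt whenever one exists. Under those assumptions, which real systems only approximate, the success rate is 1 - (1 - p)^n. The function name and the values are illustrative.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    return 1.0 - (1.0 - p) ** n


# Start from a 30% single-attempt success rate and add attempts.
for n in (1, 2, 4, 7, 10):
    print(f"n={n:2d}  success={best_of_n_success(0.3, n):.3f}")
```

In this toy model, seven attempts at `p = 0.3` are enough to pass 90%. Real gains depend on how correlated the attempts are and how reliable the verification step is.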

When Reasoning Models Shine

| Task | Benefit |
|------|---------|
| Competition mathematics (AMC/AIME) | High (multi-step derivations) |
| Hard coding problems (LeetCode Hard) | High (algorithm design) |
| Multi-step logical reasoning | High |
| Scientific research problems | High |
| Complex planning | High |
| Simple Q&A / creative writing | Low or negative (wasted cost) |
| Fast chatbot responses | Not suitable (high latency) |

Latency and Cost Trade-offs

Reasoning models are expensive and slow. The figures below are rough orders of magnitude and vary with the prompt and the thinking budget:

| Model | TTFT (time to first token) | Cost vs. standard model |
|-------|------|------------------|
| GPT-4o | ~1s | 1× |
| o1 | ~15–60s | ~5–10× |
| o3 | ~30–120s | ~50× |
| Claude + extended thinking | ~5–30s | ~2–5× |

Use reasoning models only when accuracy on hard tasks justifies the cost.
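
A back-of-the-envelope estimate shows where the extra cost comes from. The sketch assumes thinking tokens are billed at the output-token rate, and the prices are hypothetical placeholders rather than any provider's actual rates; `request_cost_usd` is an illustrative helper.

```python
def request_cost_usd(input_tokens, thinking_tokens, answer_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Estimate request cost, billing thinking tokens at the output rate."""
    output_tokens = thinking_tokens + answer_tokens
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
plain = request_cost_usd(1_000, 0, 500, 3.0, 15.0)
with_thinking = request_cost_usd(1_000, 8_000, 500, 3.0, 15.0)
print(f"plain: ${plain:.4f}  with thinking: ${with_thinking:.4f}")
print(f"ratio: {with_thinking / plain:.1f}x")
```

Because the answer itself is short, nearly all of the added cost in this example is the 8,000 thinking tokens.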

Thinking Tokens Are Hidden Context

  • The thinking/scratchpad tokens count against the context window
  • They are typically not shown to end users (collapsed or hidden)
  • In Claude's API, thinking blocks are clearly separated from the final response
  • The model cannot "reopen" thinking after starting its final response
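
Since thinking tokens share the context window and the output limit with the answer, it can help to check the arithmetic before sending a request. The helper below is a hypothetical sketch: it assumes the thinking budget must be smaller than `max_tokens` and that the prompt plus `max_tokens` must fit in the context window, and the numbers are made up for illustration.

```python
def answer_tokens_available(context_window, prompt_tokens, max_tokens, budget_tokens):
    """Return the output tokens left for the final answer if the full budget is used."""
    if budget_tokens >= max_tokens:
        raise ValueError("thinking budget must be smaller than max_tokens")
    if prompt_tokens + max_tokens > context_window:
        raise ValueError("prompt plus max_tokens exceeds the context window")
    return max_tokens - budget_tokens


# Hypothetical numbers: 200k-token window, 50k-token prompt.
print(answer_tokens_available(200_000, 50_000, 16_000, 10_000))
```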

Reasoning vs. Chain-of-Thought Prompting

| Aspect | CoT Prompting | Reasoning Model |
|--------|---------------|-----------------|
| Mechanism | User adds "think step by step" to prompt | Model trained to think internally |
| Thinking visibility | Shown in output | Usually hidden or collapsed |
| Quality | Moderate improvement | Large improvement on hard tasks |
| Cost | Normal token cost | 5–50× more tokens |
| Training required | No | Yes (RL training) |
| Backtracking | Rare (mostly linear) | Yes (self-correction) |
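
The "Mechanism" row can be made concrete by comparing the two request shapes. The sketch only builds dictionaries and does not call any API; the `thinking` field mirrors the Claude example earlier in this article, and the function names and question are hypothetical.

```python
QUESTION = "How many positive divisors does 360 have?"


def cot_prompt_request(question: str) -> dict:
    """CoT prompting: the instruction to reason lives in the prompt text."""
    prompt = f"{question}\n\nLet's think step by step."
    return {"messages": [{"role": "user", "content": prompt}]}


def reasoning_request(question: str, budget_tokens: int) -> dict:
    """Reasoning model: the prompt is unchanged; thinking is a request setting."""
    return {
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": question}],
    }


print(cot_prompt_request(QUESTION)["messages"][0]["content"])
print(reasoning_request(QUESTION, 4000)["thinking"])
```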

DeepSeek-R1 (Open Weights)

DeepSeek-R1 is notable as an open-weights reasoning model (MIT license):

  • Trained with GRPO (Group Relative Policy Optimization), an RL method that scores each sampled answer relative to the others in its group instead of training a separate value model
  • Shows clear `<think>...</think>` reasoning blocks in output
  • Performance competitive with o1 on math/code benchmarks
  • Demonstrates RL-based reasoning training is reproducible at lower cost
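
The core idea behind GRPO's "group relative" signal can be sketched as follows: sample several answers to the same prompt, score each one, and normalize each score against the group's mean and standard deviation. This is a simplified illustration of the advantage computation only, not DeepSeek's training code; the function name, the use of population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 = correct, 0.0 = incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, without a separate value model estimating a baseline.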

Related Concepts

  • Chain of Thought, Inference, RLHF, Scaling Laws, Test-Time Compute, Benchmarks
