Definition
In-Context Learning (ICL) is the emergent ability of LLMs to learn a new task or adapt to new patterns by reading examples provided in the prompt — without any gradient updates or weight changes to the model. The model "learns" from demonstrations purely through the forward pass of inference.
Key Distinction
| Learning Type | Weight Update? | When? | Examples Needed |
|--------------|---------------|-------|----------------|
| Pre-training | Yes | Before deployment | Trillions of tokens |
| Fine-tuning | Yes | Before deployment | Thousands of examples |
| In-Context Learning | No | At inference time | Typically 1–100 examples in prompt |
ICL is remarkable because it requires zero parameter updates: the model conditions on the examples in the prompt and adapts its outputs during ordinary inference.
Why ICL Works: The Mechanistic View
Research (Olsson et al., 2022) identified induction heads — attention heads that:
1. Match a pattern: "previous token A was followed by B"
2. When they see token A again, copy B as a high-probability next token
3. Chain together to implement more complex pattern completion
Olsson et al. hypothesize that, in larger models, the same mechanism generalizes from literal token pairs to more abstract input-output patterns, and may account for much of ICL.
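The match-and-copy rule in steps 1 and 2 can be written out as a toy function. This is only an illustration of the rule on literal tokens, not of how attention heads compute it inside a transformer; the function name `induction_predict` is hypothetical.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Toy induction rule: find the most recent earlier occurrence of the
    current (last) token and return the token that followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy "B" from the earlier "A B" pair
    return None  # no earlier occurrence, so the rule makes no prediction


# "A" was followed by "B" earlier, so seeing "A" again predicts "B".
print(induction_predict(["A", "B", "C", "A"]))
```

A real induction head produces a probability boost for B rather than a hard prediction, and step 3 (chaining heads) is not modelled here.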
ICL vs. Few-Shot Prompting
The terms are often used interchangeably, but they refer to different things:
- Few-shot prompting: the practical technique (providing examples in prompt)
- In-context learning: the theoretical phenomenon (how the model adapts from examples)
- Few-shot prompting leverages in-context learning
With very long context windows, few-shot prompting scales up to many-shot ICL:
- Provide hundreds or thousands of examples in the context
- Can approach fine-tuning quality without training
- Especially powerful for tasks with highly consistent format
Related terms: Few-Shot, Zero-Shot, Chain of Thought, Emergent Abilities, Context Window, Attention
What Models Actually Do During ICL
Current research suggests models don't fully "learn" the task — they:
1. Locate similar patterns from pre-training that match the examples
2. Infer the task format from the example structure
3. Adapt output format to match the demonstrated pattern
This means ICL is most effective for tasks that were represented in some form during pre-training.
ICL Performance Factors
What Makes ICL Work Well
| Factor | Effect |
|--------|--------|
| More examples | Generally better, up to context limit |
| High-quality examples | Critical — bad examples hurt performance |
| Diverse examples | Better generalization to test inputs |
| Consistent format | Clear pattern → better imitation |
| Representative examples | Match the distribution of test inputs |
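One way to act on the "representative examples" row is to pick, for each test input, the demonstrations most similar to it. The sketch below uses word-overlap (Jaccard) similarity purely as a simple stand-in for whatever similarity measure is available; `jaccard` and `select_examples` are hypothetical helper names, not part of any library.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_examples(pool: List[Tuple[str, str]], query: str,
                    k: int) -> List[Tuple[str, str]]:
    """Return the k (input, label) pairs whose input is most similar to query."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)
    return ranked[:k]


pool = [
    ("the battery died quickly", "negative"),
    ("great movie tonight", "positive"),
    ("battery life is great", "positive"),
]
print(select_examples(pool, "battery life is terrible", k=1))
```

Selecting only the nearest examples trades off against the "diverse examples" row, so in practice the two goals need to be balanced.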
What Makes ICL Fail
| Factor | Effect |
|--------|--------|
| Wrong label examples | Surprisingly small effect: on the classification and multiple-choice tasks in Min et al. (2022), random labels barely hurt, and format mattered more than label correctness |
| Inconsistent format | Model can't identify the pattern |
| Novel task type | Not seen in pre-training → ICL limited |
| Small model | ICL is an emergent ability requiring scale |
| Very long examples | Token budget exceeded before test input |
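The "very long examples" failure can be avoided by reserving room for the test input first and only then adding demonstrations. A minimal sketch follows, assuming a whitespace word count as a rough proxy for tokens; a real tokenizer's count function could be passed in instead. Both function names are hypothetical.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    return len(text.split())


def fit_examples(examples: List[str], query: str, budget: int,
                 count: Callable[[str], int] = count_tokens_approx) -> List[str]:
    """Keep examples in order until the budget left after the query runs out."""
    remaining = budget - count(query)  # reserve room for the test input first
    kept: List[str] = []
    for example in examples:
        cost = count(example)
        if cost > remaining:
            break  # stop at the first example that does not fit
        kept.append(example)
        remaining -= cost
    return kept


print(fit_examples(["a b c", "d e", "f g h i"], query="q r", budget=8))
```

Stopping at the first example that does not fit keeps the original ordering intact; skipping it and trying shorter ones would pack more examples at the cost of reordering.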
The Surprising Robustness of ICL
A counterintuitive finding from Min et al. (2022): on the classification and multiple-choice tasks they tested, wrong labels barely matter.
```
Standard few-shot:
Input: "I love this movie" → Label: Positive
Input: "Terrible experience" → Label: Negative

Random label few-shot:
Input: "I love this movie" → Label: Negative ← WRONG
Input: "Terrible experience" → Label: Positive ← WRONG

Performance: Nearly identical!
```
This suggests the model uses the examples mainly to pick up the format and task structure rather than the input-output mapping itself, which it draws from pre-trained knowledge.
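To try this comparison on your own task, the two prompt conditions can be built from the same demonstrations. The sketch below keeps the inputs fixed and replaces each label with one drawn uniformly at random from the label set; `randomize_labels` is a hypothetical helper, and the seed is there only to make runs repeatable.

```python
import random
from typing import List, Sequence, Tuple


def randomize_labels(examples: List[Tuple[str, str]],
                     label_set: Sequence[str],
                     seed: int = 0) -> List[Tuple[str, str]]:
    """Keep each input, but replace its label with a uniformly random one."""
    rng = random.Random(seed)  # local generator so global state is untouched
    return [(text, rng.choice(label_set)) for text, _ in examples]


gold = [("I love this movie", "Positive"), ("Terrible experience", "Negative")]
noisy = randomize_labels(gold, ["Positive", "Negative"])

# Inputs and format are identical across conditions; only labels may differ.
for (text, gold_label), (_, noisy_label) in zip(gold, noisy):
    print(f'Input: "{text}" | gold: {gold_label} | random: {noisy_label}')
```

Because labels are sampled rather than flipped, some random labels will happen to be correct; the comparison only means something when averaged over many demonstrations and test inputs.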
ICL in Practice
Classification Template
```
Sentiment analysis:
"Great product!" → positive
"Would not recommend" → negative
"Average, nothing special" → neutral
"Works exactly as described" →
```
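A template like this is usually assembled in code so that every demonstration follows exactly the same format. A minimal sketch, with `build_few_shot_prompt` as a hypothetical helper name:

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Render an instruction, labeled demonstrations, and an open query line."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f'"{text}" → {label}')
    lines.append(f'"{query}" →')  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Sentiment analysis:",
    [
        ("Great product!", "positive"),
        ("Would not recommend", "negative"),
        ("Average, nothing special", "neutral"),
    ],
    "Works exactly as described",
)
print(prompt)
```

Generating the prompt from data rather than writing it by hand makes the "consistent format" factor much easier to guarantee.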
Extraction Template
```
Extract the company and amount from financial news.
"Google acquired DeepMind for $400M" → {"company": "DeepMind", "amount": "$400M"}
"Microsoft paid $26B to acquire LinkedIn" → {"company": "LinkedIn", "amount": "$26B"}
"Amazon bought Whole Foods for $13.7B" → {"company": "Whole Foods", "amount": "$13.7B"}
"Salesforce completed its $27.7B purchase of Slack" →
```
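Because the demonstrations show JSON, the completion can be checked mechanically before it is used. The sketch below assumes the model returns only the JSON object for the open line; `parse_extraction` is a hypothetical helper, and the completion string is an illustrative stand-in for real model output.

```python
import json
from typing import Dict, Optional

EXPECTED_KEYS = {"company", "amount"}


def parse_extraction(completion: str) -> Optional[Dict[str, str]]:
    """Parse a completion and return it only if it matches the demonstrated
    shape: a JSON object with exactly the keys "company" and "amount"."""
    try:
        data = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None  # the model drifted away from the JSON format
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    return data


# Illustrative completion text, not real model output.
print(parse_extraction(' {"company": "Slack", "amount": "$27.7B"} '))
print(parse_extraction("Slack, $27.7B"))
```

Returning `None` on any mismatch keeps the caller's error handling in one place, for example retrying with more examples.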
Format Teaching
```
Convert to military time:
3:30 PM → 15:30
10:15 AM → 10:15
11:45 PM → 23:45
6:00 AM →
```
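For a deterministic format task like this one, a small reference implementation is useful for checking both the demonstrations and the model's completions. A minimal sketch, assuming inputs of the form `H:MM AM` or `H:MM PM`; `to_military` is a hypothetical helper name.

```python
def to_military(time_12h: str) -> str:
    """Convert a 12-hour time such as '3:30 PM' to 24-hour 'HH:MM'."""
    clock, period = time_12h.strip().split()
    hour_str, minute_str = clock.split(":")
    hour, minute = int(hour_str), int(minute_str)
    period = period.upper()
    if period not in ("AM", "PM") or not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError(f"not a 12-hour time: {time_12h!r}")
    # 12 AM maps to hour 0 and 12 PM stays 12; other PM hours gain 12.
    hour = hour % 12 + (12 if period == "PM" else 0)
    return f"{hour:02d}:{minute:02d}"


# Verify that the demonstrations in the template are themselves correct.
demos = {"3:30 PM": "15:30", "10:15 AM": "10:15", "11:45 PM": "23:45"}
for source, target in demos.items():
    assert to_military(source) == target
```

None of the three demonstrations shows a single-digit morning hour, so the prompt alone does not tell the model whether to pad the hour with a leading zero; this checker does pad.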
ICL vs. Fine-Tuning: When to Use Which
| Use ICL | Use Fine-Tuning |
|---------|----------------|
| Task requirements change frequently | Consistent, stable task |
| Small number of examples (<100) | Large dataset available (1K+) |
| Prototyping and experimentation | Production, high-volume |
| Token budget not critical | Token efficiency matters |
| Don't want training overhead | Can afford training compute |
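The token-efficiency row comes down to simple arithmetic: with ICL the demonstrations are sent again on every request, whereas a fine-tuned model needs none in the prompt. The sketch below makes that overhead explicit; the function name and all numbers are hypothetical, chosen only to show the calculation.

```python
def icl_prompt_overhead(num_examples: int, tokens_per_example: int,
                        num_requests: int) -> int:
    """Total extra prompt tokens spent on demonstrations across all requests."""
    return num_examples * tokens_per_example * num_requests


# Hypothetical workload: 50 demonstrations of about 40 tokens each,
# repeated on 10,000 requests.
overhead = icl_prompt_overhead(num_examples=50, tokens_per_example=40,
                               num_requests=10_000)
print(f"{overhead:,} demonstration tokens")  # 20,000,000 demonstration tokens
```

Whether that overhead outweighs the cost of a training run depends on pricing and request volume, which this sketch does not model.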
ICL Scaling: Many-Shot ICL
Recent trend: many-shot ICL, which uses very long context windows (e.g., Gemini's 1M-token context) to fit far more examples into the prompt.