In-Context Learning (ICL) — FDE@ProdAI Blog

Definition

In-Context Learning (ICL) is the emergent ability of LLMs to learn a new task or adapt to new patterns by reading examples provided in the prompt — without any gradient updates or weight changes to the model. The model "learns" from demonstrations purely through the forward pass of inference.

Key Distinction


ICL is remarkable because it happens with zero parameter updates — the model processes examples and adapts its outputs in a single forward pass.

Why ICL Works: The Mechanistic View

Research (Olsson et al., 2022) identified induction heads — attention heads that:

1. Match a pattern: "previous token A was followed by B"

2. When they see token A again, copy B as a high-probability next token

3. Chain together to implement more complex pattern completion

At sufficient scale, induction heads are hypothesized to generalize from copying single token pairs to completing more abstract input-output patterns, which would account for much of full ICL.
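The match-and-copy behavior in steps 1 and 2 can be sketched in a few lines of Python. This is a toy under loud assumptions: it uses exact token matching over a list, whereas real induction heads are learned attention patterns, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Toy induction-style rule: find the most recent earlier occurrence
    of the final token and return the token that followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions (the final token itself is excluded).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy what followed it last time
    return None  # no earlier occurrence, so there is nothing to copy


print(induction_predict(["A", "B", "C", "A"]))  # B
print(induction_predict(["x", "y", "z"]))       # None
```

The sketch covers only the literal "A was followed by B" case; the chaining in step 3 is not modeled here.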

ICL vs. Few-Shot Prompting

These terms are often used interchangeably, but they are not the same thing:

Few-shot prompting: the practical technique (providing examples in the prompt)
In-context learning: the underlying phenomenon (how the model adapts to those examples)
Few-shot prompting leverages in-context learning

What Models Actually Do During ICL

Current research suggests models don't fully "learn" the task — they:

1. Locate similar patterns from pre-training that match the examples

2. Infer the task format from the example structure

3. Adapt output format to match the demonstrated pattern

This means ICL is most effective for tasks that were represented in some form during pre-training.

ICL Performance Factors

What Makes ICL Work Well

| Factor | Effect |
|--------|--------|
| More examples | Generally better, up to the context limit |
| High-quality examples | Critical: bad examples hurt performance |
| Diverse examples | Better generalization to test inputs |
| Consistent format | Clear pattern → better imitation |
| Representative examples | Match the distribution of test inputs |

What Makes ICL Fail

| Factor | Effect |
|--------|--------|
| Wrong-label examples | Surprisingly minor: random labels hurt only slightly, and format matters more than label correctness (Min et al., 2022) |
| Inconsistent format | Model can't identify the pattern |
| Novel task type | Not seen in pre-training → ICL limited |
| Small model | ICL is an emergent ability that requires scale |
| Very long examples | Token budget is exhausted before the test input fits |
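The token-budget failure in the last row can be guarded against before the prompt is sent. Below is a minimal sketch, assuming a crude whitespace word count stands in for a real tokenizer; `approx_tokens` and `fit_examples` are hypothetical helper names, not part of any library.

```python
def approx_tokens(text):
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    # A real deployment would count with the model's own tokenizer.
    return len(text.split())


def fit_examples(examples, test_input, budget):
    """Keep examples in order for as long as the prompt, including the
    test input, stays within the budget."""
    used = approx_tokens(test_input)  # reserve room for the test input first
    kept = []
    for example in examples:
        cost = approx_tokens(example)
        if used + cost > budget:
            break  # stop at the first example that would overflow
        kept.append(example)
        used += cost
    return kept


examples = [
    '"Great product!" → positive',
    '"Would not recommend" → negative',
    '"Average, nothing special" → neutral',
]
kept = fit_examples(examples, '"Works exactly as described" →', budget=15)
print(len(kept))  # 2
```

Reserving space for the test input first means the demonstrations are what gets trimmed, never the question itself.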

The Surprising Robustness of ICL

A counterintuitive finding: wrong labels barely matter

Standard few-shot:

Input: "I love this movie" → Label: Positive

Input: "Terrible experience" → Label: Negative

Random label few-shot:

Input: "I love this movie" → Label: Negative ← WRONG

Input: "Terrible experience" → Label: Positive ← WRONG

Performance: only marginally lower in the settings Min et al. (2022) tested.

This suggests the model uses the examples mainly to pick up the format and task structure rather than the input-output mapping itself; for the mapping, it draws on pre-trained knowledge.
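To try this comparison on your own task, you need a gold-label prompt and a random-label prompt that differ only in the labels. The sketch below builds both. It assumes labels are drawn uniformly from the label set, and `randomize_labels` and `render` are illustrative names rather than code from the paper.

```python
import random


def randomize_labels(demos, label_set, seed=0):
    """Return a copy of (text, label) demos with each label replaced by one
    drawn uniformly from label_set. Inputs and order are left unchanged."""
    rng = random.Random(seed)  # seeded so the ablation is reproducible
    return [(text, rng.choice(label_set)) for text, _ in demos]


def render(demos):
    """Render demos in the same Input/Label format for both conditions."""
    return "\n".join(f'Input: "{text}" → Label: {label}' for text, label in demos)


gold = [("I love this movie", "Positive"), ("Terrible experience", "Negative")]
shuffled = randomize_labels(gold, ["Positive", "Negative"], seed=1)

print(render(gold))
print(render(shuffled))
```

Because the labels are sampled independently, some of them will still be correct by chance. That differs from deliberately flipping every label.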

ICL in Practice

Classification Template

Sentiment analysis:

"Great product!" → positive

"Would not recommend" → negative

"Average, nothing special" → neutral

"Works exactly as described" →

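A template like the one above is easier to keep consistent when it is generated rather than hand-typed. Here is a minimal sketch; the `build_prompt` helper and its parameters are assumptions for illustration.

```python
def build_prompt(task, demos, query):
    """Render a task header, one line per demonstration, and an open slot
    for the query, all in the same quoted-input → label format."""
    lines = [f"{task}:"]
    lines += [f'"{text}" → {label}' for text, label in demos]
    lines.append(f'"{query}" →')  # left open for the model to complete
    return "\n".join(lines)


demos = [
    ("Great product!", "positive"),
    ("Would not recommend", "negative"),
    ("Average, nothing special", "neutral"),
]
print(build_prompt("Sentiment analysis", demos, "Works exactly as described"))
```

Every demonstration and the final query go through the same format string, so the "consistent format" condition holds by construction.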
Extraction Template

Extract the acquired company and the deal amount from financial news.

"Google acquired DeepMind for $400M" → {"company": "DeepMind", "amount": "$400M"}

"Microsoft paid $26B to acquire LinkedIn" → {"company": "LinkedIn", "amount": "$26B"}

"Amazon bought Whole Foods for $13.7B" → {"company": "Whole Foods", "amount": "$13.7B"}

"Salesforce completed its $27.7B purchase of Slack" →

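When the demonstrations emit JSON, the model's completion can be checked mechanically before anything downstream trusts it. The sketch below assumes the completion arrives as a plain string. `parse_extraction` is a hypothetical helper, and the sample completions are made up for illustration.

```python
import json

REQUIRED_KEYS = {"company", "amount"}  # the keys shown in the demonstrations


def parse_extraction(completion):
    """Parse a completion as JSON and check that it has exactly the
    demonstrated keys. Returns the dict, or None if it is off-format."""
    try:
        data = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    return data


# Hypothetical completions for the final line of the template.
print(parse_extraction(' {"company": "Slack", "amount": "$27.7B"}'))
print(parse_extraction("Slack, $27.7B"))  # None: not JSON
```

Returning `None` for off-format output gives the caller a clear signal to retry or fall back.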
Format Teaching

Convert to 24-hour time:

3:30 PM → 15:30

10:15 AM → 10:15

11:45 PM → 23:45

6:00 AM →
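For a deterministic conversion like this one, a small reference implementation is handy for checking the model's completions. This is a sketch under one stated assumption: the output is zero-padded to `HH:MM`, which none of the three demonstrations actually settles, since they never show a single-digit hour. The name `to_24_hour` is illustrative.

```python
def to_24_hour(text):
    """Convert a 12-hour time such as '3:30 PM' to 24-hour 'HH:MM'."""
    clock, period = text.strip().split()
    hour, minute = (int(part) for part in clock.split(":"))
    period = period.upper()
    if period not in ("AM", "PM") or not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError(f"not a valid 12-hour time: {text!r}")
    # 12 AM maps to hour 0 and 12 PM stays 12; other PM hours gain 12.
    hour = hour % 12 + (12 if period == "PM" else 0)
    return f"{hour:02d}:{minute:02d}"  # assumption: zero-padded hours


for sample in ["3:30 PM", "10:15 AM", "11:45 PM", "6:00 AM"]:
    print(sample, "→", to_24_hour(sample))
```

If the model answers `6:00` rather than `06:00`, that is an ambiguity in the demonstrations, not necessarily a model error. Adding one example with a single-digit morning hour would pin the format down.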

ICL vs. Fine-Tuning: When to Use Which

| Use ICL | Use Fine-Tuning |
|---------|----------------|
| Task requirements change frequently | Consistent, stable task |
| Small number of examples (<100) | Large dataset available (1K+) |
| Prototyping and experimentation | High-volume production |
| Token budget not critical | Token efficiency matters |
| Don't want training overhead | Can afford training compute |
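The token-efficiency row comes down to simple arithmetic: with ICL, every request pays for the demonstrations again. Here is a rough sketch of that overhead. All numbers are illustrative placeholders, not measurements, and the function names are made up.

```python
def icl_prompt_tokens(n_examples, tokens_per_example, tokens_per_query, n_requests):
    """Total prompt tokens when every request repeats the same demonstrations."""
    per_request = n_examples * tokens_per_example + tokens_per_query
    return per_request * n_requests


def demo_overhead_fraction(n_examples, tokens_per_example, tokens_per_query):
    """Share of each prompt that is spent on demonstrations."""
    demo_tokens = n_examples * tokens_per_example
    return demo_tokens / (demo_tokens + tokens_per_query)


# Illustrative: 20 examples of 50 tokens each, a 30-token query, 1,000 requests.
print(icl_prompt_tokens(20, 50, 30, 1000))            # 1030000
print(round(demo_overhead_fraction(20, 50, 30), 3))   # 0.971
```

In this made-up scenario about 97% of each prompt is demonstrations, which is the cost a fine-tuned model would avoid by sending only the query.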

ICL Scaling: Many-Shot ICL

Recent trend: many-shot ICL, made possible by very long context windows (e.g., Gemini with 1M tokens):

Provide hundreds or thousands of examples in the context
Can approach fine-tuning quality on some tasks without any training
Especially powerful for tasks with a highly consistent format

Related Concepts

Few-Shot, Zero-Shot, Chain of Thought, Emergent Abilities, Context Window, Attention