Prompt Caching

Definition

Prompt caching is an optimization where the LLM provider precomputes and stores the KV (key-value) cache for a repeated portion of the prompt — typically the system prompt or large context documents — so that subsequent requests reuse this computation instead of reprocessing it from scratch.

The Problem It Solves

In production applications, many requests share the same large prefix:

  • A 10,000-token system prompt sent with every API call
  • A 200-page document sent with every question about it
  • A large codebase sent with every code review request

Without caching: pay the full input token cost plus the full prefill compute on EVERY request.

With caching: pay the full cost ONCE, then roughly 10% of the input cost for the cached portion on cache hits (the exact discount varies by provider).

Cost and Latency Savings

| Metric | Without Cache | With Cache (hit) |
|--------|---------------|------------------|
| Input token cost | 100% | ~10% for the cached portion |
| Prefill latency | Full compute | Up to ~85% reduction |
| TTFT (time to first token) | Baseline | Much lower on cache hits |

Example: 10,000-token system prompt at $3/M tokens:

  • Without caching: $0.03 per request × 10,000 requests = $300
  • With caching: $0.03 first request + $0.003 × 9,999 remaining ≈ $30
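
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch under the simplified model used here: the first request pays the full input price and every later request pays a fixed fraction of it for the cached prefix. The function name and its parameters are illustrative and not part of any provider SDK.

```python
def prefix_cost(tokens: int, price_per_million: float, requests: int,
                hit_multiplier: float = 1.0) -> float:
    """Total input cost of a repeated prefix over a number of requests.

    The first request pays full price; each later request pays
    full price times hit_multiplier (1.0 means no caching).
    """
    if requests <= 0:
        return 0.0
    full = tokens / 1_000_000 * price_per_million
    return full + full * hit_multiplier * (requests - 1)


# Assumed inputs: the 10,000-token prompt, $3/M price, and 10,000 requests
# from the example, with cache hits billed at 10% of the input price.
without_cache = prefix_cost(10_000, 3.0, 10_000)
with_cache = prefix_cost(10_000, 3.0, 10_000, hit_multiplier=0.10)
print(f"without caching: ${without_cache:,.2f}")
print(f"with caching:    ${with_cache:,.2f}")
```

The model ignores any cache-write surcharge and assumes every request after the first is a hit, so real savings will be somewhat lower.
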

How It Works Technically

1. On the first request, the provider processes the entire prompt and stores KV states for the cacheable prefix.
2. On subsequent requests with the same prefix, the stored KV states are retrieved.
3. Only the new portion (user query + any changed parts) is computed.
4. The response is generated using the cached + new KV states.

The cache is keyed on the exact byte sequence of the prefix, so any change to the cached portion invalidates the cache.
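
The lookup behavior described above can be illustrated with a toy model. This is only a sketch: real providers store attention KV tensors and typically match on token prefixes, whereas the hypothetical `PrefixCache` class below stores a placeholder string and keys on a hash of the prefix text.

```python
import hashlib


class PrefixCache:
    """Toy stand-in for a provider-side prefix cache (not a real KV cache)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(prefix: str) -> str:
        # Key on the exact bytes of the prefix: one changed character
        # produces a different hash and therefore a cache miss.
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def process(self, prefix: str, suffix: str) -> str:
        """Return "hit" or "miss" for the prefix; the suffix is never cached."""
        key = self._key(prefix)
        if key in self._store:
            return "hit"
        # Placeholder for the expensive prefill over the prefix.
        self._store[key] = f"kv-states-for-{len(prefix)}-chars"
        return "miss"


cache = PrefixCache()
system = "You are an expert code reviewer."
print(cache.process(system, "Review module A"))        # miss
print(cache.process(system, "Review module B"))        # hit
print(cache.process(system + " ", "Review module B"))  # miss
```

The third call misses because a single trailing space changes the key, which is the invalidation behavior the text describes.
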

Provider Implementations

Anthropic Claude (Explicit Cache Control)

Developers explicitly mark what to cache using cache_control:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer. [10,000 token instructions]",
            "cache_control": {"type": "ephemeral"}  # ← cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[Large codebase - 50,000 tokens]",
                    "cache_control": {"type": "ephemeral"}  # ← cache this too
                },
                {"type": "text", "text": "Review the authentication module."}
            ]
        }
    ]
)

# Check cache usage
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
```

Cache pricing (Claude):

  • Cache write: 25% more than base input price (paid when the cache is written, including rewrites after it expires)
  • Cache read: 10% of base input price (paid on every cache hit)
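
To see when the write premium pays for itself, the two multipliers above can be plugged into a small calculation. This is a rough sketch: the 1.25 and 0.10 multipliers come from the pricing listed here, the $3/M base price is borrowed from the earlier example, and the function name is illustrative.

```python
def claude_prefix_cost(tokens: int, base_price_per_million: float, hits: int,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> tuple[float, float]:
    """Return (cost with caching, cost without) for one write plus `hits` reads."""
    base = tokens / 1_000_000 * base_price_per_million
    cached = base * write_multiplier + base * read_multiplier * hits
    uncached = base * (1 + hits)
    return cached, uncached


for hits in (0, 1, 10):
    cached, uncached = claude_prefix_cost(10_000, 3.0, hits)
    print(f"hits={hits:>2}  cached=${cached:.4f}  uncached=${uncached:.4f}")
```

Under these assumptions, caching costs more only when the prefix is never reused. A single cache hit already brings the total below the uncached cost.
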

OpenAI (Automatic)

OpenAI automatically caches prompts when the same prefix appears in multiple requests:

  • No explicit opt-in required
  • 50% discount on cached input tokens
  • Cache has a TTL (5–60 minutes)
  • Works best for system prompts and long repeated contexts
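
Because OpenAI's caching is automatic, the usual way to confirm it is working is to inspect the usage data returned with each response. The sketch below assumes the usage object has been converted to a plain dict and that it reports cached tokens under `prompt_tokens_details.cached_tokens`, as the Chat Completions API does. The helper name and the sample numbers are made up for illustration.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, between 0.0 and 1.0."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Hypothetical usage payload, not real API output.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1920},
}
print(f"cached share of prompt: {cached_fraction(sample_usage):.1%}")
```

A share that stays near zero across repeated requests usually means the prefix is changing between calls or the cache entry has expired.
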

Google Gemini (Explicit Context Caching)

```python
import datetime

import google.generativeai as genai

# large_document is a placeholder for the content to cache, defined elsewhere.
cached_content = genai.caching.CachedContent.create(
    model="gemini-1.5-pro",
    contents=[large_document],
    ttl=datetime.timedelta(hours=1)
)

# Build a model bound to the cached content, then query it as usual.
model = genai.GenerativeModel.from_cached_content(cached_content)
response = model.generate_content("What is the main finding?")
```

What to Cache (Best Practices)

High Value to Cache

| Content | Reason |
|---------|--------|
| System prompts > 1,000 tokens | Reused every request |
| Large reference documents | Sent with every Q&A request |
| Codebase for code review | Many questions about same code |
| Conversation history (long) | Grows with conversation |
| Tool/function definitions | Same set for all requests |

Cannot/Should Not Cache

| Content | Reason |
|---------|--------|
| User-specific content | Changes per user |
| Dynamic data (today's prices, live info) | Changes over time |
| Request-specific context | Unique per request |
| Short system prompts (<1K tokens) | Overhead not worth it |

Cache Invalidation

The cache is invalidated when:

  • The cached portion changes even by one character
  • The cache TTL expires (5 minutes by default for Claude, roughly 5–60 minutes for OpenAI, configurable for Gemini)
  • API version changes affect tokenization

Best practice: structure your prompt so the stable portion comes FIRST:

```
[Large stable system prompt]  ← cache this
[Large stable document]       ← cache this
[Dynamic user query]          ← NOT cached (changes each request)
```
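
The effect of this ordering can be checked by measuring how much of two consecutive prompts is identical from the start. The helpers below are a minimal sketch with illustrative names. They compare characters, whereas providers match on tokens, so treat the result as an approximation.

```python
import os


def build_prompt(system_prompt: str, document: str, user_query: str) -> str:
    """Assemble a prompt with stable parts first and the dynamic query last."""
    return "\n\n".join([system_prompt, document, user_query])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


SYSTEM = "You are a support assistant."
DOC = "Refund policy: 30 days."

good_1 = build_prompt(SYSTEM, DOC, "Can I return shoes?")
good_2 = build_prompt(SYSTEM, DOC, "Do you ship abroad?")
print(shared_prefix_len(good_1, good_2))  # the whole stable portion

# Anti-pattern: dynamic query first, so nothing reusable leads the prompt.
bad_1 = "\n\n".join(["Can I return shoes?", SYSTEM, DOC])
bad_2 = "\n\n".join(["Do you ship abroad?", SYSTEM, DOC])
print(shared_prefix_len(bad_1, bad_2))  # 0
```

A check like this in a test suite can catch accidental cache busters such as timestamps or request IDs placed at the top of the system prompt.
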

Compound Effect: Multi-Turn with Caching

In a long conversation, cache the system prompt + conversation history up to the current turn:

```
Turn 1: cache system (10K tokens) + user message
Turn 2: read cache + turn 1 + user message
Turn 3: read cache + turns 1-2 + user message
...
Turn N: 90% cost reduction on the system prompt portion for every turn
```
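
With explicit cache control, one way to implement this is to move a single cache marker to the end of the conversation before each request, so the next turn can reuse everything up to that point. The helper below is a sketch that assumes the Anthropic-style message format with `cache_control` blocks. The function name is hypothetical, and it only manipulates dictionaries without calling any API.

```python
import copy


def mark_latest_turn(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with one cache marker on the final block."""
    marked = copy.deepcopy(messages)
    for message in marked:
        # Normalize plain-string content into a list of text blocks.
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        # Drop markers left over from earlier turns.
        for block in message["content"]:
            block.pop("cache_control", None)
    if marked:
        marked[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "What does login() do?"},
    {"role": "assistant", "content": "It validates the session token."},
    {"role": "user", "content": "And logout()?"},
]
request_messages = mark_latest_turn(history)
print(request_messages[-1]["content"][-1])
```

Keeping a single moving marker avoids accumulating markers as the conversation grows. Whether this is the best placement depends on the provider's limits and minimum cacheable length.
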

Related Concepts

  • KV Cache, Context Window, Latency, Token Costs, System Prompt, API
