Lesson 02 / 14

02. Context window and prompt cache

The context window is both money and quality. The standard window for Opus/Sonnet 4.6 is 200k tokens; 1M is available via the opus[1m]/sonnet[1m] aliases. The prompt cache cuts the input bill by 10x on cache hits, but by default lives only 5 minutes.

The most common reason “Claude suddenly got dumber” is a full context window. The most common reason “it suddenly got expensive” is lost cache. This chapter is about how to avoid both.


2.1. What is a context window

The context window is the maximum number of tokens the model sees in a single request. This includes both input (everything you send it) and room for output (what it will write).

Limits as of April 23, 2026:

| Model | Standard window | 1M mode |
| --- | --- | --- |
| Claude Opus 4.7 | 200k | ✅ via alias opus[1m] |
| Claude Opus 4.6 | 200k | ✅ via alias opus[1m] |
| Claude Sonnet 4.6 | 200k | ✅ via alias sonnet[1m] |
| Claude Haiku 4.5 | 200k | — |

📘 From docs (model-config#extended-context): “Opus 4.7, Opus 4.6, and Sonnet 4.6 support a 1 million token context window”.

Enable 1M on Max/Team/Enterprise:

  • Opus 1M is included in the subscription.
  • Sonnet 1M comes as extra usage (additional charge).
  • Can be disabled via env: CLAUDE_CODE_DISABLE_1M_CONTEXT=1.

⚠️ opusplan mode (described below) does NOT support 1M window — even if you enabled sonnet[1m], the plan phase with Opus will run in standard 200k.


2.2. What makes up the context

Each request contains several layers. You can see them in the CLI with the /context command:

📘 /context (from docs commands): “Visualize current context usage as a colored grid. Shows optimization suggestions for context-heavy tools, memory bloat, and capacity warnings”.

What usually takes up the most tokens in a real session:

  1. Tool results — especially Read of large files and Bash with verbose output. The leader in “consuming” the window.
  2. Large CLAUDE.md — if you put README, ADRs, and changelog there “just in case”.
  3. Long history — every previous tool call with its result stays in the window.
  4. Tool definitions — the definitions of all tools (including MCP) can weigh 3–15k tokens, especially if you have 5+ MCP servers connected with dozens of tools each.

💡 Before a complex task, run /context — you’ll see the breakdown and understand what to cut.


2.3. Prompt cache: how it works

The Anthropic API supports prompt caching — a mechanism where a repeating prefix of a request is not recalculated from scratch.

📘 From platform docs (build-with-claude/prompt-caching): “By default, the cache has a 5-minute lifetime”. You can explicitly set TTL = 1 hour, but this costs 2× the input price on write (read stays the same).
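A sketch of what this looks like at the API level. Assumptions loudly labeled: the model id `claude-sonnet-4-6` and the exact body shape are illustrative, and the 1-hour TTL may additionally require a beta header depending on your API version — check the prompt-caching docs before relying on it.

```shell
# Build a Messages API request body whose stable prefix is marked for caching.
# Everything up to and including the cache_control breakpoint becomes the
# cached prefix; the per-request user message stays outside it.
cat > /tmp/cached_request.json <<'EOF'
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "(large stable system prompt: CLAUDE.md contents, project rules, ...)",
      "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Plan the next step of the Amadeus integration"}
  ]
}
EOF

# The actual call (needs ANTHROPIC_API_KEY; shown commented out):
# curl https://api.anthropic.com/v1/messages \
#   -H "x-api-key: $ANTHROPIC_API_KEY" \
#   -H "anthropic-version: 2023-06-01" \
#   -H "content-type: application/json" \
#   -d @/tmp/cached_request.json

grep -c '"cache_control"' /tmp/cached_request.json   # → 1
```

On the first such request you pay the cache-write premium on the prefix; repeat requests within the TTL that keep the prefix byte-identical are billed at the cache-read rate.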

Cache hierarchy (what counts as the “prefix”), in order: tools → system → messages.

If you change an earlier layer, all later ones are recalculated as well. For example: you add one MCP server → tools changes → the entire cache is invalidated.

How much cache vs no-cache costs (example: Sonnet 4.6, $3 per 1M input tokens):

| Operation | Price | When |
| --- | --- | --- |
| Cache write (5-min TTL) | 1.25 × input = $3.75 / 1M tokens | First request with this prefix |
| Cache read (hit) | 0.10 × input = $0.30 / 1M tokens | All subsequent requests within the TTL window |
| Cache write (1h TTL) | 2 × input = $6 / 1M tokens | If you requested cache_control.ttl = "1h" |
| No cache (regular input) | $3 / 1M tokens | If nothing is cached |

Savings calculation in a real Travel Agent session:

Say system + CLAUDE.md + tools = 25k tokens. You make 10 requests in 5 minutes.

  • Without cache: 10 × 25k × $3/1M = $0.75
  • With cache: 1 × 25k × $3.75/1M (write) + 9 × 25k × $0.30/1M (read) = $0.094 + $0.067 = $0.16

Savings of roughly 80%. This is why the prompt cache is a must-have, and losing it is a real pain.
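The arithmetic above can be re-derived with a quick shell one-liner (numbers only; the prices are the Sonnet 4.6 figures from the table):

```shell
# Verify the savings math: 25k-token prefix, 10 requests inside the TTL window.
awk 'BEGIN {
  prefix = 25000           # cached prefix size, tokens
  n      = 10              # requests within the 5-minute TTL
  input  = 3.00 / 1e6      # $3 per 1M input tokens (Sonnet 4.6)
  write  = 1.25 * input    # cache write, 5-min TTL
  read   = 0.10 * input    # cache read (hit)

  no_cache = n * prefix * input
  cached   = prefix * write + (n - 1) * prefix * read
  printf "no cache: $%.2f   cached: $%.2f   saved: %.1f%%\n",
         no_cache, cached, (1 - cached / no_cache) * 100
}'
# → no cache: $0.75   cached: $0.16   saved: 78.5%
```

The exact figure is 78.5%, which the text rounds to “about 80%”; the bigger the stable prefix relative to the per-request messages, the closer you get to the theoretical 90% ceiling of the 0.10× read rate.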


2.4. When cache breaks

Cache is invalidated (or expires) when:

| Event | What happens | How to avoid |
| --- | --- | --- |
| 5 minutes pass without requests | TTL expires; the next request writes the cache again | Raise TTL to 1h (cache_control.ttl: "1h") or work without pauses |
| You change tools (add an MCP server, plugin) | Cache invalidated at the tools level | Don’t connect MCP servers in the middle of a session |
| You change system (edit CLAUDE.md, add a skill) | Cache invalidated at the system level | Finalize CLAUDE.md before starting work |
| You switch models via /model | The cache is specific to each model | Use opusplan (it manages the switch) or start a new session |
| /clear or a new session | The prefix is recreated from scratch | This is normal; a cache write is a one-time cost |
| /compact | Old history is replaced with a summary, then a new prefix for messages | Also normal; saves the window at the cost of one cache write |

⚠️ Myth: “switching models breaks prompt cache forever”. Reality: the next request will be a cache miss (costs more once), then everything caches again on the new model.

📘 From docs /model: “opens a picker that asks for confirmation when the conversation has prior output, since the next response re-reads the full history without cached context”.

That is, model switching is a one-time extra cost. Not “forever” and not “impossible”. Just account for it.


2.5. opusplan — the feature that’s wrongly called “Advisor mode”

In the thread we started with, “Advisor mode” was mentioned. In official docs, there’s no such feature. The real feature is called opusplan.

📘 From docs model-config#opusplan-model-setting: “Special mode that uses opus during plan mode, then switches to sonnet for execution”.

# In the CLI
/model opusplan

What happens: in plan mode, requests go to Opus (it does the heavy reasoning over the plan); when you exit plan mode and start executing, Claude Code switches to Sonnet for the edits themselves.

Limitations of opusplan:

  • ⚠️ Opus phase runs in standard 200k, even if you enabled sonnet[1m].
  • ⚠️ The switch is still two different models → one cache miss at the boundary (but then each model caches separately).
  • 💡 Good for architectural tasks: “plan a big refactor” → detailed plan from Opus → cheap execution by Sonnet.

🔧 For Travel Agent: enable opusplan when rewriting Amadeus integration or changing DB schema. Not needed for routine React component fixes.


2.6. /compact, /clear, and auto-compaction

/compact [hint] — Claude retells the current session in brief form, replacing long history with a summary. hint — what to especially preserve.

/compact "keep the decisions on the MCP server architecture and the current list of open TODOs"

/clear — full reset. Keeps only system + CLAUDE.md. Message cache is completely lost.

Auto-compaction — the harness runs /compact itself when the window fills above a threshold (default ~95%).

📘 Controlled by env variable CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (1–100). Not CLAUDE_CODE_AUTO_COMPACT_THRESHOLD, as sometimes written.

# In .zshrc / .bashrc
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=80   # run compaction at 80% fill

There’s also CLAUDE_CODE_AUTO_COMPACT_WINDOW — lets you “lie” to the harness about window size (e.g., on 1M models count as 500k so quality doesn’t drop). From practice:

export CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000  # keep the session at a "virtual" 400k

⚠️ Empirical, not from docs: many practitioners (including the original Twitter thread) claim that “after 300–400k quality drops”. Anthropic has no public benchmarks on this, but the symptoms are familiar: the model starts forgetting early decisions, contradicting itself, re-reading the same files. If you hit 400k — seriously think about /compact or a new session.


2.7. When to /compact, when to /clear, when to start a new session

Rule of thumb for Travel Agent:

  • Finishing one React component → done → /clear before the next.
  • Long debugging of one feature (several hours) → /compact "keep context about SSE stream and current bug".
  • Switching from frontend to backend → new session (different contexts, little in common).

2.8. Checklist: “how not to burn the window and cache”

✅ Before a long task, run /context — assess the starting fill.
✅ Keep CLAUDE.md ≤ 5k tokens. If it’s larger — split it into subdirectory CLAUDE.md files (see 03).
✅ Connect MCP servers before starting work, not in the middle.
✅ If a task lasts > 5 minutes with pauses — ask the harness to use the 1h TTL (settings flag, or pass cache_control.ttl = "1h" via the SDK).
✅ Don’t Read entire huge files (50k-line logs) — use Grep or offset/limit.
✅ After each completed task — /clear.
✅ For architectural decisions, enable opusplan (plan by Opus, execution by Sonnet).
✅ If you notice the model “got dumber” after a long session — /compact or start fresh.
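The “don’t Read entire huge files” point, illustrated in plain shell (the log file here is a synthetic stand-in, not a real Travel Agent artifact):

```shell
# Synthetic stand-in for a 50k-line log.
seq 1 50000 | sed 's/^/line /' > /tmp/huge.log

# Bad: dumping everything floods the context window with ~50k lines of output.
# cat /tmp/huge.log

# Good: targeted slices keep tool results small.
grep -nx 'line 4242' /tmp/huge.log    # exact-match search with line number
sed -n '100,120p' /tmp/huge.log       # read a 21-line window by offset
```

The same discipline applies to Claude’s Read tool: ask for a Grep first, then Read with offset/limit around the hit.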

⚠️ What NOT to do:

❌ Use the 1M window “because it exists” — it’s both expensive and lower quality.
❌ Connect all MCP servers “just in case” — each one bloats tools.
❌ Dump the entire README, license, changelog, and dependency list into CLAUDE.md.
❌ Switch models mid-task without reason (or use opusplan, which manages the switch).


Next → 03. CLAUDE.md: levels, imports, auto-memory