Lesson 02 / 14
02. Context window and prompt cache
The context window is both money and quality. Opus/Sonnet 4.6 standard — 200k tokens; 1M is available via the opus[1m]/sonnet[1m] aliases. The prompt cache cuts the input bill roughly 10x, but by default it lives only 5 minutes.
The most common reason “Claude suddenly got dumber” is a full context window. The most common reason “it suddenly got expensive” is lost cache. This chapter is about how to avoid both.
2.1. What is a context window
The context window is the maximum number of tokens the model sees in a single request. This includes both input (everything you send it) and space for output (what it will write).
Limits as of 23.04.2026:
| Model | Standard window | 1M mode |
|---|---|---|
| Claude Opus 4.7 | 200k | ✅ via alias opus[1m] |
| Claude Opus 4.6 | 200k | ✅ via alias opus[1m] |
| Claude Sonnet 4.6 | 200k | ✅ via alias sonnet[1m] |
| Claude Haiku 4.5 | 200k | ❌ |
📘 From docs (model-config#extended-context): “Opus 4.7, Opus 4.6, and Sonnet 4.6 support a 1 million token context window”.
Enable 1M on Max/Team/Enterprise:
- Opus 1M is included in the subscription.
- Sonnet 1M comes as extra usage (additional charge).
- Can be disabled via env: CLAUDE_CODE_DISABLE_1M_CONTEXT=1.
⚠️ opusplan mode (described below) does NOT support 1M window — even if you enabled sonnet[1m], the plan phase with Opus will run in standard 200k.
2.2. What makes up the context
Each request contains several layers. You can see them in the CLI with the /context command:
📘 /context (from docs commands): “Visualize current context usage as a colored grid. Shows optimization suggestions for context-heavy tools, memory bloat, and capacity warnings”.
What usually takes up the most tokens in a real session:
- Tool results — especially Read of large files and Bash with verbose output. The leader in "consuming" the window.
- Large CLAUDE.md — if you put README, ADRs, and changelog there "just in case".
- Long history — every previous tool call with its result stays in the window.
- Tool definitions — definitions of all tools (including MCP) can weigh 3-15k. Especially if you have 5+ MCP servers connected with dozens of tools each.
💡 Before a complex task, run /context — you’ll see the breakdown and understand what to cut.
2.3. Prompt cache: how it works
The Anthropic API supports prompt caching — a mechanism where a repeating prefix of a request is not recalculated from scratch.
📘 From platform docs (build-with-claude/prompt-caching): “By default, the cache has a 5-minute lifetime”. You can explicitly set TTL = 1 hour, but this costs 2× the input price on write (read stays the same).
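As a sketch, here is how a cacheable system prefix is marked in a Messages API request body. The field names (system blocks with cache_control, the "ephemeral" type, the optional ttl) follow the prompt-caching docs; the model id string is illustrative, and whether the 1h TTL is available depends on your API/SDK version — treat both as assumptions:

```python
# Sketch: building a Messages API request body with a cacheable system
# prefix. Constructs the dict only; sending it requires the Anthropic SDK.
def build_request(system_text: str, user_text: str, ttl: str = "5m") -> dict:
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        # Extended TTL: 2x input price on write, same price on read.
        cache_control["ttl"] = "1h"
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            # The cache_control marker tells the API to cache everything
            # up to and including this block.
            {"type": "text", "text": system_text, "cache_control": cache_control}
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```

The point of putting cache_control on the system block is that everything before it (the stable prefix) gets cached, while the per-request user message stays outside the cached span.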
Cache hierarchy (what counts as a "prefix"), from earliest layer to latest:
1. tools — tool definitions (built-in + MCP)
2. system — system prompt, CLAUDE.md, skills
3. messages — the conversation history
If you change an earlier layer, all later ones are also recalculated. For example, you add one MCP server → tools changes → the entire cache is invalidated.
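The invalidation behavior can be modeled as a hash chain: each layer's cache key covers everything before it, so editing an early layer changes every downstream key. This is a toy model for intuition, not how the API computes keys:

```python
import hashlib

# Toy model of prefix caching: each layer's key is a hash over the
# cumulative prefix up to and including that layer. Editing an earlier
# layer therefore changes every later key (= cache miss downstream).
def prefix_keys(layers: list[str]) -> list[str]:
    keys, running = [], ""
    for layer in layers:
        running += layer
        keys.append(hashlib.sha256(running.encode()).hexdigest()[:12])
    return keys

before = prefix_keys(["tools-v1", "system", "messages"])
after = prefix_keys(["tools-v2", "system", "messages"])  # added an MCP server
# All three keys differ, even though "system" and "messages" are unchanged.
```

Conversely, appending new messages at the end leaves the tools and system keys intact — which is exactly why a stable prefix keeps caching across a session.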
How much cache vs no-cache costs (example: Sonnet 4.6, $3 per 1M input tokens):
| Operation | Price | When |
|---|---|---|
| Cache write (5min TTL) | 1.25 × input = $3.75 / 1M tokens | First request with this prefix |
| Cache read (hit) | 0.10 × input = $0.30 / 1M tokens | All subsequent within TTL window |
| Cache write (1h TTL) | 2 × input = $6 / 1M tokens | If you requested cache_control.ttl="1h" |
| No-cache (regular input) | $3 / 1M tokens | If there’s no cache at all |
Savings calculation in a real Travel Agent session:
Say system + CLAUDE.md + tools = 25k tokens. You make 10 requests in 5 minutes.
- Without cache: 10 × 25k × $3/M = $0.75
- With cache: 1 × 25k × $3.75/M (write) + 9 × 25k × $0.30/M (read) = $0.094 + $0.067 ≈ $0.16
Savings — 80%. This is why prompt cache is a must-have, and losing it is a real pain.
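The same arithmetic as a small function, using the multipliers from the pricing table above (write 1.25×, read 0.10× the base input price):

```python
# Recomputes the Travel Agent example: a 25k-token prefix, 10 requests
# within the TTL window, Sonnet base input price $3 per 1M tokens.
def session_cost(prefix_tokens: int, requests: int,
                 input_per_m: float = 3.00, cached: bool = True) -> float:
    if not cached:
        return requests * prefix_tokens * input_per_m / 1e6
    write = prefix_tokens * input_per_m * 1.25 / 1e6          # first request
    reads = (requests - 1) * prefix_tokens * input_per_m * 0.10 / 1e6
    return write + reads

no_cache = session_cost(25_000, 10, cached=False)   # $0.75
with_cache = session_cost(25_000, 10)               # ~$0.16
savings = 1 - with_cache / no_cache                 # ~79%
```

Note the savings grow with the number of requests: the one-time write premium is amortized, and every extra request pays only the 0.10× read price.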
2.4. When cache breaks
Cache is invalidated (or expires) when:
| Event | What happens | How to avoid |
|---|---|---|
| 5 minutes pass without requests | TTL expires; next request writes the cache again | Raise TTL to 1h (cache_control.ttl: "1h") or work without pauses |
| You change tools (add MCP, plugin) | Cache invalidated at the tools level | Don't connect MCP in the middle of a session |
| You change system (edit CLAUDE.md, add a skill) | Cache invalidated at the system level | Finalize CLAUDE.md before starting work |
| You switch models via /model | Cache is specific to each model | Use opusplan (it manages switching) or start a new session |
| /clear or a new session | Prefix is recreated from scratch | This is normal; the cache write is a one-time cost |
| /compact | Old history replaced with a summary, then a new prefix for messages | Also normal; saves window at the cost of one cache write |
⚠️ Myth: “switching models breaks prompt cache forever”. Reality: the next request will be a cache miss (costs more once), then everything caches again on the new model.
📘 From docs /model: “opens a picker that asks for confirmation when the conversation has prior output, since the next response re-reads the full history without cached context”.
That is, model switching is a one-time extra cost. Not “forever” and not “impossible”. Just account for it.
2.5. opusplan — the feature that’s wrongly called “Advisor mode”
In the thread we started with, “Advisor mode” was mentioned. In official docs, there’s no such feature. The real feature is called opusplan.
📘 From docs model-config#opusplan-model-setting: “Special mode that uses opus during plan mode, then switches to sonnet for execution”.
# In the CLI
/model opusplan
What happens: in plan mode, requests go to Opus; when you exit plan mode, execution switches to Sonnet.
Limitations of opusplan:
- ⚠️ The Opus phase runs in the standard 200k, even if you enabled sonnet[1m].
- ⚠️ The switch is still two different models → one cache miss at the boundary (but then each model caches separately).
- 💡 Good for architectural tasks: “plan a big refactor” → detailed plan from Opus → cheap execution by Sonnet.
🔧 For Travel Agent: enable opusplan when rewriting Amadeus integration or changing DB schema. Not needed for routine React component fixes.
2.6. /compact, /clear, and auto-compaction
/compact [hint] — Claude retells the current session in brief form, replacing long history with a summary. hint — what to especially preserve.
/compact "keep the decisions on the MCP server architecture and the current list of open TODOs"
/clear — full reset. Keeps only system + CLAUDE.md. Message cache is completely lost.
Auto-compaction — the harness runs /compact itself when the window fills above a threshold (default ~95%).
📘 Controlled by env variable CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (1–100). Not CLAUDE_CODE_AUTO_COMPACT_THRESHOLD, as sometimes written.
# In .zshrc / .bashrc
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=80  # trigger compaction at 80% fill
There’s also CLAUDE_CODE_AUTO_COMPACT_WINDOW — lets you “lie” to the harness about window size (e.g., on 1M models count as 500k so quality doesn’t drop). From practice:
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000  # keep the session within a "virtual" 400k
⚠️ Empirical, not from docs: many practitioners (including the original Twitter thread) claim that “after 300–400k quality drops”. Anthropic has no public benchmarks on this, but the symptoms are familiar: the model starts forgetting early decisions, contradicting itself, re-reading the same files. If you hit 400k — seriously think about /compact or a new session.
2.7. When to /compact, when to /clear, when to start a new session
Rule of thumb for Travel Agent:
- Finishing one React component → done → /clear before the next.
- Long debugging of one feature (several hours) → /compact "keep context about the SSE stream and the current bug".
- Switching from frontend to backend → new session (different contexts, little in common).
2.8. Checklist: “how not to burn the window and cache”
✅ Before a long task, run /context — assess starting fill.
✅ Keep CLAUDE.md ≤ 5k tokens. Larger — split into subdirectory CLAUDE.md (see 03).
✅ Connect MCP servers before starting work, not in the middle.
✅ If a task lasts > 5 minutes with pauses — ask the harness for a 1h TTL (settings flag, or pass cache_control.ttl="1h" via the SDK).
✅ Don’t Read entire huge files (50k-line logs) — use Grep or offset/limit.
✅ After each completed task — /clear.
✅ For architectural decisions enable opusplan (plan by Opus, execution by Sonnet).
✅ If you catch “model is dumb after long session” — do /compact or start fresh.
⚠️ What NOT to do:
❌ Use 1M window “because it exists” — it’s both expensive and worse quality.
❌ Connect all MCP servers “just in case” — each bloats tools.
❌ Dump entire README, license, changelog, and dependency list into CLAUDE.md.
❌ Switch models mid-complex-task without reason (or use opusplan).