TL;DR
Learn how to substantially cut context token usage without losing quality. This guide covers the fundamentals of context management in AI-powered development: why LLMs are stateless, how context grows naturally over time, and four battle-tested strategies to keep your AI workflows efficient. Master conversation management, sub-agents, external knowledge systems, and context layering to build AI systems that scale.
The Context Crisis
It's 3 PM and your agent failed again — after reading irrelevant files, burning 180k tokens, and producing broken code. More context made results worse because the model drowned in noise. This is the context crisis: developers assume more context = better performance, but that's wrong.
In our last article on zero‑config Claude Code plugins in Devcontainers, we showed how we built a language-agnostic plugin system. The key insight? Convention over configuration through context separation. We separated generic plugin instructions from project-specific conventions, loading only what was needed when it was needed. This wasn't just a design preference — it was a fundamental requirement driven by how AI context actually works.
But what exactly is context, and why does it matter so much? Let's dive deep into the mechanics.
What is Context? Understanding AI's Working Memory
The Fundamental Truth: LLMs Are Stateless
This is critical to understand: Every single message you send includes the ENTIRE context from scratch.
The LLM doesn't "remember" your previous conversation. It doesn't have persistent memory. Each request includes the full context assembled by the client: system prompt, tools, memory files, message history, and your new input. The model processes that request and returns only a response. Then it forgets everything.
Think of it like calling a stateless API endpoint. Every request is independent. When you send your next message, you're not continuing a conversation — you're starting fresh, but with the previous exchange included in the new request.
What This Means:
Message 1: Send 50k tokens → Get response
Message 2: Send 50k tokens + previous exchange (~55k total) → Get response
Message 10: Send 50k tokens + all previous exchanges (~120k total) → Getting tight!
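Here is what that loop looks like if you drive a model through its raw API (a minimal sketch using the Anthropic Python SDK; any stateless LLM endpoint behaves the same way): the client owns the history and resends all of it on every call.

```python
# Minimal sketch: the client, not the model, owns the conversation.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
history = []  # the full message history lives on OUR side

def send(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system="You are a concise coding assistant.",  # resent with every single call
        messages=history,                               # the ENTIRE history, every time
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

send("Summarize the repo layout.")   # request carries 1 user message
send("Now list the test commands.")  # request carries 3 messages: everything so far
```

Nothing about the first call survives on the server; only the growing history list on the client ties the turns together.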
The Growing Context Problem
Context naturally grows over time. Every message adds to the history. Every tool result adds more data. Every file you reference expands the context window.
For Sonnet 4/4.5, the usual context window is ~200k tokens — and that's a hard ceiling for those models. Think of it like RAM for a computer. Other models/versions may differ.
Context Anatomy: What's Taking Up Space?
Here's a real example from Claude Code's /context command (Sonnet 4.5 with a ~200k-token window; values vary by model/version):
Context Usage
claude-sonnet-4-5-20250929 · 77k/200k tokens (38%)
System prompt: 2.5k tokens (1.3%)
System tools: 13.4k tokens (6.7%)
MCP tools: 3.0k tokens (1.5%)
Custom agents: 161 tokens (0.1%)
Memory files: 6.5k tokens (3.3%)
Messages: 6.2k tokens (3.1%)
Free space: 123k (61.6%)
Autocompact buffer: 45.0k tokens (22.5%)
This is a healthy context profile. 61% free space means there's plenty of room for the AI to work.
How Data Enters Context
Context grows through three primary mechanisms:
Baseline Loading: System prompt, tools, memory files, MCP servers - always present
Conversation Accumulation: Message history grows with each turn - the key growth factor
Dynamic Loading: Reading files, pasted text, tool outputs, searching codebase - added during conversation
Every action adds to context and accumulates across turns - what you send is exactly what the LLM sees.
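To make that assembly concrete, here is a rough sketch of what a single request payload contains in a self-built agent loop (the part names and the ~4-characters-per-token estimate are illustrative assumptions, not Claude Code's internals):

```python
# Rough sketch of a single request payload: baseline + history + dynamic loads.
# The ~4 chars/token estimate is crude; real tokenization varies by model.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_request(system_prompt: str, tool_defs: list[str], memory_files: list[str],
                  history: list[str], loaded_this_turn: list[str], user_input: str) -> list[str]:
    baseline = [system_prompt, *tool_defs, *memory_files]   # always present, every request
    parts = baseline + history + loaded_this_turn + [user_input]
    print(f"baseline ~{sum(map(estimate_tokens, baseline))} tokens, "
          f"history ~{sum(map(estimate_tokens, history))} tokens, "
          f"total ~{sum(map(estimate_tokens, parts))} tokens")
    return parts  # this list, and nothing else, is what the model sees this turn
```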
The Cost of Context Pollution
Context pollution isn't just an annoyance - it's a performance killer. When your context window fills with irrelevant information, everything degrades.
Performance and attention: More tokens slow processing and dilute attention across noise, obscuring the few tokens that actually matter.
Token economics: With a 200k limit, if your baseline setup uses 80k tokens, you only have 120k left for actual reasoning and file analysis.
Latency: Larger prompts increase response time, breaking flow and productivity.
Cost: As we discussed above, LLMs are stateless - every message resends the entire context. Larger context means higher costs per request, even with caching (cache reads aren't free, and new content isn't cached).
Context pollution is expensive. Every irrelevant token costs you speed, accuracy, reliability, and money. Optimize ruthlessly.
Context Observability: Debugging Your Context
You can't optimize what you can't measure. Context observability is critical for understanding where your tokens are going and identifying optimization opportunities.
The /context Command
Claude Code provides the /context command for real-time visibility into token distribution. This is your context dashboard - use it frequently to monitor system prompt, tools, memory files, messages, and free space.
Best Practices
Focus on context quality over quantity. You could be at 90% free space with irrelevant content and get poor results, or at 40% with highly relevant context and get excellent results.
Proactive Context Management:
Run /context at the start of each major task to establish baseline
If already at 80%+ usage, use /clear or /compact before starting new work
Disable unused MCP servers immediately with /mcp (usually saves 5-15k tokens per server, depending on the number of tools and their description complexity)
Use /clear when switching between unrelated tasks (fresh start)
Offload specialized work to sub-agents to keep main context lean
Monitor trends - if free space decreases rapidly, investigate what's consuming tokens (large files? verbose tool outputs? unused MCP servers?)
Don't let context pressure build until it becomes a problem. Manage proactively, not reactively.
Context Management Strategies
Now that we understand what context is and why it matters, let's explore practical strategies for managing it effectively. These aren't theoretical concepts - they're battle-tested patterns from building production AI workflows at Zenity.
Strategy 1: Context Pruning & Conversation Management
Pattern: Actively remove irrelevant history.
Long conversations accumulate cruft. Early messages become irrelevant as the conversation evolves. Prune aggressively.
Implementation:
Use /clear for fresh starts when switching tasks, or /compact to compress while preserving key context (you can add optional instructions like /compact focus on the API changes). Design workflows that complete in fewer conversational turns to minimize accumulation.
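Inside Claude Code, /clear and /compact do this for you. If you drive the API yourself, the same compaction is straightforward to sketch (the token budget, the 4-chars-per-token estimate, and the summarization prompt below are illustrative assumptions, not Claude Code's actual mechanism):

```python
# Sketch of manual compaction for a self-built agent loop (not Claude Code internals).
# BUDGET_TOKENS, KEEP_RECENT, and the chars/token estimate are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5-20250929"
BUDGET_TOKENS = 100_000   # compact well before the hard limit
KEEP_RECENT = 6           # keep the last few turns verbatim

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(str(m["content"])) // 4 for m in messages)

def compact(history: list[dict]) -> list[dict]:
    if estimate_tokens(history) < BUDGET_TOKENS:
        return history
    old, recent = history[:-KEEP_RECENT], list(history[-KEEP_RECENT:])
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system="Compress this conversation. Preserve decisions, file paths, and open TODOs.",
        messages=[{"role": "user", "content": f"Summarize:\n\n{transcript}"}],
    ).content[0].text
    # Fold the summary into the first retained turn (assumes alternating user/assistant
    # history ending with an assistant reply, so recent[0] is a user turn).
    recent[0] = {"role": "user",
                 "content": f"[Summary of earlier conversation]\n{summary}\n\n{recent[0]['content']}"}
    return recent
```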
Understanding Autocompact:
Claude Code automatically compresses context at ~155k tokens (~77.5% usage) for Sonnet 4/4.5, leaving a 45k safety buffer. You can disable this with /config set autocompact false but must then manually manage context with /clear or /compact before hitting the hard limit.
Strategy 2: Specialized Sub-Agents
Pattern: Delegate specific operations to context-efficient sub-agents.
Monolithic agents carry everything - all tools, all context, all capabilities. This is inefficient. Instead, create focused sub-agents that specialize in specific tasks, then delegate work to them. For example, a Utils Agent handles Git/JIRA/GitHub operations with specialized tools, completes the task, and returns only a brief summary to your main agent.
Critical Advantage: Each sub-agent runs in isolated context that's cleared on exit. For example, Utils Agent loads specialized Git/JIRA/Python tools (15-40k tokens), executes commands, and returns only essential results (~50-200 tokens) to your main conversation. Your main agent never carries that tool overhead.
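A minimal sketch of the delegation pattern, using the raw API to illustrate the idea rather than Claude Code's actual sub-agent machinery (the git_log tool definition is a made-up placeholder):

```python
# Sketch of the sub-agent pattern: an isolated, short-lived conversation whose tool
# definitions and intermediate output never enter the main history.
# (Illustrates the idea with the raw API, not Claude Code's sub-agent implementation.)
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5-20250929"

# Hypothetical tool definition that only the sub-agent pays for.
GIT_TOOLS = [{
    "name": "git_log",
    "description": "Return the last N commits of the current repository.",
    "input_schema": {"type": "object", "properties": {"n": {"type": "integer"}}},
}]

def run_utils_subagent(task: str) -> str:
    """Runs in its own context; everything here is discarded except the summary."""
    result = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system="You are a Git/JIRA utility agent. Do the task, then reply with a 2-3 sentence summary.",
        tools=GIT_TOOLS,  # loaded only inside the sub-agent's context
        messages=[{"role": "user", "content": task}],
    )
    # A real implementation would execute tool_use blocks in a loop; omitted for brevity.
    return next((b.text for b in result.content if b.type == "text"), "")

main_history: list[dict] = []  # the main agent's context stays lean
summary = run_utils_subagent("Summarize the last 5 commits touching the auth module.")
main_history.append({"role": "user", "content": f"Sub-agent report: {summary}"})
```

The main conversation only ever sees the short report, never the tool schemas or the raw command output.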
Strategy 3: External Knowledge Systems
Pattern: Store large reference datasets outside the context window, query on-demand.
Not everything belongs in context - even dynamically loaded. For massive documentation sets, historical decisions, or organizational knowledge that spans thousands of pages, use external systems with semantic search or structured queries.
When to Use External Systems vs Dynamic Loading:
Dynamic Loading (Read/Grep): Project files, examples, test patterns - data already in your codebase
External Systems: Third-party library docs, company wikis, historical decisions, API references that exceed codebase scope
Use MCP servers like Context7 to fetch library documentation on-demand, or connect to Confluence/ClickUp/etc for organizational knowledge. Instead of loading 50-100k tokens of comprehensive docs, query semantically and bring in only the 2-3k tokens relevant to your current task.
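A sketch of that query-on-demand flow (query_knowledge_base is a hypothetical stand-in for whatever backs your setup: a Context7 MCP call, a Confluence search, a vector store):

```python
# Sketch of querying an external knowledge system instead of loading full docs.
# query_knowledge_base() is a hypothetical placeholder for your actual backend.
import anthropic

client = anthropic.Anthropic()

def query_knowledge_base(query: str, max_snippets: int = 3) -> list[str]:
    """Hypothetical: semantic search over an external docs store, returning short snippets.
    Wire this to your MCP server, wiki search API, or vector database."""
    return ["<snippet placeholder: replace with real search results>"][:max_snippets]

def answer_with_external_docs(question: str) -> str:
    snippets = query_knowledge_base(question)      # ~2-3k tokens, not 50-100k
    context_block = "\n\n".join(snippets)
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system="Answer using the provided documentation excerpts.",
        messages=[{
            "role": "user",
            "content": f"Relevant documentation:\n{context_block}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```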
Strategy 4: Context Layering with Structured Memory
Pattern: Organize context into tiers - keep essentials small, load details on demand.
The temptation is to frontload all possible information into system prompts or memory files. Resist this urge. Instead, use a three-tier architecture that keeps always-loaded context minimal while preserving access to comprehensive information.
The Three-Tier Architecture:
System Prompt (Tier 1): Generic workflow instructions (~500 tokens)
Always present, sent with every request
Keep language-agnostic and project-agnostic
Focus on process, not content
Memory Files (Tier 2): Core project conventions (~4-6k tokens)
Always present, sent with every request (LLMs are stateless)
Store in CLAUDE.md and related files (which can also be placed in subdirectories for complex projects)
Version controlled alongside code
Focus on essential rules and patterns, not comprehensive examples
Dynamic Loading (Tier 3): Detailed documentation (10-50k tokens)
Loaded only when task requires it
Fetched via tools during conversation
Added to message history temporarily
Implementation example:
Instead of embedding everything in your system prompt (which could be 53k+ tokens), use a three-tier approach: keep your system prompt generic (~500 tokens), store essential project conventions in CLAUDE.md (~4-6k tokens), and fetch detailed documentation only when implementing specific features. This reduces your always-present context from 53k to under 7k tokens.
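As a concrete sketch of that split in a self-built loop driving the API directly (Claude Code already handles Tier 1 and Tier 2 for you via its system prompt and CLAUDE.md; the docs/python-conventions.md path and the keyword trigger are illustrative assumptions):

```python
# Sketch of three-tier context layering in a self-built agent loop.
# Tier 1: tiny generic system prompt. Tier 2: CLAUDE.md, always sent.
# Tier 3: detailed docs, fetched only when the task needs them.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()

TIER1_SYSTEM = "You are a planning agent. Follow the project conventions provided."  # ~500 tokens
TIER2_MEMORY = Path("CLAUDE.md").read_text()                                         # ~4-6k tokens

def load_detailed_docs(task: str) -> str:
    """Tier 3: load detailed docs only when the task requires them (illustrative rule)."""
    if "python" in task.lower():
        return Path("docs/python-conventions.md").read_text()   # hypothetical file
    return ""

def plan(task: str) -> str:
    tier3 = load_detailed_docs(task)
    content = "\n\n".join(part for part in (TIER2_MEMORY, tier3, f"Task: {task}") if part)
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        system=TIER1_SYSTEM,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text
```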
Real-World Application:
Our Planner agent demonstrates this approach: it always loads core conventions from CLAUDE.md (~4k tokens) but fetches language-specific documentation only when implementing features in that language. Same agent, same plugin, but only relevant detailed documentation is loaded per task.
Key Benefits:
Version Control: Track convention changes in git alongside code
Reusability: Share standards across projects and team members
Maintainability: Update conventions in one place, affects all sessions
Clarity: Separate essential rules (always loaded) from detailed examples (loaded on demand)
Efficiency: Keep always-present context minimal while maintaining access to comprehensive information
Conclusion: Context is Code
Context engineering is a core competency for AI-powered development. Just as we optimize algorithms for complexity, we must optimize AI workflows for context efficiency.
Key Takeaways
Context is finite. Manage it like memory. Every token counts.
LLMs are stateless. Every message sends the entire context from scratch.
Less is often more. Streamlined context = faster, more accurate results.
Use the right strategy. Conversation management, sub-agents, external knowledge systems, context layering - each has its place.
Monitor continuously. Run /context frequently and optimize proactively.
Context engineering is the difference between AI tools that feel magical and AI tools that feel frustrating. Between agents that consistently deliver and agents that randomly fail. Between workflows that scale and workflows that collapse.
