Your context window is not free

Engraph · 8 min read

Something changed in your setup last month. You installed a few MCP servers (GitHub, maybe a database tool, a browser automation one because someone on Twitter said it was incredible) and your agent has more capabilities than ever. It also takes longer to respond and makes worse decisions than it did before you gave it all this power, and you haven't connected the two things yet.

The GitHub MCP server alone loads roughly 55,000 tokens (as measured in early 2026) of tool schemas into your context window at session start. That's the full menu - tool names, parameter descriptions, type definitions, JSON schemas - read by the agent before it can order anything. If you've added four or five MCP servers, you might be spending 100,000+ tokens on tool definitions before the agent writes a single line of code. On some models, that's half the context window gone on a table of contents.
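
The arithmetic is worth doing for your own setup. Here's a back-of-envelope sketch; the per-server numbers are illustrative (only the GitHub and Chrome DevTools figures come from measurements cited in this post), and real overhead varies by server version.

```python
# Back-of-envelope context budget. Per-server token counts are illustrative
# estimates, not measurements - substitute your own servers and numbers.
CONTEXT_WINDOW = 200_000  # e.g. a 200k-token model

schema_overhead = {
    "github": 55_000,              # the worst-case outlier discussed above
    "database": 12_000,            # hypothetical
    "browser-automation": 20_000,  # hypothetical
    "chrome-devtools": 17_000,
}

total = sum(schema_overhead.values())
print(f"schema overhead: {total:,} tokens "
      f"({total / CONTEXT_WINDOW:.0%} of the window before any work happens)")
# -> schema overhead: 104,000 tokens (52% of the window before any work happens)
```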

This isn't an MCP hit piece; MCP solves real problems that needed solving. But the token cost is real too, and most people aren't thinking about it. (A caveat on the numbers: the GitHub MCP server is a known worst-case outlier - it was written early, exposes 90+ tools, and nobody optimized for schema size. Most production MCP servers are leaner. But the pattern holds even at smaller scales when you stack multiple servers.)

Three ways to give your agent capabilities

There are three patterns for connecting agents to the outside world. They're different layers, not competing products, though the ecosystem treats them that way.

CLI tools are shell commands the agent invokes through Bash. Your agent already knows git, grep, curl, docker, gh, and dozens of vendor CLIs from its training data. The tool definition lives in the model's weights, not in your context window - zero tokens until the command runs, then you pay only for the output.

MCP (the Model Context Protocol) is Anthropic's open standard for connecting agents to external tools and services. Each MCP server publishes a structured schema so the agent knows what's available and how to call it. This solves problems CLI can't: OAuth flows across multiple users, stateful workflows where the agent needs structured input and output rather than text piped through stdout. The schemas also carry real value beyond just capability - the agent can reason about types and error contracts precisely, which matters when you need it to use a tool correctly on the first try. But every schema lives in your context window for the duration of the session, whether the agent calls that tool or not. The ecosystem is aware of this and actively working on it.
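
To make concrete what "lives in your context window" means: each entry a server publishes is roughly a name, a description, and a JSON Schema for the inputs. The tool below is hypothetical (a made-up database query tool), but the shape is representative, and the agent reads every entry like this at session start, called or not.

```python
import json

# Hypothetical tool definition, shaped the way MCP servers describe tools:
# a name, a description, and a JSON Schema for the inputs. This sits in the
# context window for the whole session whether or not the agent ever calls it.
query_tool = {
    "name": "run_query",
    "description": "Run a read-only SQL query against the analytics database.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "The SELECT statement to run."},
            "row_limit": {"type": "integer", "description": "Maximum rows to return."},
        },
        "required": ["sql"],
    },
}

print(f"~{len(json.dumps(query_tool)) // 4} tokens for one tool")  # crude ~4 chars/token estimate
```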

Then there are skills and rules files. CLAUDE.md, SKILL.md, .cursorrules, system prompts, Engraph constraints - these tell the agent how to approach work, not what tools it has. "Use the gh CLI for GitHub operations" is a skill. "Never retry payment webhooks without idempotency keys" is a constraint. Neither executes anything. They shape how the agent thinks about what it executes.

CLI and MCP handle execution while skills and constraints handle knowledge. The token cost conversation has focused almost entirely on the execution layer, because that's where the visible waste is. Meanwhile the knowledge layer's own budget problems go largely unexamined, even as teams pile more rules and context into every session.

Your context window is a budget

The industry talks about context windows like storage - how many tokens fit. That framing is misleading. Context windows have an attention gradient. The model doesn't treat all tokens equally. Instructions at the beginning and end get more weight than what's in the middle. Load 100,000 tokens of tool schemas at the top and the rules you wrote in your CLAUDE.md get pushed into the zone where attention is weakest.

Every token in your context window is a prioritization decision. Not "does it fit" but "what does it displace." The 55,000 tokens the GitHub MCP server occupies are pushing your rules and conversation history into lower-attention positions where the model is more likely to ignore them.

Multi-step reasoning breaks down when this happens. Research on MCP token overhead has documented agents degrading after three or four tool calls because accumulated context pushes the agent toward the tail of its window where attention quality drops. The Chrome DevTools MCP alone burns roughly 17,000 tokens on definitions. Stack a few servers and you're starting every conversation with the agent equivalent of reading the phone book before getting to work.

This isn't a theoretical complaint for me - we hit this wall building Engraph. Early on we considered delivering constraints via an MCP tool that the agent would call to fetch rules. The problem was obvious in testing - by the time the agent called the tool, the constraints arrived in a part of the context where they competed with everything else for attention. I remember the specific moment it clicked: we had a constraint about never running database migrations without a rollback plan, and the agent blew right past it three sessions in a row. Same constraint, delivered via a SessionStart hook instead of an MCP tool call, got followed every time. Hooks that inject text at session start and at file-touch boundaries turned out to work better, not because they're cheaper (they're comparable in token cost) but because they land in higher-attention positions. That's an anecdote from one constraint across a handful of sessions, not a controlled study.

I should be upfront: I build a product (Engraph, more on the architecture below) based on the hypothesis that delivery position matters for constraint compliance, and our evidence for that hypothesis is thinner than our conviction. But it's why the framing matters to me personally and why I think the conversation about token cost needs to extend past tool schemas into the rules and constraints that actually shape what agents do with those tools.
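
For what the hook approach looks like mechanically, here's a sketch - an illustration, not Engraph's code. It assumes the Claude Code hook contract where text a SessionStart hook prints to stdout is added to the session context; the constraint file path is made up.

```python
#!/usr/bin/env python3
# Sketch of a SessionStart hook (hypothetical paths, not Engraph's implementation).
# Assumption: in Claude Code, text a SessionStart hook prints to stdout is added
# to the session context, so the constraints land near the top of the window
# instead of mid-conversation after a tool call.
from pathlib import Path

CONSTRAINTS = Path(".agent/constraints/project.md")  # hypothetical location

if CONSTRAINTS.exists():
    print(CONSTRAINTS.read_text())  # injected at session start, high-attention position
```

You'd register the script as a SessionStart hook command in your Claude Code settings; the exact configuration format is in Anthropic's hooks documentation.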

The ecosystem is already fixing the execution tax

The raw overhead problem is shrinking fast, and the fixes are already shipping.

Anthropic shipped two distinct mechanisms that address different parts of the problem. Deferred tool loading keeps tool descriptions compact until the agent actually needs the full schema, cutting per-tool overhead by roughly 85%. On top of that, Claude Code's MCP Tool Search adds lazy loading at the discovery layer - tools only fully materialize when the agent decides to use them, which Anthropic reports reduces total context usage by up to 95%. The first compresses individual schemas; the second prevents unused schemas from loading at all.

Outside Anthropic's ecosystem, mcp2cli takes a more radical approach: it converts MCP servers to CLI tools at runtime and bypasses the schema-in-context model altogether. It claims 96-99% token reduction and hit the front page of Hacker News earlier this month. Denis Yarats, Perplexity's CTO, publicly moved away from MCP toward APIs and CLIs at Ask 2026. Context overhead was his core complaint.

If you're running Claude Code and haven't enabled deferred tool loading, that's the single highest-leverage thing you can do today. But even with a 95% reduction in schema overhead, these solutions only address the execution layer - tool definitions getting leaner. The knowledge layer, how your agent decides what to do rather than what it can do, hasn't gotten any of this attention.

What your tokens buy

The obvious conclusion from "MCP costs tokens" is "minimize tokens" - use CLI instead, strip out MCP servers. That's half right, but it treats all tokens as equal waste to be eliminated, and they're not.

The GitHub MCP schema costs 55,000 tokens, and your agent already knows the gh CLI from training data. Those tokens buy a structured protocol layer to describe tools the model already knows how to use - overhead you can cut for most GitHub operations without losing anything. But the same protocol for a proprietary internal API that has no CLI and sits behind OAuth? The agent can't call it any other way. Those tokens are buying capability that doesn't exist without them, plus the type safety and contract precision that prevent the agent from calling it wrong.

The part of the budget nobody audits is the knowledge layer. A 200-token constraint that says "this service caches auth tokens in memory; you can't scale it horizontally without a session store" costs almost nothing. But it redirects a generation that would have looked correct, passed review, and broken in production under load. That spend isn't buying the agent more tools, it's making the tools it already has produce better output. This is the difference between tokens that describe capability and tokens that shape judgment - two completely different kinds of context spend that get lumped together when people treat every token as waste to be eliminated.

The practical version

You don't need to rearchitect anything - start by looking at what's actually in your context window.

If you're on Claude Code, check your MCP configuration. For each server, ask: does a mature CLI exist that the model already knows? If you're running the GitHub MCP but your agent can already use gh, you're spending tokens on capability you already had. Same for Docker and the major cloud CLIs. Disable it and see if anything breaks - you might be surprised at how little changes.
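
If you want to do that audit programmatically, here's a rough sketch. It assumes Claude Code's project-scoped .mcp.json with a top-level mcpServers map - adjust for wherever your client actually stores its MCP configuration - and the CLI-equivalents table is something you'd maintain by hand.

```python
# Inventory sketch: list configured MCP servers next to CLIs the model likely
# already knows from training data. Assumes a project-scoped .mcp.json with a
# top-level "mcpServers" map; adapt the path and shape to your client's config.
import json
from pathlib import Path

KNOWN_CLI_EQUIVALENTS = {  # hand-maintained; extend for your stack
    "github": "gh",
    "docker": "docker",
    "aws": "aws",
}

config = json.loads(Path(".mcp.json").read_text())
for name in config.get("mcpServers", {}):
    cli = KNOWN_CLI_EQUIVALENTS.get(name)
    if cli:
        print(f"{name}: the model likely already knows `{cli}` - try disabling this server")
    else:
        print(f"{name}: no obvious CLI equivalent - the schema tokens may be earning their keep")
```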

Keep MCP where the tokens buy something you can't get otherwise. If the tool has no CLI and the agent can't reach it any other way, the schema tokens are justified - and that includes anything requiring OAuth or dynamic auth across multiple users.

If you're building MCP servers, token cost is a design surface you control regardless of what runtime optimizations the client implements. Expose fewer, more composable tools rather than one tool per API endpoint. The GitHub MCP server's ninety-plus tools are the anti-pattern here - most of them are thin wrappers around REST endpoints that the agent could compose from a smaller set of primitives. A server with eight well-designed tools that combine through parameters costs a fraction of the context and the agent makes fewer mistakes when the option space is smaller.

Schema design matters too: tighter type definitions and fewer optional parameters reduce the per-tool token cost, as do descriptions that say what the tool does rather than documenting every edge case. There's a real tension here, because schema richness is what makes MCP worth using - cutting too aggressively loses the type contracts that help the agent call tools correctly. The heuristic that's worked for us is that schema richness earns its cost when the model has no prior knowledge of the tool. A proprietary API with non-obvious parameter semantics needs every type annotation you can give it. When the tool mirrors a well-known CLI or REST pattern the model already understands, leaner descriptions work fine because the model fills in the gaps from training data.
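
A hypothetical sketch of the difference (the tools and the 4-characters-per-token estimate are illustrative, not taken from any real server): one composable issues tool with an action parameter, versus one thin wrapper per endpoint.

```python
# Hypothetical comparison: one composable tool vs. one wrapper per REST endpoint.
# Token counts use a crude ~4 characters/token estimate - fine for relative comparison.
import json

composable = {
    "name": "issues",
    "description": "Create, update, comment on, or close an issue.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "comment", "close"]},
            "repo": {"type": "string"},
            "number": {"type": "integer"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["action", "repo"],
    },
}

def endpoint_wrapper(verb: str) -> dict:
    # One thin tool per endpoint - the anti-pattern described above.
    return {
        "name": f"{verb}_issue",
        "description": f"{verb.capitalize()} an issue in a repository.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string"},
                "number": {"type": "integer"},
                "title": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["repo"],
        },
    }

per_endpoint = [endpoint_wrapper(v) for v in ["create", "update", "comment", "close"]]

def estimate_tokens(obj) -> int:
    return len(json.dumps(obj)) // 4

print(f"composable: ~{estimate_tokens(composable)} tokens")
print(f"per-endpoint wrappers: ~{estimate_tokens(per_endpoint)} tokens")
```

The absolute numbers don't matter; the point is that the per-endpoint version repeats the same parameter descriptions once per verb, and that repetition gets paid on every session.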

For knowledge delivery, think about what's actually shaping your agent's decisions. A 5,000-line CLAUDE.md loaded at session start is the knowledge equivalent of a bloated MCP server - most of it irrelevant to what the agent is doing right now, all of it competing for attention. If you've ever wondered why your agent follows the first ten rules in your CLAUDE.md religiously and seems to forget rule forty-seven exists, the attention gradient is probably why. It's worth asking whether that 5,000-line file is earning its keep or just occupying prime real estate in your context window while the agent ignores most of it.

There's also a subtler budget problem that has nothing to do with size. The agent's system prompt occupies the most privileged position in your context window, and some of what it contains works against the rules you wrote. Users have documented how Claude Code's built-in output-efficiency directives ("lead with the answer, not the reasoning," "if you can say it in one sentence, don't use three") suppress the model's self-checking loop on the same turn it's supposed to be following your architectural constraints. The result is a specific inversion: the agent asks permission for trivial operations it could handle autonomously, because the rules for those are clear and positionally strong, while skipping the reasoning steps that would catch conflicts with your CLAUDE.md rules about how to build things. More caution where you don't need it, less where you do. That's not a token cost you can optimize away by writing a shorter rules file. It's a conflict between two sets of instructions that occupy different tiers of the attention hierarchy.

Where we sit in this

I build Engraph, a system that delivers organizational constraints to AI agents so teams can govern agent behavior across sessions and people. What Engraph does is directly relevant to this framing, and I'd rather explain our architecture than have someone infer it.

Engraph delivers organizational constraints via hooks - SessionStart injects project-level rules, and PreToolUse injects subsystem-specific constraints when the agent touches relevant files. That's text injection into the context window, and it costs tokens. Whether hook-injected constraints outperform MCP-delivered ones by enough to matter is something we can't yet quantify. The delivery is designed around the budget problem regardless. Constraints are ranked by relevance to the current session and scoped to subsystems so you don't get database rules when you're editing a blog post. There's a token budget that caps how much we inject. We also use MCP for the interactive parts of the workflow (looking up constraints, flagging corrections) because those are structured request/response interactions where MCP makes sense. We picked the delivery mechanism based on what each part of the system actually needed.
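
For concreteness, here's a minimal sketch of what subsystem scoping under a token budget can look like. This is an illustration of the pattern, not Engraph's implementation; the hook payload fields (tool_input carrying a file_path) reflect my reading of the Claude Code hooks contract, and how the selected text actually reaches the agent depends on that contract.

```python
#!/usr/bin/env python3
# Illustration of subsystem-scoped constraint injection under a token budget.
# Not Engraph's code; the stdin payload shape (tool_input.file_path) is an
# assumption about the client's PreToolUse hook contract.
import json
import sys

TOKEN_BUDGET = 800  # hard cap on injected constraint tokens (illustrative)

# Hypothetical subsystem map: path prefix -> constraints, most important first.
SUBSYSTEM_CONSTRAINTS = {
    "services/auth/": [
        "This service caches auth tokens in memory; it cannot scale horizontally without a session store.",
    ],
    "db/migrations/": [
        "Never run a database migration without a rollback plan.",
    ],
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token estimate

event = json.load(sys.stdin)  # hook payload from the client
file_path = event.get("tool_input", {}).get("file_path", "")

selected, spent = [], 0
for prefix, rules in SUBSYSTEM_CONSTRAINTS.items():
    if not file_path.startswith(prefix):
        continue  # don't send database rules when the agent is editing a blog post
    for rule in rules:
        cost = estimate_tokens(rule)
        if spent + cost > TOKEN_BUDGET:
            break  # stay under the injection budget
        selected.append(rule)
        spent += cost

if selected:
    # How this reaches the agent depends on the hook contract; stdout is a placeholder.
    print("Constraints for this change:\n- " + "\n- ".join(selected))
```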

The bet is that a constraint costing 50 tokens to deliver prevents a generation that would have cost you hours to find and fix in review. We measure correction rates dropping as constraint sets grow - but that requires constraints to exist first. A team starting from zero has no constraints to deliver, and the system's value is exactly zero until someone captures the first correction or imports rules from an existing CLAUDE.md. The budget framing only matters once you have something worth budgeting for. Whether the correction rates hold outside our own codebase, we don't know yet, and we'll publish the numbers either way.

What changes the math

Context windows might get big enough that the budget framing stops mattering. If we get to two million tokens with flat attention and no degradation in the middle, then 55,000 tokens of MCP schemas is a rounding error. Load everything. But bigger windows haven't meant better attention so far, and betting on architecture changes to solve a problem you can address today is a familiar kind of procrastination.

The ecosystem fixes I described above might also close the gap entirely. If progressive discovery means MCP schemas only cost tokens when they're actually used, the CLI advantage narrows to training-data familiarity - whether the model already knows the tool well enough to skip the schema. At that point the debate shifts from "how much does your tool cost to describe" to "how well does the model know this tool already." I find that a more interesting question than the one we're having now, and I suspect we'll get there faster than most people expect.


Related: Your agents don't read the middle examines the attention gradient problem in detail. The context engineering blind spot looks at what happens when authorised rules go stale in the context window.