Why Code-First Agents Beat MCP by 98.7%

Scott Farrell · November 16, 2025

Anthropic’s 98.7% Confession: Why Code Execution Beats MCP

The company that created the Model Context Protocol just published research showing their own protocol wastes 98.7% of tokens compared to letting agents write code.

Published November 2025 · 12 min read


If you’ve built production AI agents, you’ve probably felt this pain: your agent works beautifully with 5 tools, then you add 15 more and suddenly it’s slow, confused, and burning through your token budget. You assume it’s your implementation.

It’s not. It’s the architecture.

The Admission That Changed Everything

In November 2025, Anthropic published “Code execution with MCP,” a blog post that quietly undermines their widely adopted Model Context Protocol. The numbers aren’t just bad—they’re devastating:

150,000 → 2,000 tokens (a 98.7% reduction)

That’s not incremental improvement. That’s a fundamental mismatch between interface and capability.

Here’s what’s happening under the hood: MCP loads all tool definitions into the context window upfront. Connect to a dozen popular MCP servers and you’ll burn 50,000-66,000 tokens before your agent even sees the user’s question.

The GitHub Example

The GitHub MCP server alone defines 93 tools consuming 55,000 tokens. That’s nearly a third of Claude’s 200k context window gone before any real work begins.
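
To make that overhead concrete, here is a sketch of the shape of a single MCP-style tool definition: a name, a prose description, and a JSON Schema for the inputs. The tool shown is illustrative, not copied from the GitHub server:

```python
# Illustrative sketch of a single MCP-style tool definition: a name, a prose
# description, and a JSON Schema for the inputs. Not copied from any real server.
create_issue_tool = {
    "name": "create_issue",
    "description": (
        "Create a new issue in a GitHub repository. Requires the repository "
        "owner and name plus a title; optionally accepts a body, labels, "
        "assignees, and a milestone. Returns the created issue number and URL."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "title": {"type": "string", "description": "Issue title"},
            "body": {"type": "string", "description": "Issue body in Markdown"},
            "labels": {"type": "array", "items": {"type": "string"}},
            "assignees": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["owner", "repo", "title"],
    },
}

# One definition like this runs to a few hundred tokens. Multiply by ~93 tools
# and the 55,000-token footprint quoted above stops looking surprising.
```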

But it gets worse. Every intermediate result flows through the model’s context. When your agent reads a document and passes it to another tool, that entire document gets loaded into context, processed, then loaded again for the next step. For a 2-hour meeting transcript, that’s 50,000+ additional tokens just shuttling data back and forth.

“Tool descriptions occupy more context window space, increasing response time and costs. In cases where agents are connected to thousands of tools, they’ll need to process hundreds of thousands of tokens before reading a request.”

— Anthropic Engineering Blog

Independent Validation

This isn’t just Anthropic noticing the problem. Cloudflare independently reached the same conclusion and built “Code Mode” around it:

“We found agents are able to handle many more tools, and more complex tools, when those tools are presented as a TypeScript API rather than directly.”

— Cloudflare Blog

When two engineering organizations with completely different tech stacks arrive at identical conclusions independently, you’re looking at a fundamental truth, not a lucky optimization.

Why This Matters

This is engineering self-correction done right. Anthropic shipped MCP, gathered data, identified limitations, and published honest findings. That deserves credit. But it also means teams betting on MCP as their primary agent architecture are building on a foundation that even its creators now recommend working around.

The Training Distribution Gap: Why Code Wins

The reason code execution outperforms tool-calling isn’t some clever optimization trick. It’s fundamental to how large language models actually work.

LLMs are pattern-matching engines trained on code syntax, not tool schemas.

During pre-training, models see:

  • 2.5 trillion tokens of real-world code from millions of open-source projects (GitHub, Stack Overflow, documentation)
  • ~60,000 synthetic tool-calling examples constructed specifically for training

That’s not a 10x difference. It’s not even a 100x difference.

It’s a 40,000x difference in exposure.

“Perhaps this is because LLMs have an enormous amount of real-world TypeScript in their training set, but only a small set of contrived examples of tool calls. LLMs have seen a lot of code. They have not seen a lot of ‘tool calls’.”

— Cloudflare Blog

Cloudflare’s analogy is perfect: asking an LLM to use tool-calling syntax is like “putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.”

The Two Expensive Patterns MCP Forces

Pattern 1: Tool definitions overload the context window

Most MCP clients load all tool definitions upfront. With 40+ tools, you’re looking at 50,000+ tokens in the context window before the agent starts reasoning. That’s not background noise—it’s context pollution that actively degrades reasoning quality.

Real-world impact:

  • Cursor’s practical limit: ~40 tools before performance degradation
  • Claude Code users reporting 66,000+ tokens consumed before conversation starts
  • GitHub’s MCP server: 93 tools = 55,000 tokens

Pattern 2: Intermediate results consume additional tokens

Every tool call result flows through the context. Agent reads a 10,000-row spreadsheet? Those rows are now in context. Agent filters to 50 relevant rows? Both the original 10,000 rows and the filtered 50 are in context. Multi-step workflows compound this token bloat at every step.
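
A back-of-the-envelope calculation shows how quickly this adds up. The per-row figure below is an assumption chosen only to illustrate the shape of the problem, not a measurement:

```python
# Back-of-the-envelope token accounting for Pattern 2.
# Assumption: roughly 20 tokens per spreadsheet row (illustrative, not measured).
TOKENS_PER_ROW = 20

rows_read = 10_000
rows_after_filter = 50

# Tool-calling: the read result and the filtered result both land in context.
tool_calling_tokens = (rows_read + rows_after_filter) * TOKENS_PER_ROW

# Code execution: filtering happens inside the sandbox; only the 50-row
# result is returned to the model.
code_execution_tokens = rows_after_filter * TOKENS_PER_ROW

print(f"tool calling:   ~{tool_calling_tokens:,} tokens")    # ~201,000
print(f"code execution: ~{code_execution_tokens:,} tokens")  # ~1,000
```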

How Code Execution Flips This

Code execution addresses both problems:

  • Progressive disclosure: Agent explores the filesystem to find tools as needed, not all upfront. Load only what you use.
  • Data stays in sandbox: Process 10,000 rows in memory, return only the 5-row summary to the model.
  • Deterministic execution: Code runs the same way every time; no probabilistic tool-call generation that varies between runs.
  • Familiar debugging tools: Logs, breakpoints, unit tests—everything developers already know.
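
Here is a minimal sketch of the first two points in the list above. The tools/ directory layout, the helper names, and the bookings.csv columns are all hypothetical; the point is that discovery happens on demand and only the summary dict ever reaches the model:

```python
# Minimal sketch of progressive disclosure plus keep-data-in-the-sandbox.
# The tools/ layout, helper names, and CSV columns here are hypothetical.
import csv
import importlib.util
import os

def list_tools(tools_dir="tools"):
    """The agent discovers tools by exploring the filesystem instead of
    loading every definition into its context upfront."""
    return [f[:-3] for f in os.listdir(tools_dir) if f.endswith(".py")]

def load_tool(name, tools_dir="tools"):
    """Import a single tool module only when it is actually needed."""
    spec = importlib.util.spec_from_file_location(
        name, os.path.join(tools_dir, f"{name}.py"))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def analyze_bookings(path):
    """All 10,000 rows stay in the sandbox; only a tiny summary is returned."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    missing_unit = [r for r in rows if not (r.get("unit") or "").strip()]
    return {
        "total_rows": len(rows),
        "missing_unit": len(missing_unit),
        "sample_ids": [r["appointment_id"] for r in missing_unit[:5]],
    }

# The model sees only this dict (a few dozen tokens), never the raw CSV.
print(analyze_bookings("bookings.csv"))
```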

Real-World Example: The Unit Number Investigation

Let me show you what this looks like in practice.

I recently debugged a production issue in a booking system: unit numbers were missing from appointments, leaving field technicians at 100-unit apartment buildings with no way to find the customer. The unit data was entered by users but disappeared somewhere in the pipeline.

The Investigation

Instead of using MCP, I let the agent write code:

  • log_analyzer.py — Parsed booking logs into structured transaction objects, grouped by appointment ID
  • backup_tracer.py — Queried CSV backups to trace unit values through the CRM pipeline (Web_History → Contact → Appointment → WorkOrder)
  • unit_tracker.sh — Pulled specific appointment timelines from server logs to correlate user input with database state
  • address_field_audit.py — Compared all address-related fields across different CRM objects to identify inconsistencies
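
For flavor, here is roughly what the core of the first script can look like. The log format, field names, and regular expression are simplified, hypothetical reconstructions rather than the production code:

```python
# log_analyzer.py (simplified, hypothetical reconstruction): parse booking logs
# into structured transaction records, grouped by appointment ID.
import json
import re
from collections import defaultdict

# Assumed line shape: "<date> <time> <level> booking.<event> {json payload}"
LINE_RE = re.compile(r"^(?P<ts>\S+ \S+) \S+ booking\.(?P<event>\w+) (?P<payload>\{.*\})$")

def parse_log(path):
    transactions = defaultdict(list)
    with open(path, errors="replace") as f:
        for line in f:
            match = LINE_RE.match(line.strip())
            if not match:
                continue
            try:
                payload = json.loads(match.group("payload"))
            except json.JSONDecodeError:
                continue
            appt_id = payload.get("appointment_id")
            if appt_id:
                transactions[appt_id].append({
                    "timestamp": match.group("ts"),
                    "event": match.group("event"),
                    "unit": payload.get("unit"),
                    "address": payload.get("address"),
                })
    return transactions

if __name__ == "__main__":
    txns = parse_log("booking.log")
    # Only a compact summary leaves the sandbox, never the gigabytes of raw logs.
    with_unit = sum(1 for events in txns.values() if any(e["unit"] for e in events))
    print(f"{len(txns)} appointments parsed; {with_unit} had a unit value at some point")
```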

The Process

The agent processed gigabytes of logs and backup files in the execution sandbox. It discovered:

  1. The autocomplete widget stripped unit numbers before submission (known limitation)
  2. An optional unit field existed on the final booking screen, but 60% of users skipped it
  3. When users DID enter units, 94% propagated correctly through the pipeline
  4. A small bug: over-aggressive string sanitization that stripped hyphens, corrupting valid unit formats like “17/43-51” → “17/4351”

The agent returned a 2-page analysis with root cause, affected records count, and suggested fixes.

Context usage? Minimal. Just the summary and key findings.

The MCP Alternative (Nightmare Scenario)

If I’d used MCP’s tool-calling approach, I’d need tools like:

  • connectToServer, readLogFile, parseLogLine, filterByDate, filterByAppointmentID
  • connectToBackup, listBackupFiles, extractCSV, searchCSV, getRowsByID
  • correlateData, analyzePattern, identifyAnomalies, generateReport

Each of those tools would dump its results into context:

  • 50MB log file → loaded into context → passed to parser → loaded again
  • 10,000-row CSV → loaded into context → filtered → both original and filtered versions in context
  • Correlation results → loaded into context → analyzed → both raw and analyzed data in context

The investigation would be:

  • Slower: Each tool call requires a full model inference pass
  • More expensive: 100,000+ tokens for intermediate data vs 2,000 tokens for the summary
  • More fragile: Likely to hit context window limits halfway through, requiring manual intervention

“When agents use code execution with MCP, intermediate results stay in the execution environment by default. This way, the agent only sees what you explicitly log or return, meaning data you don’t wish to share with the model can flow through your workflow without ever entering the model’s context.”

— Anthropic Engineering Blog

Production Infrastructure Is Ready

The most common objection I hear: “Isn’t code execution risky? What about security? What if the agent writes malicious code?”

Fair questions. Here’s the reality:

Sandboxing is production-ready and, ironically, often more secure than MCP, which ships with no built-in authentication.

  • E2B: Firecracker microVM isolation, ~150ms cold start. Best for AI code execution with 24-hour persistence.
  • Daytona: OCI containers, 90-200ms cold start. Best for enterprise dev environments.
  • Modal: gVisor containers, sub-second cold start. Best for ML/AI workloads at scale.
  • Cloudflare Workers: V8 isolates, millisecond cold starts. Best for disposable execution at the lowest cost.
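
As one concrete example, spinning up a disposable sandbox is a few lines of code. This sketch assumes the e2b-code-interpreter Python SDK and an E2B_API_KEY in the environment; the method names follow the SDK as I recall it, so verify them against E2B's current docs:

```python
# Minimal sketch, assuming the e2b-code-interpreter Python SDK
# (pip install e2b-code-interpreter) and E2B_API_KEY set in the environment.
# Method names are from memory; check E2B's current documentation.
from e2b_code_interpreter import Sandbox

# Code the agent generated; everything heavy happens inside the sandbox.
agent_generated_code = """
rows = [{'id': i, 'unit': '' if i % 3 else str(i)} for i in range(10_000)]
missing = [r for r in rows if not r['unit']]
print(f"{len(missing)} of {len(rows)} synthetic bookings are missing a unit")
"""

sandbox = Sandbox()                      # fresh, isolated microVM per run
try:
    execution = sandbox.run_code(agent_generated_code)
    print(execution.logs)                # only the printed summary comes back
finally:
    sandbox.kill()                       # dispose of the sandbox when done
```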

Security Is Well-Understood

NVIDIA’s AI red team is explicit about this:

“Execution isolation is mandatory for AI-driven code execution. Sandboxing each execution instance limits the blast radius of malicious or unintended code. This control shifts security from reactive patching to proactive containment.”

— NVIDIA Developer Blog

The defense-in-depth layers include:

  • Filesystem isolation: Sandbox can only access explicitly mounted directories
  • Network restrictions: No internet access unless explicitly granted via bindings
  • Resource limits: CPU, memory, and execution time caps prevent runaway processes
  • API key hiding: Credentials stay in the orchestration layer, never exposed to generated code
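
To make a couple of those layers tangible, here is a toy, standard-library-only illustration of resource caps and credential hiding. It is emphatically not a real sandbox; production setups rely on the microVM, gVisor, or V8-isolate products listed above:

```python
# Toy illustration of resource limits and API-key hiding using only the Python
# standard library (POSIX only). NOT a real sandbox: production systems use
# microVMs, gVisor, or V8 isolates as described above.
import resource
import subprocess
import sys

def limit_resources():
    # Runs in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                     # 5s of CPU
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MB

def run_untrusted(code: str) -> str:
    completed = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site
        capture_output=True,
        text=True,
        timeout=10,                          # wall-clock cap
        env={},                              # credentials never reach generated code
        cwd="/tmp",                          # only an explicitly chosen directory
        preexec_fn=limit_resources,          # CPU and memory caps
    )
    return completed.stdout[:4_000]          # cap what flows back into the model

print(run_untrusted("print(sum(range(1_000_000)))"))
```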

Meanwhile, MCP has no built-in OAuth or authentication mechanism. Teams build custom auth layers anyway. Sandboxing is well-understood infrastructure, not a novel risk.

Production Validation

Docker partnered with E2B to provide “fast, secure access to hundreds of real-world tools, without sacrificing safety or speed.” Google Cloud detailed their Agent Factory security architecture using gVisor sandboxing on Cloud Run. This isn’t experimental—it’s production infrastructure.

When MCP Still Makes Sense

Let’s be clear about something: this isn’t “MCP is bad and should be abandoned.”

MCP has valid use cases—they’re just narrower than the industry initially assumed:

Enterprise Multi-Agent Orchestration

Coordinating 10+ agents from different vendors where:

  • Standardization enables governance and compliance
  • Vendor neutrality is a hard requirement
  • Performance is less critical than auditability

This is a real scenario. It’s just not what most teams are building.

MCP as Backing Protocol

MCP can handle connections, authentication, and discovery while agents write code as the primary interface. This division of labor makes sense:

  • MCP: Transport layer, auth, tool registry
  • Code execution: Primary interface for the agent

You get standardization benefits where they matter (infrastructure) without the performance tax (agent operations).
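
Here is a sketch of what that division of labor can look like, loosely following the pattern Anthropic describes of presenting tools as files on disk. The wrapper generator, the mcp_bridge module it emits calls to, and the tool dict shape are all hypothetical stand-ins for your own MCP client:

```python
# Hypothetical sketch of "MCP as backing protocol, code as interface": every
# tool the MCP layer discovers becomes a small importable module, so the agent
# loads only the wrappers it actually needs instead of all definitions upfront.
import os
import textwrap

def generate_wrappers(server: str, tools: list[dict], out_dir: str = "servers") -> None:
    """`tools` is whatever your MCP client returns from tool discovery,
    assumed here to be dicts with at least a name and a description."""
    os.makedirs(os.path.join(out_dir, server), exist_ok=True)
    for tool in tools:
        name, description = tool["name"], tool["description"]
        with open(os.path.join(out_dir, server, f"{name}.py"), "w") as f:
            f.write(textwrap.dedent(f'''\
                """{description}"""

                def run(**kwargs):
                    # Hypothetical bridge: transport and auth stay on the MCP side.
                    from mcp_bridge import call_tool
                    return call_tool("{server}", "{name}", kwargs)
                '''))

# The agent explores servers/github/, reads only the docstrings it needs, then
# writes ordinary code such as:
#   from servers.github.create_issue import run as create_issue
#   create_issue(owner="acme", repo="api", title="Unit numbers dropped in booking flow")
generate_wrappers("github", [
    {"name": "create_issue", "description": "Create an issue in a repository."},
])
```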

The 80/20 Reality

But for most teams—80%+ of use cases—the answer is clear. Code execution is the default for:

  • Single-team agents for specific workflows
  • Developer productivity tools
  • Production-scale systems where performance matters
  • Data processing and analysis tasks
  • Complex multi-step operations

The New Default

Code execution is the default. MCP is the special case.

What This Means For Your Team

If you’re building agents right now, here’s your action plan:

Stop Fighting MCP Limitations

Work around them instead:

  • Let agents write code (Python, TypeScript, JavaScript—languages with rich training data)
  • Use production sandboxes (E2B, Daytona, Modal, Cloudflare Workers)
  • Load tools progressively (filesystem exploration, search functions, on-demand imports)
  • Keep data in the execution environment, return compact summaries
  • Use familiar debugging tools (logs, breakpoints, unit tests)

If You’re Already Invested in MCP

You didn’t waste your time. You learned what works and what doesn’t—that’s valuable data. And the good news: code execution can coexist with MCP.

Migration path:

  1. Keep MCP for transport and authentication
  2. Introduce code execution for one high-value workflow
  3. Measure the difference (tokens, latency, quality)
  4. Gradually migrate more workflows based on results
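
A rough way to instrument step 3 (a sketch, not a framework): the two run_* workflow functions are placeholders for your own MCP-based and code-execution pilots, and tiktoken's cl100k_base encoding is just a convenient stand-in for whatever tokenizer matches your model:

```python
# Rough instrumentation sketch for step 3 of the migration path.
# pip install tiktoken; the run_* functions referenced below are yours to supply.
import time

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer; use your model's

def count_tokens(strings: list[str]) -> int:
    return sum(len(ENC.encode(s)) for s in strings)

def measure(label: str, run_workflow) -> None:
    """run_workflow() should return (context_strings, answer), where
    context_strings is every string that entered the model's context:
    prompts, tool definitions, tool results, code output."""
    start = time.perf_counter()
    context_strings, answer = run_workflow()
    elapsed = time.perf_counter() - start
    print(f"{label:>16}: {count_tokens(context_strings):>8,} tokens, "
          f"{elapsed:6.1f}s, answer length {len(answer)}")

# measure("MCP tools", run_with_mcp)               # your existing workflow
# measure("code execution", run_with_code_exec)    # the pilot workflow
```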

Progressive migration, not a rewrite.

Rethink “It’s The Standard” As A Decision Heuristic

The creators of the standard are telling you to work around it. That’s not failure—it’s honest engineering. But it is permission to choose performance over standardization.

Next time someone says “we should use MCP because it’s the standard,” ask: What problem are we solving—interoperability or performance?

For most teams, the answer is performance.

The Deeper Insight

This isn’t just about MCP. It’s about a more fundamental principle in AI engineering:

Design interfaces that match model capabilities.

The best interface for an LLM is the one closest to its training distribution. Code syntax beats formal schemas because models have seen vastly more code during pre-training.

This principle extends beyond agents:

  • Prompt engineering: Natural language descriptions > rigid templates (more training data)
  • API design for agents: REST-like endpoints with JSON > bespoke protocols (familiar patterns)
  • Error messages: Natural language explanations > error codes (easier to reason about)
  • Documentation: Code comments and examples > formal specifications

Could MCP Improve?

Absolutely. Progressive discovery, better caching, and protocol optimizations will help.

But the fundamental challenge—models are trained on code, not tool schemas—isn’t solvable through protocol tweaks. Even an optimized MCP will underperform code execution because the interface doesn’t align with model capabilities.

As models are trained on more tool-calling examples, the gap might narrow. But it won’t disappear until tool-calling training data approaches the scale of code training data. We’re talking orders of magnitude difference.

The Healthy Future

The future might be MCP as backing infrastructure with code as the primary interface. That’s actually a healthy division of labor:

  • MCP handles standardization, discovery, authentication
  • Code execution handles performance, flexibility, developer experience

Both can coexist. Both can improve. The question is which one you optimize for which use case.

Conclusion: Choose The Right Tool

Anthropic’s research doesn’t invalidate MCP. It clarifies its place in the ecosystem.

MCP is valuable for:

  • Enterprise orchestration of heterogeneous agents
  • Standardized tool discovery and registration
  • Vendor-neutral integration layers

Code execution is superior for:

  • Developer productivity (most teams)
  • Production performance (speed, cost, reliability)
  • Complex data processing and analysis
  • Multi-step workflows with intermediate results

The 98.7% token reduction isn’t a fluke. It’s evidence of fundamental alignment between interface and capability.

Next time you’re architecting an agent system, don’t default to “the standard.” Ask: What problem are we solving?

If the answer is performance, the data is clear: code wins.

Not because it’s trendy. Because the evidence proves it.

What’s your experience with MCP vs code execution?

Have you hit the context bloat wall? Found creative workarounds? Built production systems with either approach?

Share your story in the comments. The best way to advance the field is to share what actually works in production, not just what works in demos.

