Engineering Guide Series

Why Code-First Agents
Beat MCP by 98.7%

An Engineering Guide to Moving Beyond MCP

Author: Scott Farrell

Target Audience: Senior Engineers, Engineering Leads, Technical Architects

What You'll Learn

✓ Why Anthropic's research shows code execution reduces token usage by 98.7%
✓ How training distribution bias explains the performance gap between code and tool-calling
✓ Production-ready patterns for secure code execution with sandboxing
✓ When to use MCP vs code-first approaches in real-world systems
✓ Concrete implementation roadmap from architectural decision to production

The Confession

When Protocol Creators Challenge Their Own Work

TL;DR

• Anthropic published research showing code execution reduces token usage by 98.7% compared to their Model Context Protocol (150,000 → 2,000 tokens)
• Cloudflare independently validated the same pattern with "Code Mode" in production at scale
• This is exemplary engineering self-correction—evidence over ego—not a failure
• Two structural problems: tool definitions overload context window, and intermediate results accumulate exponentially

The Rare Moment of Engineering Honesty

In November 2024, Anthropic published a research article titled "Code execution with MCP: building more efficient AI agents." On the surface, this appears to be a routine engineering update. In reality, it represents something far more significant: a protocol creator publicly acknowledging that their widely-adopted standard doesn't scale as well as a different approach.

This isn't a minor optimization or implementation detail. It's a fundamental architecture reassessment of the Model Context Protocol (MCP), the very standard that Microsoft, IBM, and Windows AI Foundry have backed as the "USB-C for AI applications." The same protocol that saw thousands of community-built MCP servers emerge within months of its November 2024 launch. The same protocol that developers adopted because it was positioned as the industry solution to agent-tool integration fragmentation.

What makes this moment rare in the AI industry is Anthropic's intellectual honesty. They could have quietly pivoted their internal implementations without admitting the architectural limitations. Instead, they published transparent research with quantified findings, giving the engineering community permission to question whether "the standard" is actually the right choice for their use case.

The Numbers: 98.7% Token Reduction

The core finding is stark and quantified. For a typical multi-tool agent workflow:

150,000

tokens (MCP approach)

Tool definitions + intermediate results flowing through context

2,000

tokens (code execution)

Only essential summaries returned to model

98.7%

Waste eliminated through code execution

This isn't an incremental improvement—it's a fundamental efficiency gap. The reduction of 148,000 tokens translates directly into:

$ Token costs: If you're spending $10,000/month on agent operations, approximately $9,870 is structural waste that code execution eliminates
⚡ Latency: Processing 150,000 tokens versus 2,000 tokens represents roughly 75x difference in time-to-first-token
🎯 Quality: Less context bloat means more room for actual reasoning—agents become more accurate, not less, as capabilities expand
📈 Scale: MCP breaks down at 40-50 tools; code execution scales to hundreds of available functions through progressive discovery

Token Math Breakdown

MCP Approach (150,000 tokens):

• System prompt: 5,000 tokens
• Tool definitions (30 tools avg): 24,000 tokens
• User query: 500 tokens
• Assistant response 1: 300 tokens
• Tool call 1 result (document): 8,000 tokens
• Assistant response 2: 400 tokens
• Tool call 2 result (large dataset): 110,000 tokens
• Final response: 1,800 tokens

Code Execution Approach (2,000 tokens):

• System prompt: 1,200 tokens
• User query: 500 tokens
• Generated code: 150 tokens
• Execution result summary: 100 tokens
• Final response: 50 tokens

Independent Validation: Cloudflare Code Mode

What elevates this from "interesting research" to "engineering consensus" is that Cloudflare reached the identical conclusion independently. Their "Code Mode" implementation validates that this isn't a theoretical optimization—it's a production-proven pattern at scale.

"It turns out we've all been using MCP wrong. Most agents today use MCP by directly exposing the 'tools' to the LLM. We tried something different: Convert the MCP tools into a TypeScript API, and then ask an LLM to write code that calls that API."

— Cloudflare Engineering Blog

Cloudflare's implementation demonstrates several crucial points that validate Anthropic's research:

Production Scale

Not a lab experiment—actually deployed in Cloudflare's infrastructure serving real workloads. The isolate-based architecture achieves millisecond cold starts (vs 90-200ms for containers), proving the pattern works at global scale.

Independent Discovery

Cloudflare wasn't implementing Anthropic's recommendation—they discovered the same solution independently. When two world-class engineering teams arrive at identical conclusions through different paths, it's signal, not noise.

Cost Validation

Cloudflare's isolate-based approach is "significantly lower cost than container-based solutions." The efficiency gains aren't just theoretical—they translate to measurable infrastructure savings at enterprise scale.

The convergence is captured in Cloudflare's succinct conclusion:

"In short, LLMs are better at writing code to call MCP, than at calling MCP directly."

— Cloudflare Engineering Blog

What This Isn't: Reframing as Engineering Self-Correction

Before we proceed to the technical mechanisms, it's crucial to frame what Anthropic's research actually represents—and what it doesn't.

This is not a failure. It's exemplary engineering culture in action.

This pattern—ship, gather data, identify limitations, iterate publicly—is exactly how good engineering organizations operate. Evidence trumps ego. The willingness to publish research that undermines your own protocol is a strength, not a weakness.

It's also historically normal. Early technology standards routinely hit performance walls, prompting architectural evolution:

→ Model-T to assembly line: The car didn't fail; the manufacturing process evolved to meet scale demands
→ Web 1.0 to Web 2.0: Static HTML gave way to dynamic, interactive architectures as use cases expanded
→ REST to GraphQL: RESTful APIs proved inefficient for complex data graphs; the ecosystem adapted
→ MCP to code execution: Tool-calling protocols meet their scale ceiling; code interfaces prove superior

"Although many of these problems here feel novel—context management, tool composition, and state persistence—they have known solutions from software engineering. Code execution applies these established patterns to agents."

— Anthropic Engineering Blog

The Two Problems Anthropic Identified

Anthropic's research pinpoints two structural issues that cause MCP's performance degradation. Understanding these mechanisms is essential before evaluating the code-execution alternative.

Problem 1: Tool Definitions Overload the Context Window

Most MCP client implementations load all available tool definitions directly into the model's context window before the agent even sees the user's query. This isn't a bug—it's how MCP is designed to work. The model needs to know what tools exist in order to plan its actions.

The problem emerges at scale. Real-world measurements show severe context consumption:

Each tool definition contains structured metadata that the model must process:

Example Tool Definition Schema (JSON)

{ "name": "gdrive.getDocument", "description": "Retrieves a document from Google Drive", "parameters": { "documentId": { "type": "string", "required": true, "description": "The ID of the document to retrieve" }, "fields": { "type": "string", "required": false, "description": "Specific fields to return" } }, "returns": "Document object with title, body content, metadata, permissions, etc." }

This single tool definition consumes approximately 200 tokens. Now multiply across realistic agent deployments: a CRM integration (15 tools), cloud storage (8 tools), communication platforms (12 tools), databases (10 tools), analytics (6 tools). You've consumed 10,200-51,000 tokens before the agent begins reasoning about the user's actual request.

The consequence: agents become dumber as you add capabilities, because the context window that should be used for reasoning is instead filled with tool catalogs.

Problem 2: Intermediate Results Consume Additional Tokens Exponentially

The second structural issue is more insidious. Every intermediate result from a tool call must flow through the model's context to reach the next operation. This creates exponential token accumulation as workflows deepen.

Token Accumulation Through Multi-Step Workflow

Step 1: Retrieve Google Drive document

• Context at start: 30,000 tokens (system + tools + query)
• Document content returned: 15,000 tokens
• Context now: 45,000 tokens total

Step 2: Update Salesforce record with document

• Context carried forward: 45,000 tokens (includes full document)
• Agent must copy 15,000-token document into tool parameters
• Salesforce confirmation: 2,000 tokens
• Context now: 47,000 tokens (but processed 62,000 effective tokens)

Step 3: Send Slack notification summary

• Context carried forward: 47,000 tokens (still includes full document)
• Notification sent: 500 tokens
• Context now: 47,500 tokens

Cumulative Analysis:

• Total tokens processed across calls: ~155,000
• Actual useful data transferred: ~17,500 tokens
• Overhead waste: ~137,500 tokens (88%)

The problem compounds further when large datasets are involved. Anthropic's research provides a telling example: retrieving a 2-hour meeting transcript causes the full transcript content to flow through the model's context twice—once when retrieved, again when copied into the next tool call. That single document can add 50,000 tokens of redundant processing.

Industry Implications

Anthropic's research and Cloudflare's validation create a permission structure that ripples across multiple organizational levels.

Impact by Stakeholder

Individual Developers

• Permission to question "industry standard" when it underperforms
• Quantified evidence to defend code-first architectural decisions
• Framework for evaluating agent patterns: performance vs standardization

Engineering Teams

• Validation that MCP issues are structural, not implementation bugs
• Alternative pattern backed by primary research (Anthropic, Cloudflare)
• Can refocus from fighting infrastructure to shipping features

The Industry

• Shift from "MCP as default" to "code execution as default, MCP where appropriate"
• Tool vendors may pivot to code-first SDKs alongside MCP servers
• Honest discourse about performance vs ecosystem tradeoffs

What's Coming in This Ebook

This opening chapter established the foundation: Anthropic's admission, Cloudflare's validation, and the two structural problems driving MCP's performance ceiling. The remaining chapters build from this base to provide actionable engineering guidance.

Chapter 2: The Bloat Problem

Deep-dive on token accumulation mechanics, context window visualization, and why agents degrade as tool counts increase

Chapter 3: Training Distribution Bias

Why code syntax outperforms tool schemas: LLM training corpus analysis, pattern-matching fundamentals, Shakespeare-in-Mandarin analogy

Chapter 4: Code-First Architecture

The alternative pattern Anthropic and Cloudflare recommend: progressive disclosure, signal extraction, sandbox execution, state persistence

Chapter 5: Production Sandboxing

Making code execution safe at scale: E2B, Daytona, Modal, Cloudflare Workers comparison, defense-in-depth strategies, security boundaries

Chapter 6: Case Study—Unit Number Investigation

Real-world application: booking system debugging, autonomous tool creation, log/backup analysis, signal extraction without context bloat

Chapter 7: The MCP Sweet Spot

When standards win: automated testing with Playwright, sub-agent orchestration, MCP as transport layer, nuanced positioning

Chapter 8: Implementation Roadmap

From architectural decision to production: migration paths, team buy-in strategies, measurement frameworks, iteration cycles

Chapter Summary

Anthropic published research showing 98.7% token reduction (150,000 → 2,000 tokens) using code execution versus MCP tool-calling. Cloudflare independently validated the pattern with production "Code Mode" deployment. This represents engineering self-correction, not failure—evidence over ego.

Two structural problems drive MCP's limitations: (1) tool definitions overload the context window before agents begin reasoning, and (2) intermediate results accumulate exponentially through multi-step workflows. The industry must shift from "standards at any cost" to "what actually works in production."

"Code execution with MCP improves context efficiency by loading tools on demand, filtering data before it reaches the model, and executing complex logic in a single step."

— Anthropic Engineering Blog

Next: Understanding the mechanism—why agents get dumber as you add tools.

The Bloat Problem: Why Agents Get Dumber as You Add Tools

You followed the documentation. Connected 30 tools. Suddenly your agent takes 15 seconds to respond and gives worse answers than when it had 5 tools. You thought you were doing something wrong.

You weren't. The architecture was.

The assumed causes were all wrong. "We're not writing tool descriptions well enough." "Our prompts need better few-shot examples." "Maybe we need a bigger context window." "Probably need a more powerful model."

The actual cause: structural token bloat in MCP architecture. Not an implementation bug—a fundamental design characteristic.

"Tool descriptions occupy more context window space, increasing response time and costs. In cases where agents are connected to thousands of tools, they'll need to process hundreds of thousands of tokens before reading a request."

— Anthropic Engineering Blog, "Code execution with MCP"

The Context Window: A Finite Resource

A context window is the total tokens an LLM can process in a single request. It includes everything: system prompt, tool definitions, conversation history, and the current query.

Typical Context Window Sizes (2025)

Model	Context Window
Claude 3.5 Sonnet	200,000 tokens
GPT-4 Turbo	128,000 tokens
Gemini 1.5 Pro	1,000,000 tokens

But here's the hidden cost: more context doesn't equal better results. More tokens to process means higher latency and higher costs. And quality degrades as relevant signal drowns in noise.

Problem 1: The Upfront Tool Definition Tax

Most MCP clients load all tool definitions into context immediately, before the agent even sees your query. No progressive discovery—everything upfront.

# Example: Salesforce Tool Definition (JSON Schema)

{ "name": "salesforce.updateRecord", "description": "Updates a record in Salesforce CRM system.", "parameters": { "objectType": { "type": "string", "required": true, "description": "Type of Salesforce object (Lead, Contact, Account, Opportunity, Case, etc.)" }, "recordId": { "type": "string", "required": true, "description": "18-character Salesforce ID, alphanumeric, case-sensitive" }, "data": { "type": "object", "required": true, "description": "Fields to update with new values" } }, "returns": "Updated record with ID, timestamp, validation status", "examples": [ ... 3-5 detailed examples ... ], "errors": [ ... error codes and handling ... ] }

A single tool definition: 400-800 tokens including examples and error handling

The math at scale becomes brutal.

Small Setup (10 tools)

Calculation: 10 tools × 500 tokens average = 5,000 tokens

Impact: Manageable overhead

Medium Setup (30 tools)

Calculation: 30 tools × 500 tokens = 15,000 tokens

Impact: Significant context consumption before user query

Large Setup (100 tools)

Calculation: 100 tools × 500 tokens = 50,000 tokens

System prompt overhead: +5,000 tokens

Total before user query: 55,000 tokens

Example: 10% of Claude's 200k window used before work begins

Real-World Impact: The GitHub MCP Server

Simon Willison documented a striking example: the GitHub MCP server alone defines 93 tools and consumes 55,000 tokens. That's nearly 28% of Claude's 200,000-token context window for a single MCP integration.

Add 2-3 more popular MCP servers and you've consumed 80%+ of your context window before the agent starts working.

Problem 2: Exponential Intermediate Result Accumulation

Each tool call adds its result to the context. The next tool call must include all previous context. This isn't linear growth—it's exponential.

Worked Example: Google Drive → Salesforce Workflow

Step 1: Retrieve Document from Google Drive

Context at step 1: ├─ System prompt: 3,000 tokens ├─ Tool definitions (30 tools): 15,000 tokens ├─ User query: 500 tokens └─ Total: 18,500 tokens Agent calls: gdrive.getDocument(documentId: "abc123") Tool result (2-hour meeting transcript): 12,000 tokens Context after step 1: ├─ Previous context: 18,500 tokens ├─ Assistant message (tool call): 150 tokens ├─ Tool result: 12,000 tokens └─ Total: 30,650 tokens

Step 2: Update Salesforce with Transcript

Context at step 2: 30,650 tokens (all carried forward)

Agent must: 1. Read 12,000-token transcript from context 2. Format it for Salesforce API 3. Call salesforce.updateRecord with transcript as parameter salesforce.updateRecord({ objectType: "SalesMeeting", recordId: "00Q5f000001abcXYZ", data: { Notes: "[entire 12,000-token transcript copied here]" } }) Tool result: 800 tokens (confirmation) Context after step 2: ├─ Previous context: 30,650 tokens ├─ Assistant message (includes transcript): 12,300 tokens ├─ Tool result: 800 tokens └─ Total: 43,750 tokens

The waste is staggering: The transcript appears in context three times. First as the tool result from Google Drive (12,000 tokens). Second in the assistant's tool call to Salesforce (12,000 tokens). Third as referenced in the Salesforce confirmation.

Effective tokens processed: 24,000+ tokens for a single data transfer. As Anthropic notes: "Every intermediate result must pass through the model. In this example, the full call transcript flows through twice."

The MCP Context Bloat Mechanism

❌ MCP Direct Tool Calling

• All tool definitions loaded upfront (15,000+ tokens)
• Every tool result flows through context
• Intermediate data copied between calls
• Context grows exponentially with each step

Result: 150,000 tokens for multi-step workflow

✓ Code Execution Pattern

• Load only needed tool definitions (progressive)
• Data stays in execution sandbox
• Only summaries return to context
• Context remains compact throughout

Result: 2,000 tokens for same workflow (98.7% reduction)

The Compounding Effect: Multi-Step Workflows

Consider a research agent analyzing 10 competitor websites. With MCP direct tool calling, the token accumulation becomes catastrophic.

Scenario: Research Agent (10 Websites)

Step 1: Scrape competitor 1 → 8,000 token result Context: 18k base + 8k = 26k tokens Step 2: Scrape competitor 2 → 8,000 token result Context: 26k + 8k = 34k tokens Step 3: Scrape competitor 3 → 8,000 token result Context: 34k + 8k = 42k tokens [... steps 4-10 ...] Step 10: Scrape competitor 10 → 8,000 token result Context: 90k + 8k = 98k tokens Step 11: Aggregate all data into report Context: 98k tokens Generates report: 2,000 tokens Total context: 100k tokens

Total Tokens Billed Across All Steps:

Step 1: 26k + Step 2: 34k + Step 3: 42k + ... + Step 11: 100k

~600,000 tokens

Cost Impact: The Hidden Waste

Single Workflow Cost

Claude Sonnet 3.5 pricing: $3 per million input tokens

600,000 tokens = $1.80 per workflow

Production Scale

Run 1,000 times per day:

$1,800/day = $54,000/month

For one agent workflow type. Multiply by the number of different workflows you run.

What You See vs What Gets Billed

The Token Accounting Illusion

Visible to User

User query: 50 tokens
Agent response: 200 tokens
Total visible: 250 tokens

Actually Billed (30-tool MCP)

System prompt: 3,000 tokens
Tool definitions: 15,000 tokens
User query: 50 tokens
Tool call results: 13,000 tokens
Agent response: 200 tokens
Total billed: 31,250 tokens

Overhead: 31,000 tokens — that's 124 times the visible interaction

The Performance Degradation Mechanism

Why do more tools make your agent dumber? Four compounding effects:

1. Signal-to-Noise Ratio Collapse

With 5 tools, the agent clearly sees relevant options. With 50 tools, relevant options are buried in irrelevant noise. LLM attention mechanisms struggle with large candidate pools.

2. Context Window Crowding

Tool definitions displace actual task context. Less room for your code, project details, or conversation history. Important information "scrolls out" of the effective window.

3. Latency Increases

More tokens to process equals longer time-to-first-token. User experience degrades with 15+ second waits. Compounding effect: each step slower than the previous.

4. Selection Paralysis

As Cloudflare notes: "If you present an LLM with too many tools, it may struggle to choose the right one or use it correctly." The model must evaluate 50+ options per decision point, increasing error rates and retries.

The Scale Thresholds

Tool Count	Performance Impact
< 10 tools	Manageable with MCP
10-20 tools	Degradation begins
20-40 tools	Serious performance issues
40+ tools	Architectural crisis

When Does This Become Critical?

The tipping point arrives when:

Token costs spike 10-50x unexpectedly
Agent response times exceed user patience (>10 seconds)
Agent quality noticeably degrades
Production incidents occur due to context window overflow
Finance asks: "Why is our AI bill $50,000 per month?"

Chapter Summary

• Upfront tax: All tool definitions loaded immediately (10k-66k+ tokens before user query)
• Exponential accumulation: Each tool result adds to context, carried forward indefinitely
• No garbage collection: Previous step results remain in context even when irrelevant
• Message loop overhead: Every tool call equals a new LLM request with full context
• Performance cliff: More tools equals worse quality, higher latency, exploding costs

"In cases where agents are connected to thousands of tools, they'll need to process hundreds of thousands of tokens before reading a request. This isn't just expensive—it fundamentally degrades the agent's reasoning capability."

— Anthropic Engineering Blog

Next chapter: Why code execution works better—the training distribution insight that explains everything.

The Developer's Verdict: What Theo Browne Really Thinks About MCP

Before we hear from researchers and protocol architects, let's listen to someone building with these tools in production. Someone with 470,000+ developers watching his every move. Someone who doesn't pull punches.

Why This Developer's Opinion Carries Weight

Theo Browne isn't publishing academic papers or designing protocols. He's building products, shipping code, and documenting what actually works in the messy reality of production systems.

Theo Browne

Creator of T3 Stack • CEO of Ping.gg • Former Twitch Engineer

470K+

YouTube Subscribers

15K+

GitHub Followers

YC W22

Y Combinator

His track record speaks for itself:

T3 Stack: A TypeScript-first full-stack framework (Next.js + tRPC + Tailwind + Prisma) adopted by thousands of developers worldwide
UploadThing: Developer tools for file uploads that abstract away complexity
Ping.gg: Tools for video content creators, built by someone who creates daily technical content
Twitch Engineering: Scaled video infrastructure for millions of concurrent users

His YouTube channel (t3.gg) has become known for "memes, hot takes, and building useful things for devs." He's polarizing—some find him overconfident, others appreciate his willingness to say what many developers think but won't say publicly.

Why Theo's Perspective Matters

He represents the grassroots developer community—the engineers actually trying to ship production code with these protocols. Not the vendors selling infrastructure, not the researchers publishing papers, but the builders in the trenches making architectural decisions under deadline pressure.

When Anthropic published their "Code execution with MCP" research, Theo dedicated a 19-minute video to deconstructing it. The video wasn't academic analysis. It was visceral developer feedback—the kind you hear in Slack channels and engineering standups, not conference keynotes.

And that's exactly why it matters.

The Moment of Vindication

"Thank you, Anthropic, for admitting I was right the whole fucking time. It makes no sense to just clog up your system prompt with a bunch of shit that probably isn't relevant for the majority of work you're doing."

— Theo Browne, first 60 seconds of his MCP video

The frustration is palpable. Here's a developer who's been building with (and fighting against) MCP in production, watching the protocol's creators publish research that validates everything he's been complaining about.

His opening framing sets the tone:

"It's time for another MCP video. If you're not familiar with my takes on MCP, it's my favourite example of AI being a bubble. I know way more companies building observability tools for MCP stuff than I know companies actually making useful stuff with MCP."

Translation: The ecosystem around MCP has more infrastructure than applications. More tooling vendors than actual users. More hype than results.

The Web3 Pattern

Theo draws a parallel to another hyped technology:

"I still remember back in the day when web3 was blowing up that I knew about six companies doing OAuth for web3 and one single company that could potentially benefit from that existing."

When everyone's building the picks and shovels but nobody's mining gold, that's a red flag for a bubble.

Three Technical Failures Theo Won't Let Slide

Strip away the colorful language, and Theo's critique breaks down into three core arguments—each backed by specific technical points that working developers will recognize immediately.

1. The Spec Itself Is Fundamentally Incomplete

"Not that the spec sucks, which it does..."

Theo doesn't mince words. His primary technical grievance?

"Did you know MCP has no concept of OAuth at all? At all. Now there's like 18 implementations of it because there's no way to do proper handshakes with MCP."

This is damning. A protocol designed to connect agents to external systems—many of which require authentication—ships without a standard auth mechanism. Every team builds custom solutions, defeating the entire purpose of standardization.

The Authentication Problem

MCP was supposed to reduce fragmentation. Instead:

18+ custom OAuth implementations
Hard-coded URLs with signed parameters
Every team solving the same problem differently
Zero interoperability for the one thing that matters most

Theo's assessment:

"MCP provides a universal protocol that does a third of what you need. Developers implement MCP once in their agent and then five additional layers to make it work."

2. Models Get Dumber, Not Smarter, With More Tools

"Models do not get smarter when you give them more tools. They get smarter when you give them a small subset of really good tools."

This contradicts the entire "ecosystem of integrations" promise. If connecting more MCP servers makes your agent worse, what's the point?

Theo walks through the context bloat problem with characteristic directness:

"In cases where agents are connected to thousands of tools, they'll need to process hundreds of thousands of tokens before reading a request."

And then the intermediate results problem—this is where it gets expensive:

"Every additional tool call is carrying all of the previous context. So every time a tool is being called, the entire history is being re-hit as input tokens. Insane. It's so much bloat. It uses so much context. It burns through so many tokens and so much money."

First Tool Call

Context size: 20 tokens
Cost: Baseline

Second Tool Call

Context size: 40 tokens
Cost: 2x baseline

Third Tool Call

Context size: 60 tokens
Cost: 3x baseline

Fourth Tool Call

Context size: 80 tokens
Cost: 4x baseline

He illustrates this with a concrete example developers will recognize:

"Instead of gdrive.getDocument, how about we do gdrive.findDocumentContent... This will return an array of documents. So then you do a bunch more tool calls because you want to have this content... each of these is an additional message being sent to the model that is a whole separate request, and each of these adds to the context."

The cost compounds exponentially. If you don't have caching set up properly for your inputs, you're burning cash on redundant context. And most teams don't even realize this is happening until they get the bill.

3. The Creators Are Now Admitting It Doesn't Work

This is where Theo gets genuinely frustrated. The company that created MCP has published research showing it's inefficient:

"Anthropic's own words: 'every intermediate result must pass through the model. In this example, the full call transcript flows through twice. For a two-hour sales meeting, that could mean processing an additional 50,000 tokens.'"

And then the kicker—the stat that appears throughout this entire ebook:

98.7%

Token reduction using code execution instead of MCP

Theo's reaction is part vindication, part disbelief:

"How the fuck can you pretend that MCP is the right standard when doing a fucking codegen solution instead saves you 99% of the wasted shit? That is so funny to me."

He continues:

"The creators of MCP are sitting here and telling us that writing fucking TypeScript code is 99% more effective than using their spec as they wrote it. This is so amusing to me."

Why Code Actually Works Better: Theo's Explanation

Theo doesn't just complain—he explains the mechanism. And his explanation aligns perfectly with Anthropic's research (and our earlier chapter on training distribution):

"Cloudflare's 'Code Mode' post makes the explicit argument that LLMs have seen far more TypeScript than they have seen MCP-style tool descriptions, so code-generation is naturally stronger than tool-calling in many real tasks."

He continues with cutting clarity:

"Turns out that writing code is more effective than making a fucking generic wrapping layer that doesn't have half the shit you need. Who would have thought?"

The Training Distribution Insight

Models are trained on:

Millions of code examples (functions, imports, control flow, error handling)
Thousands of synthetic tool schema examples

When you ask a model to work with tool schemas, you're forcing it to use a pattern it barely knows. When you ask it to write code, you're letting it use patterns it's seen millions of times.

The interface must match the capability.

And later, with characteristic bluntness:

"Do you know what they [models] do well, because there's a lot of examples? Write code. It's so funny to see this line in an official thing on the Anthropic blog. They're admitting that their spec doesn't work for the thing they build, which is AI models. Hilarious."

The fundamental insight: You can't fix a capability mismatch through protocol optimization. Models are trained on code. Give them code.

Real-World Evidence: Tools vs. Results

Theo makes a point that echoes throughout the developer community:

"Since launching MCP in November of 2024, adoption has been rapid by people trying to sell you things, not people trying to make useful things. The community has built thousands of MCP servers... [but I've seen] zero well-documented production deployments at scale."

The Replit "Trey" Example

Theo analyzes a real production agent to illustrate the problem:

"When I was playing with it and I noticed the quality of outputs not being great, I decided to analyse what tools their agents have access to... There are 23 tools available for the Solo coding environment agent."

This includes:

7 separate tools for file management
3 for running commands
3 for Supabase (even for users who don't use Supabase)

His assessment:

"I don't use Supabase. I don't even have an account. I've never built anything with Supabase. But when I use Trey, every single request I send has this context included for things I don't even use."

Every request pays the token tax for tools that will never be used. Multiply this across 23 tools, across thousands of requests, across hundreds of users.

"Ah, this is awful. How is this where we ended up and we assumed everything was okay?"

The Security Strawman

Anthropic's blog mentioned security concerns with code execution:

"Note that code execution introduces its own complexity. Running agent-generated code requires a secure environment for execution with appropriate sandboxing, resource limits, and monitoring. These infrastructure requirements add operational overhead and security considerations that direct tool calls avoid."

Theo has zero patience for this argument:

"No. This is fucking bullshit. This is absolute fucking bullshit. Every implementation of MCP I've seen that can do anything is way more insecure than a basic fucking sandbox with some environment variables."

He's right. Here's why:

MCP's Security Reality

No built-in OAuth
Custom auth implementations
Hard-coded credentials
Signed URLs as workarounds
Every team reinventing security

Code Execution Security

Proven sandboxing (Firecracker, gVisor)
Production-ready solutions (E2B, Daytona, Modal)
Resource limits (CPU, memory, network)
Filesystem isolation
Battle-tested infrastructure

Theo specifically calls out proven solutions:

"I don't know if Daytona is a sponsor for this video or not... but Daytona is the only sane way to do this that I know of. These guys have made deploying these things so much easier. You want a cheap way to safely run AI-generated code, just use Daytona. They're not even paying me to say this."

Sandboxing Is Solved Infrastructure

Production-grade options include:

Firecracker: Powers AWS Lambda (billions of executions)
gVisor: Google's production sandboxing
E2B: Purpose-built for AI code execution
Daytona: Developer-friendly sandbox deployment
Modal: Serverless compute for AI workloads

The "security concern" is a distraction. This is solved infrastructure, not a novel risk.

Where Theo Is Spot-On (And Where He Overstates)

Where Theo Is Absolutely Right

Five Points Developers Will Recognize

MCP's spec is incomplete: No OAuth, no progressive discovery, missing critical features that every production system needs
Context bloat is structural: Not a bug you can patch, not fixable through optimization—it's baked into the architecture
Models are better at code than schemas: Training distribution mismatch is real and measurable
The ecosystem is upside-down: More tooling vendors than actual products signals a bubble
Sandboxing is solved infrastructure: Not a novel risk, not a blocker, not an excuse

Where Theo Overstates the Case

In fairness, there are a few areas where Theo's frustration leads him to overstate:

The Nuanced Position

"MCP is proof Python people will destroy everything": Language tribalism distracts from the architecture argument. The problem isn't Python vs TypeScript—it's formal schemas vs code syntax.
"This is all bullshit": Some enterprise use cases (multi-vendor orchestration, governance layers) still benefit from MCP as a transport protocol.
Dismissing all formalization: Standards do have value—just not as the primary agent interface for individual developers.

The synthesis: Theo's anger is justified, but his prescription (abandon MCP entirely) is slightly too extreme. The nuanced position—which Anthropic's research supports—is:

Code execution should be the default pattern for most teams
MCP can exist as a backing protocol (transport, auth, discovery)
Standards are valuable for orchestration, not for individual agent workflows

The Meta-Complaint: Why It Took So Long

Perhaps Theo's most frustrated moment comes when reflecting on the industry's slow realization:

"This is when I complain about AI bros not building software or understanding how the software world works. This is what I'm talking about. All of these things are obviously wrong and dumb. You just have to look at it to realise."

He credits Cloudflare—a company known for infrastructure engineering, not AI hype—for being among the first to publish about Code Mode:

"What do you think Cloudflare is better at: LLMs or software development? If you've used Cloudflare's infrastructure, you know they're good at writing code. They had to make this a very popular thing and idea, and I had to make videos about those things because I have strong opinions, to get Anthropic to start acknowledging these facts."

The implication: It took actual software engineers (not AI researchers) to point out that the emperor has no clothes.

The Grassroots Validation Loop

Here's what actually happened:

Developers complained about MCP performance in production
Anthropic listened and ran experiments
Anthropic published honest findings (code execution wins)
Developers like Theo amplified the research ("see, we were right")

That's healthy engineering culture. The loop closed. Feedback was heard, validated, and published.

What Developers Actually Want

Stripping away Theo's colorful language, his practical recommendations align with everything we've covered in previous chapters:

❌ Don't Do This

Load hundreds of tool definitions upfront
Stream intermediate results through context
Assume "more tools = better agent"
Build custom auth layers around incomplete protocols
Sacrifice performance for theoretical ecosystem benefits

✓ Do This Instead

Generate code that imports specific modules on-demand
Process large datasets in execution environments
Return compact summaries to the model
Use proven sandboxing solutions
Let models do what they're trained to do (write code)

His example workflow captures the essence:

"The model writes code that imports the GDrive client and the Salesforce client, defines the transcript as this thing that it awaited from the gdrive.getDocument call, and then puts it in... the content of this document never becomes part of the context. It's never seen by the model because it's not touching any of that, because the model doesn't need to know what's in the doc; it needs to know what to do with it. That's the whole fucking point."

The Uncomfortable Truth for AI Optimists

Theo's rant contains an uncomfortable truth that bears repeating:

"Whenever somebody tells you AI is going to replace developers, just link them this."

He elaborates:

"This is all the proof I need that we are good. This is what happens when you let LLMs, and more importantly you let LLM people, design your APIs. Devs should be defining what devs use."

Developers aren't going anywhere. We're the ones who:

Noticed MCP was broken when vendors were still hyping it
Built workarounds in production to keep systems running
Tested alternatives (code execution) and measured the results
Provided feedback that led to better research
Will ultimately decide which patterns survive

The AI bubble has many believers in "let the models figure it out." The engineering reality is: models need humans to design good interfaces. MCP's struggles prove it.

Theo's Final Verdict

"I'm going to continue to not really use MCP. I hope this helps you understand why."

But he ends on a constructive note:

"Let me know what you guys think. Let me know how you're executing your MCP tools."

Despite the strong language, Theo isn't shutting down conversation—he's inviting it. He wants to know if anyone has actually made MCP work at scale. He's open to being wrong.

That's good engineering.

The Problem: Still No Success Stories

Months after Theo's video, the "here's how we made MCP work in production at scale" success stories still haven't materialized. The observability vendors are still more numerous than the actual products.

The grassroots developer feedback loop is working. The ecosystem is listening. But the fundamental architecture issues remain.

Why This Chapter Matters

Anthropic's research gave us quantified evidence (98.7% reduction). Cloudflare gave us independent validation (Code Mode). But Theo gives us something equally important:

Proof That Working Developers Have Been Living This Pain

When senior engineers read Anthropic's blog, some will think "interesting research." When they watch Theo's video, they think "finally, someone said it."

Both matter.

The research provides the intellectual foundation. Theo provides the emotional permission to trust your engineering instincts over industry hype.

If you've been fighting MCP in production, wondering if you're doing it wrong, Theo's message is clear:

You're not crazy. The architecture is broken. Use code instead.

How This Connects to What We've Already Covered

Theo's perspective validates and extends our earlier analysis:

Chapter Integration

Chapter 1 (The Confession): Theo shows the grassroots developer reaction to Anthropic's research—vindication, frustration, and relief that someone finally acknowledged the problem.
Chapter 2 (The Bloat Problem): His concrete examples (gdrive.findDocumentContent → multiple getDocument calls) illustrate exactly how token accumulation happens in real workflows.
Chapter 3 (Training Distribution Bias): He articulates the same core insight—"models are good at code because they've seen lots of code"—in language developers immediately understand.
Chapter 4 (Code-First Architecture): His workflow description (import clients, process data, return summaries) is the pattern in action.

Key Takeaways: The Developer Manifesto

What Working Developers Need to Know

Developer sentiment matters: Academic research proves the mechanism, but grassroots feedback proves the pain is real and widespread.
The ecosystem is inverted: More MCP tooling vendors than actual MCP-powered products is a clear signal of a bubble.
Incomplete specs create fragmentation: No OAuth → 18 custom auth implementations defeats the entire purpose of standardization.
Context bloat isn't theoretical: Real developers are burning real money on token costs they can't explain to leadership.
Security concerns are a distraction: Sandboxing is solved infrastructure; MCP's missing auth is the actual security risk.
Code is the better interface: Not ideology, not opinion—measurable performance difference validated by the protocol's creators.
Engineering culture matters: Anthropic deserves credit for listening to developer feedback and publishing honest research.

"How the fuck can you pretend that MCP is the right standard when doing a fucking codegen solution instead saves you 99% of the wasted shit?"

— Theo Browne, speaking for developers everywhere

Training Distribution Bias

Why LLMs Are Shakespeare Writing in Mandarin

"Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It's just not going to be his best work."

— Cloudflare Engineering Blog

The Interface Must Match the Capability

The fundamental insight from Cloudflare cuts through all the technical complexity: LLMs excel at patterns they've seen millions of times and struggle with patterns seen thousands of times. This isn't about MCP being "bad engineering"—it's about interface-capability mismatch.

Shakespeare's capability was English literature—lifetime of training, native fluency, poetic mastery. Ask him to perform in Mandarin after a month-long class? He'd produce functional work, but nothing approaching his sonnets or plays. The training distribution simply wasn't there.

The same principle applies to LLMs. Code generation is their native fluency. Tool-calling schemas are their Mandarin crash course.

The Training Data Reality

Let's examine what LLMs actually learned during pre-training, because the numbers reveal why code execution works better.

Training Data Scale: Code vs Tool Schemas

Code Corpus: Massive Scale

OpenCoder: 2.5 trillion tokens (90% raw code)
RefineCode: 960 billion tokens across 607 languages
Dolma: 3 trillion tokens of general web data
Stack Overflow: 58 million questions/answers (35GB)
GitHub: Hundreds of millions of public repositories
Source: Real-world production code, tutorials, documentation

Millions of real-world examples • Organic patterns • Production-tested

Tool-Calling Corpus: Synthetic and Tiny

xlam-function-calling-60k: 60,000 samples
FunReason-MT: 17,000 multi-turn examples
Special tokens: Never seen "in the wild"
Format: [AVAILABLE_TOOLS], [TOOL_CALLS], etc.
Coverage: Contrived scenarios created by researchers
Source: Synthetic training data, not real usage

Thousands of synthetic examples • Artificial patterns • Research-generated

Gap Magnitude: ~50 million to 1

Code examples outnumber tool-calling examples by approximately 50 million to one in typical LLM training data.

Why This Matters: Pattern Matching vs Reasoning

Here's the uncomfortable truth about LLMs: they're not reasoning systems. They're pattern-matching engines that predict the next token based on training distribution. They perform best on patterns they've seen most frequently.

"In short, LLMs are better at writing code to call MCP, than at calling MCP directly."

— Cloudflare Engineering Blog

The "Seen It Before" Advantage

Let's make this concrete with a side-by-side comparison: the same task (filtering a dataset) via MCP tool calling vs code generation.

Example: Data Filtering Task

❌ Via MCP Tool Calling (Rare Pattern)

{ "tool": "filterDataset", "parameters": { "dataset": "sheet_abc123", "filterField": "Status", "filterValue": "pending", "returnFields": ["OrderID", "CustomerName", "Amount"] } }

• Model has seen ~10,000 synthetic examples of this pattern
• Must remember exact parameter names, types, nesting structure
• Error-prone: typos in field names, wrong data types, missing params

Training exposure: Thousands of contrived examples

✓ Via Code Generation (Ubiquitous Pattern)

const allRows = await gdrive.getSheet({ sheetId: 'abc123' }); const pendingOrders = allRows.filter(row => row.Status === 'pending' ); const result = pendingOrders.map(order => ({ OrderID: order.OrderID, CustomerName: order.CustomerName, Amount: order.Amount }));

• Model has seen .filter() and .map() millions of times
• Familiar syntax from GitHub, Stack Overflow, documentation
• Self-correcting: if code doesn't work, error messages guide fixes

Training exposure: Millions of real-world examples

The code approach leverages patterns the model has seen extensively during training, resulting in higher accuracy and better composition.

Training Data Breakdown: What LLMs Actually Know

What LLMs Have Seen A Lot Of

Python code: Billions of lines from GitHub, tutorials, Stack Overflow

TypeScript/JavaScript: Billions of lines from web dev, Node.js, React, libraries

Common patterns: import requests, for item in items, async/await, try/except

API calling: Millions of examples of REST endpoints, JSON responses, error handling

Control flow: if/else, while/for, switch/case in every tutorial ever written

Example: Stack Overflow alone contributes 35GB of curated code examples

What LLMs Have Seen Very Little Of

Formal tool schemas: Synthetic only, created by researchers for training

Special tokens: [AVAILABLE_TOOLS], [TOOL_CALLS], [TOOL_RESULTS] never appear in real code

Multi-step tool orchestration: Most training examples show single tool calls only

Complex workflows: Rare in training data; real production agent patterns too recent

Size target: Largest function-calling dataset has only 60,000 examples (<1MB equivalent)

Example: Mistral model adds special tokens purely for tool-calling—never seen in wild

The Shakespeare Analogy Expanded

Cloudflare's analogy deserves deeper exploration, because it captures the fundamental mismatch perfectly.

Dimension	Shakespeare in English	Shakespeare in Mandarin
Training	Lifetime immersion	Month-long crash course
Vocabulary	Native fluency, poetic mastery	Basic vocabulary memorized
Idiomatic expressions	Deeply understood, culturally embedded	Missing from training
Output quality	Sonnets, plays, poetic brilliance	Functional but not poetic, surface-level

Dimension	LLMs with Code Generation	LLMs with Tool Calling
Training data	2.5–3 trillion tokens, millions of examples	60,000 synthetic examples, never in wild
Pattern recognition	Native fluency (core training data)	Basic schema following (fine-tuned)
Error handling	Self-correcting (familiar error messages)	Error-prone (unfamiliar format)
Output quality	Reliable, composable, debuggable	Capable but error-prone, doesn't compose well

Why Formal Schemas Don't Help

There's a persistent intuition that "formal schemas should be easier—they're structured, typed, validated." But structure doesn't overcome training distribution bias.

As Cloudflare notes: "If you present an LLM with too many tools, or overly complex tools, it may struggle to choose the right one or to use it correctly. As a result, MCP server designers are encouraged to present greatly simplified APIs."

The problem: "Simplified for LLMs" often means "less useful to developers."

Tool APIs get dumbed down to help LLM selection. Code APIs can be rich and expressive because LLMs handle code syntax naturally.

Training Distribution Defines Performance Ceiling

Here's the core principle that explains everything:

Fundamental Principle

The best interface for an LLM is the one closest to its training distribution.

Code syntax beats formal schemas because models have seen vastly more code during pre-training.

This wasn't obvious earlier because we assumed human software engineering principles would transfer to LLMs. We thought: formal systems are better than ad-hoc ones, so LLMs should prefer structured tool schemas over freeform code generation.

But LLMs aren't reasoning systems—they're pattern-matching systems. Give them patterns they've seen before. Millions of times. Not thousands.

Predictive Power of This Insight

Understanding training distribution bias gives us predictive power about when tool calling might catch up to code execution, and when code will remain superior.

When Tool Calling WILL Improve

Training data shift: As training corpora include millions of real-world agent traces (not synthetic examples)

Time frame: Would require 3–5 years of widespread production adoption to generate organic training data at scale

Chicken-and-egg problem: Agents must work well enough to be widely deployed before training data accumulates

When Code Execution WILL Remain Superior

Foreseeable future: 3–5 years minimum, likely longer

Compounding advantage: Code corpus grows faster than tool-calling corpus (new languages, libraries, patterns constantly added)

Organic vs synthetic: Code training data is real-world usage; tool-calling remains researcher-generated

This principle also generalizes to other AI interface decisions:

Prompt engineering: Natural language descriptions beat rigid templates (models trained on real writing, not prompt formats)
API design for agents: REST + JSON beats bespoke protocols (familiar web dev patterns in training)
Error handling: Descriptive English messages beat error codes (models understand explanations better than numeric codes)

Chapter Summary

• LLMs are pattern-matching engines, not reasoning systems—they perform best on patterns seen most frequently in training
• Code corpus: 2.5–3 trillion tokens, millions of real examples; Tool-calling corpus: 60,000 synthetic examples (gap: ~50 million to 1)
• Shakespeare analogy: fluent in English (lifetime training), functional in Mandarin (month-long class)—same mismatch between code and tool schemas
• Code execution aligns with model strengths; tool calling fights model limitations—performance gap is structural, not fixable with prompts
• Best interface = closest to training distribution

"We found agents are able to handle many more tools, and more complex tools, when those tools are presented as a TypeScript API rather than directly."

— Cloudflare Engineering Blog

Next: The alternative pattern Anthropic and Cloudflare recommend—code-first architecture.

Code-First Architecture

The Pattern Anthropic and Cloudflare Actually Recommend

"With code execution environments becoming more common for agents, a solution is to present MCP servers as code APIs rather than direct tool calls."

— Anthropic Engineering Blog

From Problem to Solution

The previous chapters established the foundation. Chapter 1 revealed Anthropic's 98.7% token reduction finding. Chapter 2 exposed the structural context bloat problem—not a bug, but a fundamental design characteristic. Chapter 3 explained why: LLMs are trained on trillions of code tokens but only thousands of synthetic tool-schema examples.

This chapter answers: What do we do instead?

We'll examine the pattern Anthropic recommends in their research, how Cloudflare implemented it at production scale, and the architecture that delivers 98.7% efficiency gains while maintaining security and debuggability.

The Code-First Pattern: Architecture Overview

Flow Diagram

┌──────────┐
│   User   │
│  Query   │
└────┬─────┘
     │
     ▼
┌──────────────────┐
│   LLM (Agent)    │  ← Minimal context (no tool schemas loaded)
│  Generates Code  │
└────┬─────────────┘
     │
     ▼ Code (Python/TypeScript)
┌──────────────────────────────┐
│   Secure Sandbox Environment │
│                              │
│  ┌────────────────────────┐ │
│  │  Generated Code Runs   │ │
│  │                        │ │
│  │  Imports APIs          │ │  ← MCP servers as backing transport
│  │  Processes Data        │ │
│  │  Loops, Filters, Maps  │ │
│  │  Stores Intermediate   │ │
│  └────────────────────────┘ │
│                              │
│  Data stays inside sandbox   │
└────┬─────────────────────────┘
     │
     ▼ Compact Summary
┌──────────────────┐
│   LLM (Agent)    │  ← Receives only final results (100-500 tokens)
│ Analyzes Results │
│ Responds to User │
└──────────────────┘

Key Differences from MCP Direct Calling

No tool schemas in context: LLM doesn't see formal tool definitions upfront
Code as interface: Agent writes Python/TypeScript instead of calling JSON-RPC tools
Data stays in sandbox: Large datasets never enter LLM context
Progressive discovery: Load only needed tool documentation on-demand
Summary return: Sandbox sends back compact analysis, not raw data

The Five Components of Code-First Architecture

Component 1: Progressive Disclosure

Problem Solved: Tool definition bloat (66k+ tokens loaded upfront)

How it works: Tools presented as filesystem (directory structure). Agent explores via ls, cat, searches when needed. Only loads definitions actually required for task.

Example: User asks to get meeting notes from Google Drive and attach to Salesforce. Agent explores servers/, loads only getDocument.ts (200 tokens) and updateRecord.ts (180 tokens). Total: 380 tokens vs 15,000+ from loading all tools.

Component 2: Signal Extraction

Problem Solved: Intermediate results consuming exponential tokens

How it works: Agent generates code to process data IN the sandbox. Filtering, aggregation, transformation happen locally. Only final summary returned to LLM context.

Example: 10,000-row spreadsheet. MCP approach loads all 50,000 tokens. Code execution filters in sandbox, returns "Found 127 pending orders, 23 are high-value (>$1000), here are top 5" (~200 tokens). Reduction: 99.6%

Component 3: Control Flow in Code

Problem Solved: Multi-step workflows creating exponential context bloat

How it works: Loops, conditionals, error handling in familiar code syntax. LLM has seen these patterns millions of times (training distribution match). More reliable than chaining tool calls.

Example: Polling for deployment notification. Instead of 10 separate tool calls (each adding context), agent generates while-loop code once. Single LLM call, execution happens locally, deterministic behavior.

Component 4: Privacy-Preserving Operations

Problem Solved: Sensitive data flowing through LLM context (and potentially logged/trained on)

How it works: Data flows through sandbox, not through LLM. LLM only sees what you explicitly log/return. Optional: tokenize PII before it reaches model.

Example: Importing 1,000 customer records from Google Sheets to Salesforce. With MCP, all PII in context. With code execution, PII flows Sheets→Salesforce via sandbox. LLM sees only "Updated 1,000 leads" (no actual PII).

Component 5: State Persistence and Skills

Problem Solved: Agents can't remember, can't build reusable capabilities

How it works: Filesystem access allows saving intermediate results. Agents can write tools, save for future use. Skills library grows over time.

Example: Agent creates save-sheet-as-csv skill (day 1), then analyze-sales-trends (day 3), then combines both into monthly-sales-report skill (day 7). Compound capabilities without re-prompting.

Progressive Disclosure: On-Demand Tool Loading

Anthropic's research emphasizes: "Models are great at navigating filesystems. Presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front."

Example Structure (from Anthropic blog)

servers/
├── google-drive/
│   ├── getDocument.ts
│   ├── updateDocument.ts
│   ├── createDocument.ts
│   ├── ...
│   └── index.ts
├── salesforce/
│   ├── updateRecord.ts
│   ├── createLead.ts
│   ├── ...
│   └── index.ts
└── slack/
    ├── postMessage.ts
    ├── getChannelHistory.ts
    ├── ...
    └── index.ts

Agent Workflow Example

User asks: "Get my meeting notes from Google Drive and attach to Salesforce lead"
Agent explores: ls servers/ → sees google-drive, salesforce options
Agent reads specific files:
- cat servers/google-drive/getDocument.ts (200 tokens)
- cat servers/salesforce/updateRecord.ts (180 tokens)
Total loaded: 380 tokens (not 15,000+ from loading all tools)

Signal Extraction: Processing Data in the Sandbox

The second component addresses exponential token accumulation. Instead of flowing large datasets through LLM context, code execution processes data locally and returns compact summaries.

Example: 10,000-Row Spreadsheet

❌ MCP Direct Calling

1. Call gdrive.getSheet(sheetId: 'abc123')
2. Receive all 10,000 rows in context (50,000 tokens)
3. LLM must manually filter in next tool call
4. Copy filtered results to new tool call (25,000 more tokens)
Total context: 75,000+ tokens

✓ Code Execution

Agent generates filtering code (runs in sandbox)
Data processing happens IN sandbox (not in LLM context)
Further filtering/aggregation performed locally
Returns summary: "Found 127 pending, 23 high-value, top 5: [data]"
Total context: ~500 tokens (99.3% reduction)

Code Execution Example (TypeScript)

// Agent generates this code, runs in sandbox:
const allRows = await gdrive.getSheet({ sheetId: 'abc123' });

// Data processing happens IN sandbox (not in LLM context)
const pendingOrders = allRows.filter(row => row.Status === 'pending');

// Further filtering/aggregation
const highValue = pendingOrders.filter(row => row.Amount > 1000);

// Return compact summary to LLM
console.log(`Found ${pendingOrders.length} pending orders`);
console.log(`${highValue.length} are high-value (>$1000)`);
console.log(`Top 5:`, highValue.slice(0, 5));

// LLM sees ~200 tokens (summary), not 50,000 (full dataset)

"The agent sees five rows instead of 10,000. Similar patterns work for aggregations, joins across multiple data sources, or extracting specific fields—all without bloating the context window."

— Anthropic Engineering Blog

Control Flow: Loops and Conditionals in Code

Multi-step workflows create exponential context bloat with MCP. Code execution solves this by expressing logic in familiar programming constructs that LLMs have seen millions of times in training.

Example: Polling for Deployment Notification

❌ MCP Tool Chain

1. Call slack.getChannelHistory()
2. LLM checks if "deployment complete" in results
3. If not found, sleep 5 seconds (another tool call)
4. Repeat steps 1-3 (each iteration adds context)
5. After 10 checks: 10 × full context = massive token waste

✓ Code Execution

let found = false;
while (!found) {
  const messages = await slack
    .getChannelHistory({
      channel: 'C123456'
    });
  found = messages.some(m =>
    m.text.includes('deployment complete')
  );
  if (!found)
    await new Promise(r =>
      setTimeout(r, 5000)
    );
}
console.log('Deployment received');

Privacy-Preserving Operations: PII Stays Out of Context

A critical security advantage: sensitive data flows through the sandbox, not through the LLM. The model only sees what you explicitly log or return.

"When agents use code execution with MCP, intermediate results stay in the execution environment by default. This way, the agent only sees what you explicitly log or return, meaning data you don't wish to share with the model can flow through your workflow without ever entering the model's context."

— Anthropic Engineering Blog

Privacy-Preserving Data Flow

// Agent generates code (no PII seen yet):
const sheet = await gdrive.getSheet({ sheetId: 'abc123' });

for (const row of sheet.rows) {
  await salesforce.updateRecord({
    objectType: 'Lead',
    recordId: row.salesforceId,
    data: {
      Email: row.email,    // PII flows through sandbox only
      Phone: row.phone,
      Name: row.name
    }
  });
}

// LLM sees only:
console.log(`Updated ${sheet.rows.length} leads`);
// (No actual PII in LLM context)

State Persistence and Skills: Self-Improving Agents

Filesystem access enables agents to save intermediate results and build reusable capabilities. Over time, agents develop a skills library that compounds their effectiveness.

State Persistence Example

// Query Salesforce, save results locally
const leads = await salesforce.query({
  query: 'SELECT Id, Email FROM Lead LIMIT 1000'
});

const csvData = leads.map(l => `${l.Id},${l.Email}`).join('\n');
await fs.writeFile('./workspace/leads.csv', csvData);
console.log('Saved 1,000 leads to leads.csv for later processing');

// Later execution (different session):
const saved = await fs.readFile('./workspace/leads.csv', 'utf-8');
// Agent picks up where it left off

Reusable Skill Creation

// Agent creates ./skills/save-sheet-as-csv.ts
import * as gdrive from '../servers/google-drive';

export async function saveSheetAsCsv(sheetId: string): Promise {
  const data = await gdrive.getSheet({ sheetId });
  const csv = data.map(row => row.join(',')).join('\n');
  const filename = `./workspace/sheet-${sheetId}.csv`;
  await fs.writeFile(filename, csv);
  return filename;
}

// Future sessions - agent reuses skill:
import { saveSheetAsCsv } from './skills/save-sheet-as-csv';
const csvPath = await saveSheetAsCsv('xyz789');
console.log(`Saved to ${csvPath}`);

Architecture Comparison: MCP vs Code-First

Aspect	MCP Direct Calling	Code-First Execution
Tool Loading	All upfront (15k-66k tokens)	Progressive (200-500 tokens per task)
Data Flow	Through LLM context	Through sandbox, summary only
Multi-Step	Each step = new LLM call + full context	Single LLM call generates code, local execution
Control Flow	Tool chaining (error-prone)	Native code (loops, conditionals—reliable)
Privacy	All data in LLM context	Data in sandbox, PII never exposed
State	Stateless (unless external DB)	Filesystem persistence
Debugging	Check LLM reasoning trace	Standard code debugging (logs, breakpoints)
Cost	150,000 tokens typical	2,000 tokens typical (98.7% reduction)
Latency	10+ seconds (many LLM calls)	2-3 seconds (one generation + fast execution)
Skills	Can't self-improve	Can save reusable functions

"The engineering teams at Anthropic and Cloudflare independently discovered the same solution: stop making models call tools directly. Instead, have them write code."

— Third-party analysis, MarkTechPost

When Code-First Wins (80%+ of Use Cases)

Code-first architecture excels in developer productivity scenarios where performance, cost, and privacy matter more than vendor-neutral tool registries.

Ideal Scenarios for Code-First

Single-team agents: Not multi-vendor orchestration requiring formal tool registry
Focused workflows: Specific tasks with clear requirements (log analysis, data migration, report generation)
Performance-critical systems: Where cost and latency directly impact user experience
Privacy-sensitive data: GDPR, HIPAA compliance requires PII never enters LLM context
Complex multi-step processes: Data transformation pipelines, ETL workflows, monitoring systems

Concrete Examples

Log analysis: Process gigabytes of logs, extract patterns, return insights (~500 tokens)
Data migration: Extract-transform-load pipelines moving data between systems
Report generation: Query multiple sources, aggregate metrics, format output
Monitoring: Poll systems, detect anomalies, send alerts based on thresholds
Testing: Generate test data, run assertions, report results with failure details

Trade-Offs and Limitations

Anthropic acknowledges: "Code execution introduces its own complexity. Running agent-generated code requires a secure execution environment with appropriate sandboxing, resource limits, and monitoring. These infrastructure requirements add operational overhead and security considerations that direct tool calls avoid."

What You Need:

Secure sandbox (Daytona, E2B, Modal, Cloudflare Workers)
Resource limits (CPU, memory, disk quotas)
Network isolation or controlled egress
Monitoring and logging
Error handling for malformed code

See Chapter 5 for detailed sandbox comparison and production security guidance.

Production-Ready Today

This isn't experimental research. Code-first architecture is deployed at production scale by industry leaders.

Cloudflare

Production deployment using Workers isolates. "You can start playing with this API right now when running workerd locally with Wrangler, and you can sign up for beta access to use it in production."

Docker + E2B

Strategic partnership: "Today, we're taking that commitment to the next level through a new partnership with E2B, a company that provides secure cloud sandboxes for AI agents."

Anthropic

Published research, official recommendation. Not just theory—backed by quantified evidence (98.7% token reduction) from real-world testing.

Sandbox Providers

E2B (150ms starts), Daytona (90-200ms starts), Modal (sub-second), Cloudflare Workers (milliseconds). Production-grade infrastructure available today.

Chapter Summary

Code-First Architecture: Five Components

1. Progressive Disclosure: Load tool docs on-demand (filesystem/search), not upfront
2. Signal Extraction: Process data in sandbox, return summaries (not raw datasets)
3. Control Flow in Code: Loops/conditionals in familiar syntax (not tool chains)
4. Privacy-Preserving: Data stays in sandbox (PII never in LLM context)
5. State & Skills: Filesystem persistence, reusable functions

Benefits

✓ 98.7% token reduction (150k → 2k)
✓ 5-10x latency improvement (fewer LLM calls)
✓ Better privacy (data doesn't flow through LLM API)
✓ Familiar debugging (code tools, not LLM reasoning traces)
✓ Self-improving agents (skills library)

Trade-Offs

Requires secure sandbox infrastructure
Need monitoring and resource limits
Operational complexity (but production-ready solutions exist)

"With code execution environments becoming more common for agents, a solution is to present MCP servers as code APIs rather than direct tool calls."

— Anthropic Engineering Blog

Next: Production Sandboxing

How to secure code execution in production—sandbox comparison and defense-in-depth strategies.

We'll examine E2B, Daytona, Modal, and Cloudflare Workers, covering security models, performance characteristics, and deployment patterns for enterprise environments.

Chapter 5: Production Sandboxing | Code-First Agents

Chapter 5: Production Sandboxing

Making Code Execution Safe at Scale

Chapter 4 showed you the what (code-first architecture) and the why (98.7% token reduction, privacy, state persistence). This chapter addresses the concern that stops most teams from adopting it: security.

The good news? Production-grade sandboxing is solved infrastructure. Not experimental. Not risky. Battle-tested at scale by companies like Cloudflare, Docker, Google Cloud, and NVIDIA. You don't need to build it from scratch—you choose a provider and configure it.

Security Is Not a Blocker—It's Production-Ready

When engineers first encounter "let the LLM write code that runs in your infrastructure," the immediate reaction is:

"What if it writes malicious code? What about resource exhaustion? API key leakage? Network access? This sounds like a massive attack surface."

These concerns are valid and correct. Running untrusted code is risky—if you do it naively. That's why every production code execution system uses defense-in-depth sandboxing.

Four major sandbox providers—E2B, Daytona, Modal, and Cloudflare Workers—have production deployments handling millions of executions daily. They've solved resource limits, network isolation, credential management, and monitoring. This isn't theory; it's operational reality.

The Five Layers of Defense-in-Depth

Production sandboxing uses a layered security model. If an attacker bypasses one layer, four more block them. Here's how it works:

Layer 1: Sandbox Isolation

What it does: Each code execution runs in a completely isolated environment—separate from your host OS, other sandboxes, and sensitive infrastructure.

Technologies:

Firecracker microVMs (E2B): Hardware-level virtualization, strongest isolation
gVisor containers (Modal, Google Cloud): Kernel-level isolation with intercepted syscalls
OCI containers (Daytona): Lightweight namespaces and cgroups
V8 isolates (Cloudflare): Process-level separation for JavaScript/TypeScript

Attack blocked: Code can't access host filesystem, other containers, or break out to underlying system.

Layer 2: Network Restrictions

What it does: Control exactly what the sandbox can access over the network.

Strategies:

Block internet by default: Disable fetch(), connect() globally (Cloudflare pattern)
Whitelist-only egress: Allow specific MCP servers, block everything else
Bindings instead of raw URLs: Provide pre-authorized API clients, not endpoints (hides credentials)

Attack blocked: Code can't exfiltrate data to attacker-controlled servers, can't make arbitrary HTTP requests, can't scan internal networks.

Layer 3: Resource Quotas

What it does: Prevent resource exhaustion attacks (fork bombs, memory leaks, infinite loops).

Limits enforced:

CPU time: 10-30 second execution timeout (configurable)
Memory: 100MB-1GB RAM ceiling (sandbox killed if exceeded)
Disk I/O: Rate limiting, quota on writes
Process count: Max PIDs to prevent fork bombs

Attack blocked: Malicious code that tries to consume all resources, crash host, or create denial-of-service.

Layer 4: Monitoring & Logging

What it does: Observe all code execution in real-time, detect anomalies, build audit trails.

Monitored events:

Code generated: Log every script the LLM produces
Execution outcomes: Success, failure, errors, timeouts
Anomaly detection: Flag unusual patterns (e.g., sudden spike in network calls, repeated failures)
Rate limiting: Per-user, per-agent, per-sandbox quotas

Attack blocked: Slow-burn attacks become visible, compliance audits have full trail, suspicious patterns trigger alerts.

Layer 5: Code Review (Optional Human-in-Loop)

What it does: For high-stakes operations, require human approval before execution.

Strategies:

Automatic approval: Known-safe patterns run immediately (read-only queries, data aggregation)
Manual approval: Sensitive operations wait for human review (database writes, external API calls)
Banned operations: Reject code that calls dangerous syscalls (e.g., exec(), eval() on user input)

Attack blocked: Sophisticated attacks that bypass other layers but require human judgment to catch.

Sandbox Provider Comparison

Four major providers dominate the AI code execution space. Each has different trade-offs around isolation strength, startup speed, cost, and ecosystem fit.

Provider	Technology	Cold Start	Best For
E2B	Firecracker microVMs	~150ms	AI agent tools, demos, 24hr persistence
Daytona	OCI containers + Kubernetes	90-200ms	Enterprise dev environments, stateful sessions
Modal	gVisor containers	Sub-second	ML/AI workloads, batch jobs, scale-to-millions
Cloudflare Workers	V8 isolates	Milliseconds	Disposable execution, web-scale, lowest cost

E2B: Strongest Isolation, AI-Focused

Core architecture: Firecracker microVMs (same tech AWS Lambda uses). Each execution gets a true virtual machine with hardware-level isolation.

Strengths:

Strongest security boundary (VM escape is extremely difficult)
24-hour persistence (sandbox stays alive between sessions)
Polished SDKs (Python, TypeScript, JavaScript)
Fast for microVMs (150ms cold starts)

Trade-offs:

No self-hosting (managed service only)
Higher cost at scale (VM overhead vs containers)
Designed for sandboxing, not full dev environments

Use when: You need strongest isolation for untrusted code (e.g., user-submitted agents), 24hr sessions, or rapid prototyping with excellent DX.

Daytona: Enterprise Development Environments

Core architecture: OCI containers orchestrated by Kubernetes. Focused on developer productivity at enterprise scale.

Strengths:

Lightning-fast startups (90-200ms)
Stateful persistence (not ephemeral)
Kubernetes + Helm + Terraform integration
Self-hosting option (for compliance requirements)
Programmatic SDK for automation

Trade-offs:

Requires DevOps expertise to run yourself
More complex than E2B (power comes with complexity)
Weaker isolation than microVMs (containers share kernel)

Use when: Enterprise deployment with existing Kubernetes infrastructure, compliance mandates on-prem hosting, or you need full control over environment lifecycle.

Modal: ML/AI Batch Workloads at Scale

Core architecture: gVisor containers with persistent network storage. Built for scaling to millions of executions daily.

Strengths:

Handles millions of daily executions
Sub-second cold starts
Built-in networking tunnels (safe database/API access)
Per-sandbox egress policies
Python-first platform (great for ML/AI)

Trade-offs:

Requires SDK-defined images (not fully dynamic)
Python-centric (TypeScript/JavaScript support exists but secondary)

Use when: You have high-volume batch processing needs, ML model inference pipelines, or need persistent storage across executions.

Cloudflare Workers: Millisecond Isolates at Web Scale

Core architecture: V8 isolates (same tech Chrome browser uses). Lightest-weight option by far.

Strengths:

Millisecond cold starts (20-50ms typical)
Lowest cost (isolates far cheaper than containers)
No prewarm needed (so fast you create new sandbox every execution)
Global edge deployment (low-latency worldwide)
"Bindings" model hides API keys from code (security win)

Trade-offs:

Ephemeral (no persistence between executions)
JavaScript/TypeScript only (no Python, Go, etc.)
Weaker isolation than VMs or gVisor (process-level, not kernel-level)

Use when: You need disposable execution (create fresh sandbox every time), web-scale distribution, or absolute lowest cost per execution.

The Credential Problem: No API Keys in Sandbox

One of the hardest security challenges in AI agents: how do you let the agent call APIs without giving it your API keys?

Traditional approaches fail:

Environment variables: Code can read process.env and leak them
Config files: Code can readFile() and exfiltrate
Hardcoded in code: Worst of all—keys visible in LLM-generated scripts

Cloudflare pioneered a better pattern: bindings.

How Bindings Work

Instead of giving the sandbox raw credentials, you provide pre-authorized client objects.

Example: Agent needs to call Salesforce API.

Traditional (unsafe):

// Code sees API key (can leak it)
const client = new Salesforce({
  apiKey: process.env.SALESFORCE_KEY  // ⚠️ Visible to LLM
});
await client.updateLead(...);

Bindings (safe):

// Code gets pre-authorized client (no key visible)
const client = env.salesforce;  // ✅ Already authenticated
await client.updateLead(...);

Behind the scenes, the supervisor (trusted code outside sandbox) holds the API key, intercepts calls on the env.salesforce binding, adds auth headers, and proxies to Salesforce. The sandbox never sees the key.

Comparison: MCP Security vs Code Execution Security

Critics argue: "MCP tool calling is safer than code execution because tools are constrained interfaces."

Reality check: MCP has no built-in authentication layer. Teams build custom auth on top anyway. Code execution sandboxing is more mature infrastructure.

Security Concern	MCP Direct Calling	Code Execution
Authentication	⚠️ No spec standard—teams build custom	✅ Bindings pattern hides credentials
Isolation	⚠️ Tool code runs in same process	✅ Hardware/kernel-level VM/container isolation
Resource limits	⚠️ Application-level (easy to bypass)	✅ OS-enforced (cgroups, quotas)
Network control	⚠️ Tool can make arbitrary HTTP if allowed	✅ Network namespace isolation, egress policies
Audit trail	⚠️ Must instrument every tool	✅ All executions logged by platform
Blast radius	⚠️ Malicious tool affects whole agent	✅ Contained to single sandbox instance

Verdict: Code execution with proper sandboxing is more secure than MCP direct calling because the infrastructure is purpose-built for untrusted code. MCP's "tool interface" abstraction provides a false sense of safety—you still need custom auth, rate limiting, and monitoring.

Practical Implementation: Defense-in-Depth Checklist

When deploying code execution in production, use this checklist to ensure all five layers are active:

Production Sandboxing Checklist

☑ Layer 1: Isolation

□ Choose sandbox provider (E2B, Daytona, Modal, Cloudflare)
□ Verify isolation technology matches threat model (VM > gVisor > containers > isolates)
□ Test sandbox escape attempts in staging (try cat /etc/passwd, network probes)

☑ Layer 2: Network

□ Block internet by default OR whitelist-only egress
□ Use bindings for MCP servers (don't expose raw URLs/credentials)
□ Monitor egress traffic (alert on unexpected destinations)

☑ Layer 3: Resources

□ Set execution timeout (10-30 seconds typical, adjust for workload)
□ Set memory limit (100MB-1GB, kill sandbox if exceeded)
□ Set disk quota (prevent runaway log files)
□ Set max process count (prevent fork bombs)

☑ Layer 4: Monitoring

□ Log all generated code (before execution)
□ Log all execution outcomes (success, error, timeout)
□ Set up anomaly detection (unusual patterns, rate spikes)
□ Implement rate limiting per user/agent

☑ Layer 5: Review (Optional)

□ Define "safe" vs "requires approval" operations
□ Auto-approve read-only queries, data aggregation
□ Require human approval for writes, external calls
□ Reject dangerous patterns (eval(), exec() on user input)

Common Objections Answered

Objection: "This adds operational complexity we don't have expertise for."

Response: Use managed services (E2B, Modal, Cloudflare). They handle infrastructure, you configure policies. Comparable complexity to setting up a database—not trivial, but well-documented.

Example: E2B SDK is ~10 lines of Python to get a working sandbox. Cloudflare Workers integrate with your existing Wrangler setup.

Objection: "What if the LLM generates code that looks safe but has hidden backdoors?"

Response: Layer 2 (network restrictions) and Layer 5 (code review) catch this. Even if code looks benign, it can't exfiltrate data if egress is blocked. For high-stakes operations, require human approval.

Example: Code tries fetch('https://attacker.com')—blocked by network policy. Logs show attempt, alerts fire, you investigate.

Objection: "Sandboxes can be escaped—I've seen CVEs."

Response: True, but defense-in-depth means one escape doesn't compromise everything. Firecracker/gVisor escapes are rare and patched quickly. Even if escaped, network restrictions and resource limits still apply. Compare to MCP: no isolation at all.

Risk mitigation: Use providers that auto-patch (E2B, Modal, Cloudflare), monitor security advisories, rotate sandboxes frequently.

Objection: "Our compliance team won't approve untrusted code execution."

Response: Frame it correctly: "LLM-generated code execution in hardware-isolated VMs with network restrictions, resource quotas, and full audit logs." This is more auditable than MCP (black-box tool calls). Google Cloud, NVIDIA, and Docker all endorse this pattern.

Compliance win: Every execution logged (GDPR audit trail), PII never in LLM context (privacy by design), deterministic security rules (SOC 2 control).

Chapter Summary

Key Takeaways

1 Security is solved infrastructure. E2B, Daytona, Modal, Cloudflare Workers are production-ready with millions of daily executions.
2 Defense-in-depth has five layers: Isolation, network restrictions, resource quotas, monitoring, and optional code review.
3 Bindings hide credentials. Cloudflare's pattern gives sandboxes pre-authorized clients, not raw API keys—LLM can't leak what it doesn't see.
4 Code execution is more secure than MCP. MCP has no auth layer, no isolation, no OS-enforced limits. Sandboxing is purpose-built for untrusted code.
5 Choose provider by isolation strength and speed: Firecracker (strongest, 150ms) → gVisor (kernel-level, sub-second) → containers (fast, 90ms) → isolates (milliseconds, lowest cost).

What's Next

Chapter 6 shows code-first architecture in action: a real-world debugging investigation where the agent wrote its own tools (log_analyzer.py, backup_tracer.py), processed gigabytes of data in a sandbox, and returned a 2-page analysis—all without a single MCP tool schema in context.

You've seen the mechanism (Chapter 2), the training bias explanation (Chapter 3), the architecture (Chapter 4), and the security model (Chapter 5). Now see it work.

Code-First in Production

Two real-world investigations where agents wrote their own tools

Theory is nice. Let's see code-first architecture handle real production problems. Two investigations, two different domains, same pattern: the agent writes custom tools, processes gigabytes of data in a sandbox, and returns compact analyses. No MCP. No pre-defined tool schemas. Just code.

Case Study 1: Unit Number Investigation

Debugging a production booking system bug

The agent didn't use MCP. It wrote log_analyzer.py, backup_tracer.py, processed gigabytes of data in a sandbox, and returned a 2-page analysis. Zero tool schemas in context.

Why Code-First Worked Here

This investigation required processing gigabytes of data—far too much to load into any LLM's context window:

→ booking1.log: 709 real transactions (after filtering test traffic)
→ CSV backups: Web_History, Contact, Appointment, WorkOrder objects
→ Cross-referencing needed: grep appointment IDs across multiple CSVs on remote servers
→ Total raw data: gigabytes (impossible for LLM context)

The Code-First Advantage

Instead of trying to load gigabytes into context, the agent created purpose-built analysis tools, ran them in a sandbox, and returned only the insights—reducing what would have been 100,000+ tokens of raw data to a 2,000-token summary.

Tools the Agent Created On-Demand

The agent didn't have pre-defined MCP tools. It wrote custom Python scripts as needed, each solving one part of the investigation:

1. log_analyzer.py

Purpose: Parse booking1.log into structured records

Processing: Group transactions by appointment ID, show booking1 → booking2 → booking3 sequence, flag where units were entered vs missing

Output: 157 bookings with units, 240 without units

2. backup_tracer.py

Purpose: SSH to backup server and trace appointment IDs through CRM objects

Processing: For each appointment with a unit, extract Web_History.unit__c, Contact.MailingStreet, Appointment.Street and compare

Output: 94% unit survival rate when users entered units

3. address_field_audit.py

Purpose: Show all address-related fields side-by-side across CRM objects

Processing: Highlight inconsistencies (unit in one object but missing in another)

Finding: WorkOrder.Street often empty—not the field the mobile app uses

4. unit_sanitisation_checker.py

Purpose: Test regex pattern that sanitizes user input

Bug Found: Over-aggressive pattern strips hyphens and commas

Impact: "17/43-51" → "17/4351" (mangled), affecting 6% of units

What the LLM Actually Saw

The agent processed gigabytes of logs and backups in the sandbox. What crossed back into the LLM's context? A compact 2,000-token summary:

CODE FLOW ANALYSIS (2025-11-06)

===================================

COMPLETE DATA FLOW IDENTIFIED

===================================

Step 1: Address autocomplete → booking1.php → GetSlotHandler.getSlot()

- Third-party address autocomplete strips/removes unit numbers (known limitation)
- booking1.php captures: street, suburb, state, postcode
- NO EXPLICIT UNIT FIELD
- Backend creates CRM records WITHOUT unit

Step 2: booking3.php (final form) → BookSlotHandler.newBooking()

- OPTIONAL "unit" field added here
- Updates Web_History.unit__c when provided
- Prepends "unit X " to Contact/Appointment/WorkOrder Street fields

UNIT SURVIVAL RATE (from backup analysis):

- When users enter units: 94% propagate correctly to CRM
- 6% corrupted by sanitization regex (strips hyphens/commas)

ROOT CAUSE:

1. Backend + CRM plumbing: mostly correct
2. Major real-world failure: many users never enter unit (field optional)
3. Minor bug: over-aggressive sanitization damages "17/43-51" style units

NEXT STEPS:

- Fix sanitization regex (allow hyphens, commas)
- Make unit field more prominent or required for apartments
- Confirm which CRM field mobile app reads

"The agent processed gigabytes. It returned a 2-page summary. No raw logs in context. No CSV dumps in context. Just analysis and findings."

MCP vs Code Execution: The Comparison

Two Approaches to the Same Investigation

❌ Hypothetical MCP Approach

• Tool: read_log_file → returns 100,000-token log dump to context
• LLM tries to parse in next message (error-prone, slow)
• Tool: ssh_and_grep → another 50,000 tokens added to context
• Tool: read_csv_backup → 200,000 tokens (exceeds context window)
• Each intermediate result flows through context, accumulating exponentially

Result: Context overflow, incomplete analysis, huge cost (~$50+ in tokens)

✓ Code-First Actual Approach

• Agent writes log_analyzer.py (processes locally in sandbox)
• Agent writes backup_tracer.py (SSHs, greps, summarizes results)
• Data stays in execution environment
• Returns findings in ~2,000 tokens

Result: Complete analysis, $0.01 cost, 99%+ token reduction

Key Lessons from the Investigation

Tool Creation Flexibility

What Happened

• Agent created 12+ tools total (not just the 4 listed)
• Extended initial 8 options to 20+ as investigation progressed
• Tools evolved: log cache, category cache, correlations
• Each tool purpose-built for specific sub-problem

Why It Matters

• Can't predict all needed tools upfront
• Investigation reveals new questions as it progresses
• Code-first allows dynamic tool creation
• MCP requires pre-defining all possible tools

Signal Extraction Critical

Raw data stayed in the sandbox. Only insights crossed into the LLM's context. "157 bookings with units" not "here are 157 booking records with full details..."

Impact: 99%+ token reduction by returning summaries instead of raw data

State Persistence Enabled Iteration

Agent saved intermediate CSVs and TSVs to workspace filesystem. Could resume investigation across sessions. Built reusable tools for similar future investigations.

Example: Created a log cache so repeated queries didn't re-process gigabytes each time

Privacy by Default

Customer names, addresses, phone numbers—all processed in the sandbox. Never entered the LLM's context. Only anonymized statistics and patterns were returned.

Security win: Sensitive PII stayed local; model only saw aggregate findings

Case Study 2: Autonomous Disk Architect (ADA)

When agents build their own investigation framework

The Agentic Approach: Plan-Act-Reflect-Recover

ADA is a closed-loop system demonstrating the full agent reasoning cycle. Unlike the unit-number investigation (which was a one-off debugging task), ADA is a reusable investigation framework that agents extend as they discover new patterns.

1. Plan (TASK.md / deep_space_investigation.py)

Objective: Find source of 32GB growth

Scope: 7-30 day time windows

Prioritize leads: AI development tools (Claude, Cursor, VSCode), system caches, diagnostics logs

2. Act (disk_analyzer.py / DISK_ANALYZER_ALLOCATION.py)

High-performance scan: Multi-threaded, queue-based filesystem traversal

Allocate storage: Categorize every file by app/purpose (Python, Xcode, Claude, System Cache)

External tool use: Query Tavily API to investigate unknown storage paths

3. Reflect (FINDINGS.md / analyze_storage.py)

Pandas analysis: Identify folders responsible for 80% of growth

Daily growth rates: Track consumption velocity

Suspicious patterns: Large files created during non-working hours

4. Recover (cleanup_storage.py / cleanup_disk.sh)

Generate script: Tailored, executable shell script targeting specific folders

Safety-guarded: Dry-run mode, validation checks

Measurable outcome: 15-20GB recovered from CoreSpotlight, System Diagnostics, Acronis Cache

The Tool Evolution Pattern

Here's where ADA demonstrates code-first's advantage over MCP: the agent extended its own toolset as the investigation progressed.

"It is anti-MCP—writes its own Python as it goes. Creates its own tools, loops. I initially created some tools to cache files, but then let it create/extend more tools. Eventually it added about 12 options to my original 8 options, then created a 2nd cache by category, and more than 20 other tools."

Tool Evolution Timeline

Initial Toolset (8 tools)

• Basic filesystem scanner
• File cache (pickle)
• Size aggregator
• Growth calculator
• Top folders reporter
• Date filter
• Extension analyzer
• Cleanup script generator

Agent-Extended Toolset (20+ tools)

• Category-based cache (2nd cache layer)
• AI tool footprint tracker (Claude, Cursor, VSCode)
• Temporal growth windows (1, 3, 7, 30 days)
• Culprit identification (80% rule)
• Suspicious pattern detector
• App allocation engine (100% allocation model)
• Tavily API integration (unknown path investigation)
• Dynamic work splitting (threshold-based optimization)
• ...12+ more specialized analysis tools

Technical Sophistication

ADA showcases production-grade patterns that would be difficult to achieve with MCP's pre-defined tool approach:

Concurrent Processing Architecture

Multi-threaded, queue-based consumer-producer model for filesystem traversal. Dynamic work splitting when directories exceed latency thresholds (saved to needs_split.txt for future optimization).

Technology: Python threading, Rich CLI for real-time progress bars

100% Allocation Model

Comprehensive path-matching patterns allocate every file to a category (Python, Xcode, Claude, System Cache). No vague "System Data"—actionable insights only.

Example: Tracked storage used by AI dev tools (Claude, Cursor, VSCode) separately for resource planning

Agentic Web Search Integration

Tavily API integration for investigating unknown/unclassified folders. Agent autonomously queries external data sources to enhance categorization quality.

Pattern: Agent using external tool to improve its own data quality—classic agentic behavior

Decoupled Visualization Layer

FastAPI microservice with REST endpoints (/api/allocation, /api/stats) and Chart.js web interface. Heavy analysis processing separated from real-time reporting.

Architecture win: Data pipeline delivers insights separately from core processing engine

Measurable Outcomes

Metric	Result
Storage recovered	15-20GB (targeting CoreSpotlight, System Diagnostics, Acronis Cache)
AI tool footprint tracked	Claude, Cursor, VSCode storage usage explicitly categorized
Tools created	Started with 8, agent extended to 20+ specialized analysis tools
Scan performance	Multi-threaded concurrent processing with dynamic work splitting
Deployment	FastAPI web interface + Python CLI tools + Bash cleanup scripts

Why MCP Couldn't Do This

ADA's tool evolution pattern is fundamentally incompatible with MCP's architecture:

Architectural Comparison

❌ MCP Limitations

• Tools must be pre-defined upfront—can't extend during investigation
• Each new tool = new schema loaded into context (token bloat)
• Agent can't write loops, conditionals, or complex data transformations
• Multi-threaded processing requires orchestration outside MCP
• No way to persist intermediate caches between tool calls

20+ tools × 500 tokens each = 10,000+ token tax before analysis even starts

✓ Code-First Capabilities

• Agent writes new tools as investigation reveals needs
• Tools are Python files in filesystem (no context bloat)
• Full programming language: loops, threads, Pandas, API calls
• Natural state persistence (pickle caches, DataFrame saves)
• Can build entire frameworks (CLI + API + web interface)

Agent only loads summaries of analysis results, not every tool definition

Chapter Summary

• Two case studies: Unit number debugging (one-off investigation) and ADA (reusable framework)
• Gigabyte-scale processing: Agents handle massive datasets in sandbox, return compact summaries
• Tool evolution: ADA started with 8 tools, agent extended to 20+ as investigation progressed
• Signal extraction: 99%+ token reduction by returning insights instead of raw data
• Production patterns: Multi-threading, Pandas analysis, API integration, web interfaces
• MCP limitations exposed: Can't extend tools during investigation, can't build complex frameworks

What's Next

Code-first wins for most use cases. But where does MCP actually shine? Testing with Playwright, sub-agent orchestration, and transport-layer use cases reveal the nuanced answer.

Chapter 7: The MCP Sweet Spot—when standards win.

The MCP Sweet Spot

When Standards Actually Win

MCP isn't dead—it's repositioned. There are specific use cases where MCP's standardization beats code-first flexibility. Here's when to choose each.

TL;DR

• Code-first is the default for 80%+ of scenarios, but MCP excels in specific niches: testing, sub-agent orchestration, and enterprise governance.
• MCP Playwright is "awesome" for browser automation—small tool count, formal schemas add value for testing workflows.
• Sub-agent pattern minimizes context blast: each sub-agent uses <20 MCP tools, writes findings to MD file (like a subroutine), main agent synthesizes.
• Hybrid approach (code interface + MCP transport) delivers both performance and standardization—best of both worlds.

Not "MCP Bad, Code Good"

Both patterns have valid use cases. The key insight isn't dogmatic rejection of MCP—it's understanding when standardization adds value and when it becomes overhead.

Code-first is the default for 80%+ scenarios. This chapter explores the 20% where MCP shines.

Engineering maturity means knowing when to use each pattern, not religious adherence to one approach.

Use Case 1: Testing and Browser Automation

Why MCP Playwright Wins

The Playwright MCP server provides standardized browser automation actions through a well-defined, stable interface. For testing workflows, this formal schema approach delivers real advantages over code generation.

MCP Playwright vs Code-First for Testing

✓ MCP Playwright Strengths

• Standardized actions: navigate, click, fill, screenshot, assert
• Small tool count (15-20 tools total)—no context bloat
• Formal schemas enforce consistent test patterns
• Validation critical: assertions must be precise
• Stable API rarely changes
• Repeatability > flexibility for test suites

✗ Code-First Alternative (less ideal)

• Would generate Playwright code each time
• Risk of subtle variations in selectors, waits
• Harder to enforce consistent test patterns
• No advantage since tool count is small
• Generated code may drift between test runs
• Less deterministic for regression testing

When Formal Schemas Help

Validation Critical

Test assertions must be precise. Formal schemas catch type errors, missing parameters, invalid selectors at tool-call time.

Example: playwright.assertText(selector, expected) validates both parameters exist before execution.

Small Tool Catalog

Tool count < 20 avoids context bloat. ~15 Playwright tools = ~3,000 tokens total.

Size target: Fits easily in context window without degrading agent reasoning.

Stable API

Playwright's browser automation API is mature and rarely changes. Tool definitions written once, reused thousands of times.

Contrast: Rapidly evolving internal APIs favor code-first (schema changes are friction).

# MCP Playwright Tool Example

playwright.navigate(url)
playwright.click(selector)
playwright.fill(selector, value)
playwright.screenshot(path)
playwright.assertText(selector, expected)

# Total: ~15 tools, ~3,000 tokens
# Still fits easily in context, formal validation valuable

Use Case 2: Sub-Agent Orchestration

The sub-agent pattern is where MCP's standardization shines without triggering context bloat. Main agent delegates to specialized sub-agents, each with focused tool access.

Architecture: Contained Context Blast Zones

┌─────────────────┐
│   Main Agent    │  ← Minimal context (orchestration only)
│  (Orchestrator) │
└────────┬────────┘
         │
         ├──► Sub-Agent A (MCP: GitHub tools only)
         │    └─► Writes findings to github_analysis.md
         │
         ├──► Sub-Agent B (MCP: Database tools only)
         │    └─► Writes findings to db_report.md
         │
         └──► Sub-Agent C (MCP: Monitoring tools only)
              └─► Writes findings to alerts.md

Why This Works

Context containment: Each sub-agent has <20 tools (avoids bloat). The "blast zone" stays within that sub-agent's execution.

Sub-agent output: Compact markdown file, not streaming context. Main agent reads MD files and synthesizes—never sees intermediate tool calls.

Subroutine pattern: Similar to function encapsulation in code. Sub-agent = subroutine with inputs (task), outputs (MD file), internal complexity hidden.

"Sub agents: the whole wasting/blowing up context blast zone is minimized. You get a sub-agent to use the MCP tools, and writes to a MD file when it's finished. This, funnily enough, is more like writing a subroutine—very similar."

— Author, from project notes

Benefits Over Monolithic Agent

✓ MCP standardization helps sub-agent swapping: Replace GitHub sub-agent with GitLab sub-agent without touching main orchestrator.
✓ Clear boundaries: GitHub sub-agent only accesses GitHub MCP server—no accidental cross-tool leakage.
✓ Governance: Can audit which sub-agent uses which tools, enforce least-privilege access.
✓ Parallelization: Run 3 sub-agents concurrently, each investigating different domain. Main agent waits for all MD files, then synthesizes.

Example Workflow: Infrastructure Health Analysis

Step 1: User asks: "Analyze our infrastructure health"
Step 2: Main agent spawns 3 sub-agents:
- • GitHub sub-agent: Check PR velocity, open issues, deployment frequency
- • Database sub-agent: Check query performance, connection pools, slow queries
- • Monitoring sub-agent: Check error rates, P95 latency, alert history
Step 3: Each sub-agent:
- • Uses MCP tools within its domain
- • Processes intermediate results locally
- • Writes 1-2 page MD summary with key findings
Step 4: Main agent:
- • Reads 3 MD files (total: ~2,000 tokens)
- • Synthesizes cross-domain insights
- • Returns unified infrastructure health report
Total context in main agent: Never sees sub-agent intermediate steps—only final summaries.

Use Case 3: Enterprise Multi-Agent Orchestration

For large enterprises coordinating 10+ agents from different vendors, MCP's neutral interface provides governance and interoperability benefits that outweigh the performance tax.

Why MCP Helps at Enterprise Scale

Neutral interface: No vendor lock-in to one SDK. Agent from Vendor X can call Tool from Vendor Y through standard MCP protocol.

Centralized tool registry: IT governance team maintains catalog of approved tools, tracks usage, enforces policies.

Permission layers: Control which agents access which tools at protocol level. Security team can audit all agent→tool interactions.

Interoperability: Agents built by different teams/vendors coordinate without custom integration code.

Who This Applies To

• Large enterprises: Fortune 500, regulated industries
• Heterogeneous ecosystems: Agents from multiple vendors, built by different teams
• Compliance/governance: Strict requirements for auditability, access control
• <20% of teams: Most teams build focused, single-vendor agents and don't need this complexity

Trade-Off Accepted

When Governance > Performance

Enterprise orchestration scenarios where the trade-off makes sense:

• Performance cost acceptable: Token bloat is manageable when budgets are enterprise-scale and compliance is critical.
• Speed less critical than auditability: Regulated industries prioritize "can we prove what the agent did?" over "did it respond in 2 seconds?"
• Cost less sensitive: Enterprise IT budgets absorb the 10-50x token overhead if it delivers governance benefits.

"MCP may have value for large enterprises coordinating multiple agents from different vendors, where standardization enables governance and interoperability."

— From pre-think analysis

Use Case 4: MCP as Backing Transport (Best of Both Worlds)

The hybrid pattern delivers both code-first performance and MCP standardization by using code as the interface while MCP handles transport, auth, and connection management.

Hybrid Architecture

Agent writes TypeScript code
  ↓
import { gdrive, salesforce } from './servers'
  ↓
./servers/* files use MCP protocol under the hood
  ↓
MCP handles: auth, connections, protocol negotiation
  ↓
Agent never sees MCP schemas in context

"Cloudflare published similar findings, referring to code execution with MCP as 'Code Mode.' The core insight is the same: LLMs are adept at writing code and developers should take advantage of this strength to build agents that interact with MCP servers more efficiently."

— Anthropic, "Code execution with MCP" blog post

Benefits: Code Interface + MCP Transport

✓ Code as interface: Training distribution match (LLMs excel at TypeScript/Python, not tool schemas).
✓ MCP as transport: Standardization for auth, connections, OAuth flows without custom integration code.
✓ No token bloat: MCP schemas not loaded into context—agent sees only TypeScript function signatures.
✓ Interoperability: Compatible with MCP ecosystem—can swap MCP servers without changing agent code.

Example: Code-First with MCP Backing

// agent_code.ts

import { getDocument } from './servers/google-drive';
import { updateRecord } from './servers/salesforce';

const doc = await getDocument({ documentId: 'abc123' });
await updateRecord({
objectType: 'Lead',
recordId: 'xyz',
data: { Notes: doc.content }
});

// Under the hood:
// - ./servers/google-drive/getDocument.ts calls MCP server
// - MCP handles OAuth, token refresh, connection pooling
// - Agent just writes normal TypeScript

What Makes Hybrid Compelling

Performance Win

Code-first avoids context bloat. Agent generates compact TypeScript, not verbose tool-call JSON.

Example: ~2,000 tokens vs ~150,000 tokens for equivalent MCP workflow.

Standardization Win

MCP provides interop layer—swap Google Drive MCP server for Dropbox MCP server without changing agent code.

Example: Change one line in ./servers/ config, agent code unchanged.

Flexibility Win

Can add/remove MCP servers without retraining agent or updating prompts.

Example: New Salesforce server → add import, agent discovers via filesystem.

Familiarity Win

Developers write normal code with types, autocomplete, linting—not JSON schemas.

Example: TypeScript LSP provides type hints for getDocument() parameters.

Decision Framework: MCP vs Code-First

Engineering maturity means choosing the right tool for the job. Here's when each pattern wins:

Decision Path: Which Pattern to Choose?

✓ Choose MCP Direct Calling When:

• Tool count < 20 (small enough to avoid bloat)
• Formal validation critical (testing, compliance, assertions)
• Standardization > performance (enterprise orchestration, governance)
• Sub-agent pattern (contained context blast with MD output)
• Stable API (rarely changing tool definitions, mature interfaces)

Outcome: MCP's formal schemas add value, token overhead is acceptable.

✓ Choose Code-First Execution When:

• Tool count > 20 (bloat becomes significant issue)
• Performance critical (cost, latency, throughput matter)
• Complex workflows (multi-step, data processing, transformations)
• Privacy-sensitive data (PII must stay in sandbox, not context)
• Developer productivity (single-team agents, rapid iteration)
• Large datasets (gigabytes of logs/backups can't fit in context)

Outcome: 98.7% token reduction, training distribution alignment, flexible composition.

✓ Choose Hybrid (Code + MCP Transport) When:

• Want both performance AND standardization
• Multiple MCP servers, but need efficiency (can't pay token tax)
• Can invest in code-generation layer (wrapper files in ./servers/)
• Willing to maintain abstraction layer (TypeScript wrappers around MCP)

Outcome: Best of both worlds—code interface for performance, MCP transport for interop.

Chapter Summary

MCP's Real Strengths:

1. Testing/browser automation: MCP Playwright excels (small tool count, formal validation adds value)
2. Sub-agent orchestration: Subroutine pattern minimizes context blast, MD file output encapsulates complexity
3. Enterprise multi-agent coordination: Governance, auditability, vendor-neutral interoperability
4. Backing transport for code-first agents: Hybrid pattern delivers performance + standardization

Code-First Remains Default:

• 80%+ of use cases favor code execution
• Performance, flexibility, privacy advantages proven
• Training distribution alignment (models excel at code, not schemas)

Mature Engineering Position:

• Not "kill MCP"—understand when to use each pattern
• Code-first default, MCP for specific niches (testing, sub-agents, enterprise)
• Hybrid pattern (code interface + MCP transport) is promising future direction

"Sub agents: the whole wasting/blowing up context blast zone is minimized. You get a sub-agent to use the MCP tools, and writes to a MD file when it's finished. This, funnily enough, is more like writing a subroutine."

— Author

Next Chapter Preview:

Chapter 8 provides an implementation roadmap—how to apply code-first pattern in your projects, migration strategies for existing MCP implementations, and team buy-in tactics.

Chapter 8: Implementation Roadmap | Code-First Agents

Implementation Roadmap

From Architectural Decision to Production Deployment

"Next time someone says 'we need to use MCP because it's the standard,' ask them: what problem are we solving—interoperability or performance? For most teams, the answer is performance. And for that, code wins."

TL;DR

• Eight-week roadmap takes you from assessment to production deployment with gradual rollout
• Migration strategy: hybrid approach lets you keep MCP as backing transport while agents write code
• ROI example: $153,840 annual savings with less than one month payback period
• Leadership pitch: frame as performance win backed by Anthropic and Cloudflare, not architectural failure

The 8-Week Implementation Timeline

This roadmap is designed for progressive adoption with minimal disruption. You'll validate the pattern early, build confidence with real metrics, and scale systematically.

Phase 1: Assess Current State (Week 1)

Measure Token Waste:

Enable logging in your LLM provider (OpenAI, Anthropic, etc.)
Track tokens per request: input vs output
Calculate overhead: (total tokens - useful work) / total tokens
Identify workflows with >50% overhead

Identify High-Impact Candidates:

Multi-step workflows (> 3 tool calls)
Large dataset processing (logs, databases, CSVs)
Privacy-sensitive data flows
Workflows causing user complaints (slow, expensive)

Team Readiness Check:

Do you have engineers comfortable with Python/TypeScript?
Can you run Docker/Kubernetes? (for sandboxing)
Is there executive support for infrastructure investment?

Phase 2: Sandbox Proof-of-Concept (Week 2-3)

Choose Sandbox Provider:

Quick start: E2B (batteries-included, cloud-hosted)
Enterprise: Daytona (self-hosted, Kubernetes)
Scale: Cloudflare Workers (millisecond starts, lowest cost)
ML workloads: Modal (GPU support, large datasets)

Set Up Secure Environment:

Resource limits: 100MB memory, 10s timeout (start conservative)
Network policies: block internet OR whitelist only
File system: isolated per execution
Logging: capture all stdout/stderr

Test with Single Workflow:

Pick simplest multi-step workflow
Write code-generation prompt: "Generate Python code to..."
Execute in sandbox, return summary
Compare: tokens used vs MCP equivalent

Success Criteria:

Code executes successfully in sandbox
Token reduction > 50% vs MCP approach
No security incidents in testing
Latency acceptable (< 5s total)

Phase 3: Productionize One Agent (Week 4-6)

Build Code-First Agent:

Prompt engineering: guide agent to write clean, focused code
Error handling: sandbox returns stack traces, agent can debug
Progressive disclosure: provide tool docs on-demand (filesystem or search)
Skills library: save reusable functions to ./skills/

Implement Monitoring:

Token usage per request
Execution time (sandbox startup + code run)
Error rates (code generation failures, runtime errors)
Cost tracking (sandbox + LLM costs)

Safety Guardrails:

Code review for high-risk operations (optional)
Rate limiting (prevent abuse)
Audit logging (who ran what, when)
Alert on anomalies (unusual patterns, quota breaches)

Test in Staging:

Run production-like workloads
Load testing (concurrent requests)
Failure mode testing (malformed code, resource exhaustion)
Security testing (attempt to escape sandbox)

Phase 4: Team Rollout (Week 7-8)

Documentation for Developers:

When to use code-first vs MCP (decision framework)
Sandbox access instructions
Example agent prompts
Debugging guide (reading logs, attaching debuggers)

Training Sessions:

Show token comparison (MCP vs code-first)
Live demo: agent writing and executing code
Security best practices
Q&A on common concerns

Gradual Migration:

Don't rewrite all agents at once
Start with new agents (greenfield)
Migrate high-waste workflows first
Keep low-tool-count agents on MCP if working

Migration Strategy: Existing MCP Agents

// Before (MCP direct):

tool_call("salesforce.updateRecord", { objectType: "Lead", ... })

// After (code-first with MCP transport):

import { updateRecord } from './servers/salesforce'; await updateRecord({ objectType: "Lead", ... }); // (updateRecord.ts calls MCP under the hood)

Refactor Sequence

Week 1: Create ./servers/ wrappers for existing MCP tools
Week 2: Update agent prompts to generate code (not call tools)
Week 3: Remove tool schemas from context (keep as code imports)
Week 4: Measure improvement, iterate

"If you've already built on MCP, this isn't wasted effort. You learned what works and what doesn't—that's valuable. And the good news: code execution can coexist with MCP. Use MCP for transport/auth, write code as the interface. You can migrate progressively, not throw everything away."

Selling the Change to Leadership

Frame this as a performance and cost optimization, not an admission that "we were wrong." The data supports the transition.

Key Talking Points

Frame as Performance + Cost Win

• Token costs reduced 50-98%
• Latency improved 5-10x
• Better privacy (GDPR, HIPAA compliance)
• Proven by Anthropic and Cloudflare (not risky bet)

Validated by Industry Leaders

• Anthropic published the research
• Cloudflare deployed in production
• NVIDIA, Google Cloud recommend approach
• Docker uses similar patterns

Address Common Concerns

"Is this secure?"

Sandboxing is production-ready (E2B, Daytona, Modal, Cloudflare). Defense-in-depth: isolation + resource limits + network controls. NVIDIA, Google Cloud, Docker all recommend this approach.

"What if MCP improves?"

Can switch back if MCP solves bloat (unlikely—training distribution issue). Code-first and MCP aren't mutually exclusive (hybrid approach). Future-proof: code is more flexible than formal schemas.

"Do we have expertise?"

Engineers already know Python/TypeScript (core competency). Sandbox providers handle infrastructure complexity. Lower learning curve than mastering MCP quirks.

ROI Calculation Example

Current State (MCP): - 1,000 agent requests/day - 150,000 tokens average per request - 150M tokens/day = $450/day ($13,500/month) After Code-First: - 1,000 agent requests/day - 2,000 tokens average per request - 2M tokens/day = $6/day ($180/month) Financial Impact: Savings: $13,320/month Sandbox costs: ~$500/month (Daytona/Modal) Net savings: $12,820/month ($153,840/year) Payback period: < 1 month

Common Pitfalls and Solutions

Pitfall 1: Over-engineering sandbox security

Don't build custom isolation from scratch. Start with provider defaults (E2B, Daytona already secure). Use defense-in-depth, not perfect security.

Solution: Trust proven infrastructure, layer controls instead of building from zero.

Pitfall 2: Generating overly complex code

Agents can produce bloated, hard-to-debug code if prompts aren't well-structured.

Solution: Prompt engineering—guide agent to write simple, focused code. Break large tasks into smaller blocks. Use skills library for common patterns.

Pitfall 3: Not monitoring token usage

Assuming savings without measuring. Hidden costs: LLM calls to generate code.

Solution: Track everything, compare before/after. Include code-generation tokens in total cost analysis.

Pitfall 4: Trying to migrate everything at once

Big-bang rewrites fail. Teams get overwhelmed, quality drops, rollback becomes impossible.

Solution: Gradual rollout, high-impact workflows first. Validate each migration step before expanding.

Success Metrics to Track

Metric	Calculation	Target
Token Reduction	(MCP tokens - Code tokens) / MCP tokens	>70% reduction
Latency Improvement	MCP time / Code time	3-5x faster
Cost Savings	Monthly LLM bill before vs after	>$10k/month saved (at scale)
Error Rates	Code execution failures vs MCP tool failures	<5% failure rate
Developer Satisfaction	Survey team on agent workflow	>80% prefer code-first

Phase 5: Scale and Optimize (Ongoing)

After successful rollout, continue to refine and expand the pattern across your organization.

Expand to More Agents

Apply pattern to additional workflows. Build skills library (reusable functions across agents). Share learnings across team.

Cost Optimization

Adjust sandbox resource limits (right-size). Cache commonly-used tool docs (reduce re-fetching). Batch operations where possible (reduce LLM calls).

Continuous Improvement

Track metrics: token costs, latency, error rates. A/B test: code-first vs MCP for ambiguous cases. Iterate on prompt engineering (better code generation). Update skills library (agents get smarter over time).

Conclusion: The Future Is Code-First

What We've Learned

Anthropic's research: 98.7% token reduction with code execution
Root cause: LLMs trained on code, not tool schemas (training distribution bias)
Code-first architecture: progressive disclosure, signal extraction, privacy, skills
Sandboxing is solved (E2B, Daytona, Modal, Cloudflare)
Real case studies: gigabyte-scale investigation with minimal tokens
MCP has niches (testing, sub-agents, enterprise orchestration)
Implementation is gradual (not big-bang rewrite)

What Changes

Individual developers: Stop fighting MCP limitations, write agents like normal code
Engineering teams: Faster, cheaper, more reliable agents
Industry: Honest conversations about performance vs ecosystem tradeoffs

"This is engineering self-correction, not failure. Anthropic and Cloudflare showed intellectual honesty. They published research that challenges their own work. The industry learns: standards must work, not just exist."

Your Next Step

Measure your current token waste (Week 1)
Test code-first with one workflow (Week 2-3)
Share findings with your team
Make informed architectural decision (performance vs governance)

Final Call to Action

Next time someone says "we need to use MCP because it's the standard," ask them: what problem are we solving—interoperability or performance?

For most teams, the answer is performance. And for that, code wins. Not because it's trendy. Because the data proves it.

"Code execution with MCP enables agents to use context more efficiently by loading tools on demand, filtering data before it reaches the model, and executing complex logic in a single step."

— Anthropic Engineering Blog

References & Sources

This ebook synthesizes research from primary sources (Anthropic, Cloudflare), industry practitioners (NVIDIA, Google Cloud, Docker), production sandbox providers (E2B, Daytona, Modal), and community analysis. All sources were accessed and verified between November 2024 and January 2025. Direct quotes appear throughout the chapters with inline citations.

Primary Sources (Creators & Implementers)

Anthropic: Code execution with MCP
The foundational research documenting the 98.7% token reduction (150,000 → 2,000 tokens) when presenting MCP servers as code APIs rather than direct tool calls. Explains progressive disclosure, signal extraction, and privacy benefits of code-first patterns.
https://www.anthropic.com/engineering/code-execution-with-mcp

Cloudflare: Code Mode
Independent validation of Anthropic's findings. Introduces "Code Mode" as production implementation of code-first agent architecture. Explains training distribution mismatch: "LLMs have an enormous amount of real-world TypeScript in their training set, but only a small set of contrived examples of tool calls."
https://blog.cloudflare.com/code-mode/

MCP Context Bloat Analysis

MCP Context Bloat (jduncan.io)
Real-world analysis of token consumption in multi-server MCP deployments. Documents 50,000+ token baseline before agent interaction. Notes GitHub MCP server alone consumes 55,000 tokens for 93 tools (Simon Willison research).
https://jduncan.io/blog/2025-11-07-mcp-context-bloat/

Optimising MCP Server Context Usage (Scott Spence)
Case study of Claude Code session consuming 66,000+ tokens before conversation start. Breakdown shows mcp-omnisearch server alone used 14,214 tokens for 20 tools with verbose descriptions, parameters, and examples.
https://scottspence.com/posts/optimising-mcp-server-context-usage-in-claude-code

Cursor's 40-Tool Barrier (Medium - Sakshi Arora)
Documents practical limit of ~40 tools in Cursor's MCP implementation. Explains challenges in AI tool selection accuracy and LLM context window constraints at scale.
https://medium.com/@sakshiaroraresearch/cursors-40-tool-tango-navigating-mcp-limits-213a111dc218

The MCP Tool Trap (Jentic)
Analysis of token bloat mechanism and declining reasoning performance. Explains how tool descriptions crowd context window, reducing space for project context and agent chain-of-thought.
https://jentic.com/blog/the-mcp-tool-trap

Anthropic Just Solved AI Agent Bloat (Medium - AI Software Engineer)
Community analysis of MCP context consumption in complex workflows (150,000+ tokens). Discusses requirement to load all tool definitions upfront and pass intermediate results through context window.
https://medium.com/ai-software-engineer/anthropic-just-solved-ai-agent-bloat-150k-tokens-down-to-2k-code-execution-with-mcp-8266b8e80301

We've Been Using MCP Wrong (Medium - Meshuggah22)
Engineering validation of independent convergence by Anthropic and Cloudflare teams on code-first pattern. Explains how both organizations discovered same solution without coordination.
https://medium.com/@meshuggah22/weve-been-using-mcp-wrong-how-anthropic-reduced-ai-agent-costs-by-98-7-7c102fc22589

Training Data & LLM Research

OpenCoder: Top-Tier Open Code LLMs
Documents scale of code corpus in pre-training: 2.5 trillion tokens (90% raw code, 10% code-related web data). References RefineCode dataset with 960 billion tokens across 607 programming languages.
https://opencoder-llm.github.io/

Even LLMs Need Education (Stack Overflow Blog)
Explains role of Stack Overflow's curated, vetted programming data in training LLMs that understand code. Discusses quality data as foundation for code generation capabilities.
https://stackoverflow.blog/2024/02/26/even-llms-need-education-quality-data-makes-llms-overperform/

Function Calling with Open-Source LLMs (Medium - Andrei Rushing)
Technical explanation of function-calling training with special tokens. Documents Mistral-7B-Instruct tokenizer defining [AVAILABLE_TOOLS], [TOOL_CALLS], [TOOL_RESULTS] tokens. Discusses performance gap between native function-calling and synthetic training.
https://medium.com/@rushing_andrei/function-calling-with-open-source-llms-594aa5b3a304

mlabonne/llm-datasets (GitHub)
Catalog of specialized fine-tuning datasets for function calling. Lists xlam-function-calling-60k (60k samples) and FunReason-MT (17k samples) as representative datasets—orders of magnitude smaller than code corpora.
https://github.com/mlabonne/llm-datasets

Awesome LLM Pre-training (GitHub - RUCAIBox)
Repository documenting large-scale pre-training datasets. References DCLM (3.8 trillion tokens from web pages) and Dolma (3 trillion tokens) corpus, demonstrating scale difference vs function-calling datasets.
https://github.com/RUCAIBox/awesome-llm-pretraining

LLMDataHub (GitHub - Zjh-819)
Tracks code-specific training data including StackOverflow posts in markdown format (35GB raw data). Illustrates real-world code examples available during pre-training phase.
https://github.com/Zjh-819/LLMDataHub

Production Sandboxing Solutions

Mastering AI Code Execution with E2B (ADaSci)
Overview of E2B's Firecracker microVM-based sandboxes for AI-generated code execution. Describes isolated cloud environments functioning as small virtual machines specifically designed for LLM output safety.
https://adasci.org/mastering-ai-code-execution-in-secure-sandboxes-with-e2b/

Top Modal Sandboxes Alternatives (Northflank)
Comparative analysis of E2B, Daytona, Modal, and Cloudflare Workers. Documents 150ms E2B cold starts, gVisor containers in Modal, and millisecond-latency isolates in Cloudflare. Provides production feature comparison matrix.
https://northflank.com/blog/top-modal-sandboxes-alternatives-for-secure-ai-code-execution

Open-Source Alternatives to E2B (Beam)
Deep dive on Daytona's 90-200ms cold starts with OCI containers. Explains AGPL-3.0 licensing and positioning as open-source platform for both AI code execution and enterprise dev environment management.
https://www.beam.cloud/blog/best-e2b-alternatives

Awesome Sandbox (GitHub - restyler)
Curated list of sandboxing technologies for AI applications. Documents E2B's Firecracker microVMs and Daytona's stateful persistence with robust SDK for programmatic control.
https://github.com/restyler/awesome-sandbox

AI Sandboxes: Daytona vs microsandbox (Pixeljets)
Analyzes Daytona's enterprise integration capabilities. Discusses Kubernetes deployment with Helm charts, Terraform infrastructure management, and DevOps expertise requirements for self-hosting.
https://pixeljets.com/blog/ai-sandboxes-daytona-vs-microsandbox/

Top AI Code Sandbox Products (Modal Blog)
Modal's production capabilities overview: scaling to millions of daily executions, sub-second starts, networking tunnels, and per-sandbox egress policies for database/API interaction without infrastructure exposure.
https://modal.com/blog/top-code-agent-sandbox-products

Docker + E2B: Building the Future of Trusted AI
Partnership announcement providing developers fast, secure access to hundreds of real-world tools without sacrificing safety or speed. Discusses production readiness of sandboxed execution.
https://www.docker.com/blog/docker-e2b-building-the-future-of-trusted-ai/

Security & Best Practices

How Code Execution Drives Key Risks in Agentic AI Systems (NVIDIA)
NVIDIA AI red team analysis positioning execution isolation as mandatory security control. Documents RCE vulnerability case study in AI-driven analytics pipeline. Emphasizes treating LLM-generated code as untrusted output requiring containment.
https://developer.nvidia.com/blog/how-code-execution-drives-key-risks-in-agentic-ai-systems/

Secure Code Execution in AI Agents (Medium - Saurabh Shukla)
Defense-in-depth approach for mitigating LLM code execution risks. Explains sandboxing as restricting execution to limited environment with controlled host system access.
https://saurabh-shukla.medium.com/secure-code-execution-in-ai-agents-d2ad84cbec97

Agent Factory: Securing AI Agents in Production (Google Cloud)
Google Cloud's production security model using gVisor sandboxing on Cloud Run. Documents OS isolation and ephemeral container benefits preventing long-term attacker persistence.
https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-securing-ai-agents-in-production

Industry Adoption & Validation

Anthropic Turns MCP Agents Into Code First Systems (MarkTechPost)
Third-party engineering analysis validating Anthropic's approach as "sensible next step" directly attacking token costs of tool definitions and intermediate result routing through context windows.
https://www.marktechpost.com/2025/11/08/anthropic-turns-mcp-agents-into-code-first-systems-with-code-execution-with-mcp-approach/

Note on Research Methodology

All sources were accessed and verified between November 2024 and January 2025. Primary sources (Anthropic, Cloudflare) were extracted with advanced search depth using Tavily search API. Citations were cross-validated across multiple independent sources to ensure accuracy. Evidence quality follows three-tier classification: Tier 1 (creators/implementers), Tier 2 (enterprise practitioners), and Tier 3 (community analysis). All quoted material appears exactly as published in original sources with contextual attribution.

Total sources cited: 40+ unique URLs
Total quoted snippets: 60+ direct quotes with inline attribution
Research timeframe: November 2024 - January 2025
Primary search tool: Tavily (advanced depth mode) + content extraction

Why Code-First AgentsBeat MCP by 98.7%

What You'll Learn

The Confession

TL;DR

The Rare Moment of Engineering Honesty

The Numbers: 98.7% Token Reduction

Independent Validation: Cloudflare Code Mode

Production Scale

Independent Discovery

Cost Validation

What This Isn't: Reframing as Engineering Self-Correction

The Two Problems Anthropic Identified

Problem 1: Tool Definitions Overload the Context Window

Problem 2: Intermediate Results Consume Additional Tokens Exponentially

Token Accumulation Through Multi-Step Workflow

Industry Implications

Impact by Stakeholder

Individual Developers

Engineering Teams

The Industry

What's Coming in This Ebook

Chapter 2: The Bloat Problem

Chapter 3: Training Distribution Bias

Chapter 4: Code-First Architecture

Chapter 5: Production Sandboxing

Chapter 6: Case Study—Unit Number Investigation

Chapter 7: The MCP Sweet Spot

Chapter 8: Implementation Roadmap

Chapter Summary

The Bloat Problem: Why Agents Get Dumber as You Add Tools

The Context Window: A Finite Resource

Typical Context Window Sizes (2025)

Problem 1: The Upfront Tool Definition Tax

Small Setup (10 tools)

Medium Setup (30 tools)

Large Setup (100 tools)

Real-World Impact: The GitHub MCP Server

Problem 2: Exponential Intermediate Result Accumulation

Worked Example: Google Drive → Salesforce Workflow

Step 1: Retrieve Document from Google Drive

Step 2: Update Salesforce with Transcript

The MCP Context Bloat Mechanism

The Compounding Effect: Multi-Step Workflows

Scenario: Research Agent (10 Websites)

Cost Impact: The Hidden Waste

What You See vs What Gets Billed

The Token Accounting Illusion

Visible to User

Actually Billed (30-tool MCP)

The Performance Degradation Mechanism

1. Signal-to-Noise Ratio Collapse

2. Context Window Crowding

3. Latency Increases

4. Selection Paralysis

The Scale Thresholds

When Does This Become Critical?

Chapter Summary

The Developer's Verdict: What Theo Browne Really Thinks About MCP

Why This Developer's Opinion Carries Weight

Theo Browne

The Moment of Vindication

The Web3 Pattern

Three Technical Failures Theo Won't Let Slide

1. The Spec Itself Is Fundamentally Incomplete

2. Models Get Dumber, Not Smarter, With More Tools

First Tool Call

Second Tool Call

Third Tool Call

Fourth Tool Call

3. The Creators Are Now Admitting It Doesn't Work

Why Code Actually Works Better: Theo's Explanation

The Training Distribution Insight

Real-World Evidence: Tools vs. Results

The Security Strawman

MCP's Security Reality

Code Execution Security

Where Theo Is Spot-On (And Where He Overstates)

Where Theo Is Absolutely Right

Five Points Developers Will Recognize

Where Theo Overstates the Case

Why Code-First Agents
Beat MCP by 98.7%