Context Arbitrage: Turn Intelligence from Opex into Capex

SF Scott Farrell July 4, 2026 scott@leverageai.com.au LinkedIn

AI Strategy · Field Guide

Context Arbitrage: Turn Intelligence from Opex into Capex

Your failing, expensive agent is usually a missing capital asset — not a missing capability. Compile the context once, and a cheap model behaves as if it knows your world. That is a price spread you can capture on purpose.

Scott Farrell · LeverageAI · leverageai.com.au

📚 Read the full field guide

Go deeper — the complete ebook works the full cost model and break-even arithmetic, takes apart the Gmail case as a controlled experiment across the three theories of why agents fail, scales the trade to an organisation’s reporting stack, and answers the cold-start objection: Context Arbitrage →

TL;DR

  • Model capability is an operating expense — paid on every call, forever, depreciating with each release. A compiled worldview is a capital asset — comprehension paid once, amortised across every future call, appreciating under maintenance.
  • That flip lets you capture the frontier-to-utility price spread on every task whose difficulty was context-depth masquerading as intelligence-depth. Call it context arbitrage.
  • The tell you were overpaying: adding the substrate raises the quality ceiling and lowers the model floor at once. You were paying frontier prices to compensate for a missing asset — not to buy intelligence the task needed.

I gave one of my scheduled agents access to my Gmail and asked it to do the most ordinary job in agentics: triage. Tell me what deserves my attention; leave the rest alone. It was hopeless — loud about receipts and newsletters, silent on the thread that actually mattered. I did the thing the whole industry trains you to do: I reached for a better model. It sorta helped. That faint improvement, and the money I spent chasing it, was the whole misdiagnosis.

Because the expensive part of that agent was never the decision. Is this worth interrupting me for is a single, cheap judgement. The expensive part was everything the model had to read and understand before it could make that judgement well — and I was paying frontier prices for comprehension while calling it intelligence. In most agent systems that feel expensive, you are paying frontier prices to read, not to decide, and reading is precisely the thing a cheap, cached model now does almost for free.

Two different things on one invoice

Split the bill and the strategy falls out of the accounting. There are two ways an agent gets the intelligence it needs, and they behave like completely different line items.

Model capability is opex. You rent it, pay for it on every call forever, and it depreciates — the premium you paid for this quarter’s cleverness evaporates the day a cheaper model clears the same bar. A compiled worldview is capex. You build it once; comprehension is paid a single time per source, turned into a durable, structured artefact, and from then on every call reads from that asset instead of re-deriving it. It does not depreciate with the next release — it appreciates under maintenance, because each pass adds structure the last one couldn’t see.

“Wiki plus a mini model” versus “frontier model plus raw context” is substituting capital for marginal spend — the oldest industrialisation move there is. You build the machine once so each unit costs less forever.

And the result is not “cheaper at the same quality.” The substrate raises the quality ceiling and lowers the model floor at the same time — better and cheaper, simultaneously. That double movement is diagnostic. If a missing asset weren’t the problem, adding it couldn’t both improve quality and reduce the model requirement. That it does both is the receipt: you were paying frontier prices to compensate for a missing asset, not to buy intelligence the task actually needed.

Name the trade: context arbitrage

There is a large — and, on every trend line I can find, widening — price gap between the frontier tier and the utility tier of models. A flagship and a utility model from the same lab differ in per-token price by an order of magnitude or more,1 and prefix caching widens the effective gap further on any workload that re-reads a stable context.2 A compiled worldview is a stable context — so it lets you buy the same underlying value on the cheap side of that spread while the market keeps pricing the expensive side.

Context arbitrage

Capture the price spread between frontier and utility models on tasks whose real difficulty is depth of context, not depth of intelligence — by paying once to compile the context into a durable asset, then running the recurring work on a utility model that performs as if it knows the domain, because it does.

This sits on top of a pattern I’ve written about before. The Scout and the Senior split a single agent along a time seam — a cheap read-only scout that explores, a frontier senior that decides — so you buy frontier judgement without paying frontier prices for the exploration that feeds it.3 That was the shape of one agent’s loop. This is about the asset both halves read from, and the accounting that asset changes.

Most agent work is triage in a costume

The arbitrage only pays if you can spot the tasks it applies to. Look at the agentic work that actually fills an organisation and one pattern dominates: routing, triaging, drafting, flagging, reconciling. None of these are hard decisions. Each is a simple call that is impossible until you’ve read and understood a great deal of context. Which team owns this — given how this org actually divides work? Is this the one anomaly that matters — here? Reply in whose voice — given this relationship’s history? The difficulty lives entirely in the context, not the reasoning.

The reframe

This is difficulty in disguise — context-depth wearing an intelligence costume. The task presents as “needs a smarter model” and is really “needs the context compiled.”

The costume is why the whole industry keeps making the same mistake in the same order. The agent underperforms; the failure gets read as a capability gap; the response is to wait for or pay for a bigger model; it helps a little — because raw intelligence can partly compensate for missing context, at a price — and that little bit of help confirms the wrong diagnosis. So the loop repeats on every release, each turn spending opex on what was always a capital problem.

You don’t need a framework to break the loop, just one question run as an experiment: does a bigger model fix it, or does more context fix it? Hold the model fixed and add the context the task keeps reaching for. If quality jumps, the difficulty was context-depth — you’ve found an arbitrage task, and the move is substrate, not spend. Fairness demands the boundary too: closed-world, tool-shaped tasks — a CI fixer, a form-filler, anything whose knowledge lives entirely in its APIs — really are capability or tooling problems, and the substrate does nothing for them. The arbitrage is everywhere the hard part is “you’d have to know a lot about us to get this right.”

The existence proof, sitting in my cron logs

Back to the inbox. I gave that same hopeless agent a wiki — a compiled worldview built from a few years of my email: who these people are, what we have in flight, what I care about, what I’ve already handled. I didn’t change the model. I didn’t rewrite the agent. I changed the input. And the same agent became a genius at triage — so much so that it now hardly ever tells me about my email at all, which turns out to be roughly the right angle.

Before — a stranger reading my inbox
  • Interrupted for newsletters, receipts, routine cc’s
  • Missed the quietly urgent thread with no keywords
  • Judged each email in isolation
  • Noisy and untrustworthy — so I stopped relying on it
After — an agent that knows my world
  • Interrupts only for the genuinely novel or time-critical
  • Catches the one that matters because it knows what’s in flight
  • Judges each email against what I already know
  • Mostly silent — and the silence is the point

Because nothing else moved, the experiment is unusually clean. The reason it works is that triage is a diff operation: “is this worth interrupting me for” is a comparison between the new thing and everything I already know, and a diff is impossible without something to diff against. The wiki is that something. Agent judgement quality is mostly a function of worldview access, not model capability — which is a mechanism I unpack in full in the ebook, but you can feel it here: the same operation that’s impossible for a stranger is trivial for an agent that has a model of you.

A genius reading your inbox cold is still a stranger. And a triage agent that talks a lot isn’t working — silence is the expensive, high-judgement output.

The commercially important part is that this yields a claim you can falsify in your own logs: a utility-class model with a compiled worldview outperforms a frontier model without one, on a real workload. Mine runs on a mini-class model and does the job perfectly; a bigger model without the wiki never did. That’s not a benchmark someone else ran — it’s a specific inbox, a specific result, demonstrable today.

What the trade costs, honestly

Fit, not hype — so concede the capex. Wiki ingestion is expensive (many model calls per source) and lossy (the compiled page is deliberately smaller than the source, so relationships fit). It only amortises when the same compiled understanding is reused. Model it in three quantities: ingestion capex C, per-call saving S (frontier cost minus utility cost on the task), and break-even N* = C / S. Express break-even as a call count, not a payback period, and the discipline becomes obvious.

The rule that decides it

  • Compile the high-frequency, context-shaped tasks — triage, routing, reconciliation. A task that fires on every email crosses its break-even in days, and pays the build back thousands of times.
  • Don’t compile the one-off lookup — an exhaustive prior-art sweep never amortises, and its query shape wants the raw variety the wiki compresses away. Leave that in plain retrieval.

The arbitrage lives entirely in one ratio: capex divided by reuse. Compile the things you’ll read a thousand times; rent intelligence for the things you’ll read once.

Stop waiting for a model

The strategic punchline is that this changes what a model release means to you. Frontier gains are handed to everyone on launch day — no relative advantage. The utility-model dividend, the quiet announcement that the same intelligence now costs a fraction as much, flows disproportionately to whoever built an architecture that can spend cheap tokens at volume. Without a substrate to pour them into, a price collapse is just a cheaper chatbot. With one, it compounds through everything you’ve built. The boring release was the revolution all along.

Model choice stopped being a bet on the next launch and became a procurement decision you can make today — re-run whenever the market shifts.

So build the arbitrage book. List your agent workloads, score each with the one question — does more context fix this, or a bigger model? — and rank the “more context” pile by frequency times current model cost. Take the top one, build the smallest wiki that covers just that task, and run it two ways: utility-plus-wiki against frontier-plus-raw. Measure quality and cost in your own logs. My prediction, from having lived it, is that the cheap side wins on both axes at once. The asset moves the value out of the model — where the market can reprice it and hand your competitor the same thing — and into a compiled worldview that is yours, appreciating, and unshippable by anyone upstream of you.

Go deeper

This article is the spine of a field guide — Context Arbitrage — which works the full cost model and break-even arithmetic, takes apart the Gmail case as a controlled experiment across the industry’s three theories of why agents fail, scales the trade to an organisation’s reporting stack, answers the cold-start objection (the moat and the barrier are the same object), and shows why the cheap, cached release is the one you should have been waiting for. The ebook link is in the post above. And if your agent bill is climbing and “we just need a better model” keeps being the answer, that’s exactly the work we do at LeverageAI — find the context-shaped tasks and move your recurring spend to the tier where it belongs.

References

  1. [1]Anthropic. “Pricing.” Published per-token pricing shows a large multiple between flagship and utility-tier models (for example, the Opus class versus the Haiku class) — the raw spread that context arbitrage captures. https://www.anthropic.com/pricing
  2. [2]Anthropic. “Prompt caching.” Cached input tokens are billed at roughly one-tenth of the base input rate, so re-reading an unchanged context prefix costs a fraction of processing it fresh — which compounds the frontier-to-utility spread on any read-heavy, wiki-backed loop. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  3. [3]Farrell, Scott. “The Scout and the Senior: Swap the Brain, Keep the Transcript.” LeverageAI — the companion pattern: split an agent along a time seam so a cheap scout reads and a frontier senior decides, with the Model Barbell allocating cheapest-cached and smartest-frontier and nothing between. https://leverageai.com.au/the-scout-and-the-senior-swap-the-brain-keep-the-transcript/

Discover more from Leverage AI for your business

Subscribe to get the latest posts sent to your email.

Leave a Reply

Your email address will not be published. Required fields are marked *

© 2026 Leverage AI, Scott Farrell. All rights reserved. This content is made available on a limited, revocable, read-only basis only. No licence or right is granted to copy, reproduce, republish, scrape, store, adapt, summarise, index, embed, or use this content to create derivative works, work product, deliverables, methodologies, training materials, prompts, templates, software, services, research, or commercial outputs, whether by humans or machines, without prior written permission. This restriction includes internal business use, client work, consulting, advisory, implementation, and any use in or for artificial intelligence, machine learning, data extraction, retrieval, evaluation, fine-tuning, or knowledge-base construction.