A LeverageAI Field Guide

The Scout and the Senior

Swap the Brain, Keep the Transcript

Frontier-quality agent decisions don't come from a bigger model. They come from where you put the model swap.

A cheap scout explores read-only and freezes the transcript; a frontier senior inherits it and signs one terminal decision — and prefix caching makes it the cheapest shape too.

The argument in three lines

  • Swap the brain, keep the transcript. A cheap read-only scout explores; a frontier senior inherits the whole walk and decides — no summary in between.
  • Dead ends are data. Summaries delete the ruled-out paths judgement needs; the confession field targets the senior's one expensive read.
  • The barbell, priced by caching. Cheapest cached scout, smartest frontier senior, nothing between — one deliberate cache break at the swap. Mid-tier demand is a smell.

Scott Farrell · LeverageAI

01
Part I · The Two-Speed Agent

The Handoff at the Wrong Boundary

The standard multi-agent design is so close to right that its one flaw is easy to miss. It compresses the exploration at exactly the seam where judgement is about to happen.

TL;DR

  • Most agent systems that feel expensive are paying frontier prices to read, not to decide — and reading is what a cheap, cached model now does almost for free.
  • The fan-out-and-summarise pattern deletes the scout's dead ends at the boundary where judgement needs them most.
  • The fix: keep the whole exploration transcript and swap the brain — a cheap scout explores, a frontier senior inherits and decides.

Here is a scenario you have almost certainly built, or are about to. You have a task too open-ended to script — file this source into a knowledge graph, triage this inbox, draft this proposal — so you reach for agents. You fan the work out: sub-agents go and explore, each returns a tidy summary, and a decider reads the summaries and chooses. It is a sensible instinct. Anthropic defines the agentic workload precisely: agents are systems where models “dynamically direct their own processes and tool usage” and may “operate for many turns” on problems where you cannot predict the number of steps in advance.1 When a task runs for many turns, summarising each strand to keep the whole thing affordable feels not just reasonable but necessary.

Look closely at where that summarisation lands, though, because the whole argument of this book turns on it. The sub-agent didn't just find an answer. It checked things and found them irrelevant. It hesitated. It followed an edge, decided the edge led nowhere, and backed out. Then it wrote a summary — and every one of those signals evaporated. What reaches the decider is a clean conclusion with the reasoning texture sanded off.

The reframe

The lossy compression sits at exactly the seam where judgement happens. The decider isn't just missing what the scout found — it's missing what the scout ruled out, and why.

That is the wrong boundary. Judgement is precisely the faculty that needs the abandoned paths, the near-misses, the “I looked here and it wasn't relevant.” An experienced reviewer trusts a recommendation more when they can see the alternatives that were weighed and dropped — a principle we have argued elsewhere under the banner that a system which cannot show what it didn't recommend is an answer generator, not a reasoning partner. Summarising for the decider throws that away by construction.

Invert the handoff

So don't summarise the exploration for the decision-maker. Hand over the raw exploration transcript, and swap the brain mid-conversation. Keep the entire conversation history — every page the scout read, every dead end — and, at the point of decision, change three things at once: swap in the frontier model, rewrite the system prompt to say “a junior prepared this; your job is to finalise it,” and grant the write tool. Same transcript. New brain. New remit.

The scout is the cheapest model that can still walk the territory without getting lost. It gets a read-only toolbelt and one instruction — explore thoroughly, decide nothing — and it ends its turn by calling a single tool, review_done. No write tool. No decision tool. Just a thorough, honest walk, with every tool result written straight back into the conversation so the transcript accumulates the whole search.

The senior then inherits complete situational awareness for the price of reading it once. It can see not only the scout's conclusions but the shape of the scout's search — where it was confident, where it doubled back, which edges it never visited. And it emits a single terminal decision over that inherited context.

Key Insight

The conversation is the working set — pre-materialised by a model priced for reading. Swapping model, tools and framing while keeping the token history is the cheapest possible way to transfer complete situational awareness.

How this differs from the micro-agent patterns you already run

If you have been building agents for a while, this may sound adjacent to patterns you know, so let us draw the line sharply. In a micro-agent architecture, a Router hands to Supervisors who hand to Workers — small, single-responsibility agents, each holding its own context, arranged like microservices. In a discovery accelerator, a Director convenes a Council of specialists to surface options and the reasons they were rejected. Both patterns split work across agents that each maintain a separate context.

The scout–senior split is a different cut. It splits one task along a time seam — a gathering phase and a judging phase — and passes the entire context across that seam intact. Nobody summarises for anybody. The handoff is a mid-conversation model swap with full transcript inheritance, and that zero-loss handover is the mechanism the rest of this book unpacks.

Two ways to reach the same decision

Summarise, then decide
  • • Sub-agents explore, each writes a summary
  • • The decider reads polished conclusions
  • • Dead ends, hesitations, ruled-out edges are gone
  • • The lossy step lands right before judgement
Inherit, then decide
  • • A scout explores; the full transcript is kept
  • • The brain is swapped over that frozen transcript
  • • The senior sees what was checked and dismissed
  • • No compression anywhere between reading and deciding

The reader question this book answers is narrow and practical: how do I get frontier-quality decisions without paying frontier prices for the exploration that feeds them? The answer is not a bigger model. It is a place to put the model swap, a discipline for what crosses it, and an economics — prefix caching — that makes the whole arrangement the cheapest shape rather than the most extravagant. The next chapter is about the one thing that must survive the handoff: the dead ends.

Swap the brain, keep the transcript. You're paying frontier prices to read, not to decide — so let a commodity model do the reading, and hand the whole of it to the frontier model at the one moment it matters.
02
Part I · The Two-Speed Agent

Dead Ends Are Data

The scout's abandoned paths are not waste to be trimmed. They are the single most useful thing it produces — and the thing a summary is guaranteed to delete.

Watch a real scout walk and you will see it do something that looks, at first, like failure. It follows a promising-seeming edge to a neighbouring page, reads it, and concludes the connection isn't there after all — then backs out and carries on. A summariser would never mention that detour; it produced no conclusion. But that “I checked, and there was nothing there” is exactly what stops the senior from re-walking the same dead end, and exactly what it needs to trust that the region really was searched.

scout_transcript.jsonl (excerpt)
grep "prefix caching" → 3 hits
get_page framework.cognition-ladder
# looked like the caching home; it isn't
# ladder allocates by task rung, not phase
get_page framework.cognition-supply-chain
# unsure: is caching a claim here or its own page?
review_done({ unverified: ["caching home page"] })

Hold that excerpt in mind. The scout recorded two things a summary would have dropped: a dead end (the ladder page looked like the natural home for a caching claim, and wasn't), and an admission of uncertainty (it couldn't tell whether caching belonged as a claim on an existing page or deserved its own). Neither is a conclusion. Both are gold. The dead end saves the senior a wasted read; the admission tells the senior exactly where to spend one.

This is the zero-loss handoff working as designed. The senior doesn't inherit a verdict — it inherits the whole epistemic trail: what was checked and found irrelevant, where the scout hesitated, which edges it followed and abandoned. Dead ends are informative, and the only way to keep them is to refuse to compress.

Key Insight

Dead ends are informative. Summaries delete them. The senior needs not only what the scout found, but what it ruled out — and the reasons it ruled it out.

The senior's blind spot

There is a genuine failure mode here, and you should name it before you ship. The senior's entire view of the world is framed by the junior's walk. If the scout never visited a region, the senior cannot see that the region is missing — junior blindness becomes senior blindness. A frontier model reasoning flawlessly over an incomplete map is still reasoning over an incomplete map, and it will do so with total confidence, because nothing in the transcript signals the gap.

Two mitigations close most of that gap, and they cost almost nothing.

Inject the top-level map, always

Whatever the scout walked, give the senior the one-screen overview of the whole territory as a persistent header — unconditionally, every time. Now, even if the scout never descended into a neighbouring region, the senior can see that the region exists adjacent to the change it is about to make, and can choose to look. This is cheap in the currency that matters: a one-screen map costs little of the attention budget, and attention — not raw token capacity — is the thing you are actually rationing.

Require a confession field

Make review_done declare its own uncertainty. Not a free-text apology — a structured field: regions I did not explore, claims I could not verify, places I was guessing. The junior confessing its blind spots is what converts the senior's extra reads from a random spot-check into a targeted audit. Remember that the senior still has the read tools; the confession field is what tells it where to point them.

What the confession field buys you

✗ Without it

The senior either trusts the whole transcript uniformly, or re-reads at random to feel safe — paying frontier prices for spot-checks that mostly land on regions the scout already covered well.

✓ With it

The senior re-reads only the handful of edges the junior flagged as guesses, confirms or corrects them, and commits. The extra frontier reads land exactly where the risk is.

Stabilise before you commit

Notice what the senior actually does with the confession: it re-reads the uncertain edges before it writes anything. Exploration stabilises, then the decision lands. That is the same coarse-to-fine discipline we apply to any complex generation — fix the structure at the level where it is still cheap to fix, and only add high-resolution commitments once the layer beneath them is stable. Applied to writes, the rule reads: place the change in the map first (which pages does this touch?), and only then commit claims and edges. Don't stack a high-resolution decision on an unstable placement.

Put the map injection and the confession field together and the senior stops being a brilliant model trapped inside a junior's field of view. It becomes a brilliant model that knows the shape of the whole territory, knows precisely which parts of the junior's report to trust, and knows exactly where to spend its expensive attention. That is a very different, and far safer, thing.

The junior declaring its own uncertainty is what lets the senior target its extra reads instead of spot-checking at random. A confessed blind spot is worth more than a confident summary.
03
Part I · The Two-Speed Agent

One Terminal Mutation

When the senior finally acts, how it acts decides your bill. Emit one big decision document, not a stream of small mutation calls — and discover you have built a governance layer by accident.

The first version I built let the senior act the natural way: incrementally. Add a source. Then add an edge. Then update a claim. Three tidy tool calls, each doing one clean thing. It cost a fortune, and it took me a moment to see why, because nothing about any individual call looked expensive.

The reason is structural, and once you see it you cannot unsee it. Every tool round-trip re-processes the entire conversation as input. The model doesn't remember the transcript between calls the way you remember a page you just read; the whole history is re-sent and re-billed on every turn. So “add a source, then an edge, then a claim” isn't three small operations. It is three full passes over an ever-growing context — and on a frontier model, over a transcript that already contains a scout's entire exploration, three passes is how a decision that should cost cents ends up costing dollars.

Key Insight

N small mutation calls cost roughly N passes over a growing context. One terminal decision document costs approximately one read and one generation.

So change the shape of the act. Have the senior emit one big terminal decision document — a single structured output describing every change at once. Pages to create. Claims to update. Edges to draw. The output is complex, but the model produces it in one pass: one read of the gathered context, one generation, done. The scout paid the token-heavy exploration bill on a commodity model; the senior now pays once, at the decision point, and never in a loop.

mutation_document.json — the senior's single terminal output
{
  "create_pages": [{ "id": "concept.zero-loss-handoff", "summary": "…" }],
  "update_claims": [{ "page": "framework.micro-agents",
                       "claim_id": 4, "text": "…" }],
  "edges": [{ "from": "concept.zero-loss-handoff",
           "to": "framework.decision-authority", "type": "governed-by" }],
  "confession": { "unverified": ["…"], "not_explored": ["…"] }
}

The document is quietly the governance layer

Here is the part that surprised me. The single mutation document isn't only cheaper. It is governable in a way that a stream of live tool calls can never be, because the model only ever proposes — deterministic code applies. And between propose and apply sits a gate.

Before the document touches anything real, you can lint it. Do the referenced page IDs actually exist? Are the edge types legal? Are the claims it wants to update real claims on real pages? A live sequence of tool calls has already mutated the graph by the time you notice the third one was nonsense; a proposed document is inert until you accept it. If it fails the lint, you reject it back to the senior for a single repair round — the confession field usually tells you where the trouble is — and only a clean document is ever applied.

This is the difference between hoping the model behaves and being able to constrain it before the fact. Most AI governance can observe and explain a decision after it has executed, but cannot technically prevent a bad one from executing in the first place. A propose-then-verify boundary inverts that: authority, policy and evidence are checked in-path, before any action can occur. The mutation document is that boundary, and you got it for free the moment you stopped letting the model write incrementally.

The same act, two shapes

N passes

Incremental writes: each tool call re-processes the whole transcript, and the graph mutates before you can validate it

1 pass

One terminal document: one read, one generation, lintable before it touches anything, applied as a single commit

Apply it once, and keep the receipts

Apply the document as a single deterministic commit and two more properties fall out. Transactionality: the whole change lands or none of it does — no half-written graph from a mutation that failed on call three of five. Replay: the scout's transcript plus the mutation document together are a complete record of what the system knew and what it decided. When a consolidation looks wrong six months later, you don't guess — you replay exactly what was in front of the model.

That pairing has a name worth keeping: a decision attestation package. The junior transcript is the evidence; the mutation document is the verdict; the commit is the signature. Every change your system ever makes carries its own audit trail, not because you bolted governance on afterwards, but because the write path forced it. Hope is not a control. A proposed, linted, replayable document is.

And notice — we will return to this in Chapter 7 — that the one-terminal-call rule and the caching economics of the next chapter are the same insight seen from opposite ends. Both say: minimise the number of expensive passes over the transcript, and let the cheap passes be as many as they like.

The model proposes; deterministic code applies. One lintable document, applied as one commit, replayable forever — you set out to save money on tool calls and accidentally built the governance layer.
04
Part I · The Two-Speed Agent

The Model Barbell

Cheapest cached model at one end, smartest frontier model at the other, and almost nothing in between. Not a cost hack — a consequence of the architecture, with a design rule you can run in reverse.

I will admit something. For two years I got excited the same way everyone did: a new frontier model drops, there's a launch, benchmarks, a demo day, and I couldn't wait to point it at my hardest problems. What I did not notice, for far too long, was that the release I should have been waiting for was the other one — the cheaper, faster, cached model, arriving with no fanfare as a quiet line on a pricing page. That one had the bigger compounding impact on everything I'd built. The frontier release is a step function you can feel. The cheap-token release is an exponential with no press conference, and it is the one that changed my systems.

Once the scout–senior split sorts your cognition into two piles, you can see why. Each pile lands on a natural price point, and the middle gets squeezed out from both sides.

Two piles, two prices

Exploration is breadth-bound and judgement-light. The work is reading, following edges, and not getting lost. Past a “just enough intelligence” threshold, a smarter scout barely improves the walk — but a cheaper scout buys you more walking, and more exploration beats marginally cleverer exploration almost everywhere. So at the gathering end you want the cheapest model that still navigates reliably.

Judgement is quality-bound and volume-light. It is one terminal pass, so you buy the very best model you can — and because it runs once, over a context the cheap model already assembled, the bill stays small. So at the deciding end you want the smartest model available, and you can afford it precisely because it is rare.

This is our cognition-allocation doctrine, but implemented at a finer grain than usual. Where the Cognition Ladder allocates intelligence by task — real-time versus batch versus overnight — the barbell allocates by phase within a single task: commodity cognition for the gathering phase, frontier cognition for the judging phase, inside one job.

The design rule, run in reverse

The mid-tier is worse than the cheap model per token at reading, and worse than the frontier model per decision at deciding. So if a task seems to want a mid-tier model, that's a smell — gathering and judging haven't been separated yet. Mid-tier demand is architectural debt announcing itself.

Caching is what makes the cheap end nearly free

The barbell only closes because of a second mechanism, and it is the more important one, because it changes the complexity class of the loop rather than merely discounting it.

An N-turn agent conversation without caching is quadratic. Every turn re-sends and re-processes the entire growing prefix, so the total cost climbs with the square of the conversation length. That is the same tax we met in Chapter 3 with incremental writes — it is just the whole loop paying it. Write every tool result back into the conversation, as the scout does, and without caching you have built something that gets more expensive per turn the longer it runs.

Automatic prefix caching inverts that. Anything the provider has already seen in the prefix is billed at a small fraction of the base rate — on the order of a tenth2 — so each turn's marginal cost is roughly just its new tokens plus a heavy discount on everything already read. The loop goes from O(N²) to near-linear, and near-constant per turn. In my own production wiki work that lands at roughly three-hundredths of a cent per turn on the cheap models, no matter how deep the conversation runs.

What caching changes

O(N²)

uncached agent loop — every turn re-processes the whole growing prefix

≈ linear

with prefix caching — marginal cost per turn is roughly just the new tokens

~0.03¢

observed cost per scout turn on the cheap models, regardless of depth

This quietly converts the scout's “write every tool result into the conversation” design from an extravagance into the economical structure. Append-only transcripts are the cache-optimal shape. And that carries a real constraint: anything that mutates the prefix — reordering tools, editing history, changing the system prompt mid-stream — invalidates the cache and drops you back toward quadratic.

Bottom Line

Your scout-to-senior handoff mutates the prefix — it swaps the model, the tools and the system prompt at once. So do it exactly once, deliberately, at the point of maximum value. Freeze the prefix for the whole walk; pay for one cache break at the brain swap.

The scout tier is a commodity slot — treat it like one

Because the scout is deliberately the cheapest viable model, choosing it stops being a bet and becomes a procurement decision. Which model? Look at what people paying per token under real agentic load actually reach for. Benchmarks measure peak intelligence; popularity among people spending their own money on real workloads reveals the price-competence frontier — closer to the question you actually have. Its bias is toward hype and defaults, but you already own the correction: a model swap on the scout tier is an auditable bake-off. Run the same exploration on two candidates and diff the walks — did the challenger visit the pages that mattered, or wander? Re-run it whenever the rankings shift, and let the cheap end of the barbell keep getting cheaper underneath you.

I use the cheapest cached model and the smartest frontier model, and almost nothing in between. The two ends got more different; the exciting end got less scarce, and the boring end got more valuable.
05
Part II · Inside One Build

Anatomy of a Scout Walk

Enough principle. Here is a real scout doing a real job — the ingestion agent that files a new source into a self-maintaining knowledge graph — running on the cheapest cached model, one page at a time.

The system this book is drawn from is a wiki: a graph of pages, claims and typed edges that a set of agents build and maintain, so that other agents can reason over a compiled worldview instead of raw documents. When a new source arrives — an article, a project folder, a session transcript — something has to read it, work out where it belongs, and wire it into the existing graph. That something is our flagship scout, and watching it work makes every abstraction in Part I concrete.

The toolbelt

The scout gets a read-only kit and nothing else. Six tools do the navigating, arranged as a resolution ladder from one-screen overview down to ground truth:

Overview — cheap, always resident

map — the one-screen index of the whole wiki: pure information scent.

get_page — a single page's claims and edges (the deliberately blurry layer).

Structure — regenerated on demand

get_skeleton — signatures, docstrings and structure without full bodies.

grep — pinpoint verification across the live source.

Ground truth — read live, never cached

read_file / get_session_segment — full-resolution source when a question earns it.

The skeleton tool is worth a note, because there is battle-tested precedent for it. Aider's repository map parses code with a syntax-aware parser, ranks the important symbols, and renders signatures and structure, not full bodies, inside a token budget.3 That is exactly the scout's instinct — pull the shape of a source cheaply, and only descend to the bytes where judgement demands it.

The scout also carries two search tools, which are the ladder's endpoints. One is a single, fast query against the map. The other is an agentic walker that answers a question by walking the wiki and returns an answer with its provenance. Giving the scout its own search means that when it is unsure where something belongs, it can go and look mid-walk rather than guess — a property we will lean on hard in Chapter 8.

A North Star, not a checklist

What the scout does not get is a procedure. We don't hand it a category form to fill in. We give it a purpose — file this source into the worldview; capture what it means and how it connects, not what it literally says — and let its judgement decide how far down the ladder each question deserves to go. A weekend spike gets a shallow pass; a dense, framework-adjacent project triggers a deep crawl; and nobody had to pre-decide which was which.

Watch it walk

Here is a trimmed transcript from a real ingestion — the scout deciding where a claim about prefix caching belongs. Read it for the two moves a summary would have destroyed: a dead end, and a confession.

ingest_scout.jsonl — cheapest cached model, low thinking
map() → 633 pages loaded as scent
grep("prefix caching") → framework.cognition-ladder, framework.cognition-supply-chain
get_page(framework.cognition-ladder)
# dead end: the ladder allocates by task rung, not by caching.
# this is not the home for a caching claim. back out.
get_page(framework.cognition-supply-chain)
# closer — this page owns token economics. plausible home.
# unsure: caching may deserve its own concept page.
# the walker found no existing page named for it.
review_done({
  proposed_home: "framework.cognition-supply-chain",
  confession: { unverified: ["is caching a claim or its own page?"],
                not_explored: ["context-engineering neighbours"] }
})

That is the whole shape. The scout followed a plausible edge to the cognition-ladder page, found it was the wrong home, and recorded the dead end. It found a better candidate but couldn't resolve a real question — claim or its own page? — and confessed it, along with a region it hadn't walked. Then it stopped. No write. No decision. Just an honest map of what it found, what it ruled out, and what it wasn't sure of.

Key Insight

A cheap model on low thinking stays accurate here because the map holds the navigation state. Following a named link on a page written to be read is the single most native act an LLM can perform — it's just reading.

That last point is why the barbell's cheap end works at all. Walking a graph of named, typed edges is squarely in an LLM's training distribution; formulating queries against an opaque embedding space is not. The map does the navigation thinking, so the model only has to do the part it is best at — and you can buy the cheapest one on the rankings page and still get an accurate walk.

One last thing about that transcript: we keep it. Every scout walk is cheap to store and is its own asset — provenance for the decision it fed, and raw material about how the system explores. In the next chapter, the senior picks up exactly this transcript, reads the confession, and turns it into a decision.

Watching a cheap model on low thinking walk a map is uncanny — it never asks for the same page twice. You've externalised the cognition the model is worst at and left it the part it's best at.
06
Part II · Inside One Build

The Senior Signs the Edges

The same conversation, a new brain. The frontier model reads the scout's confession, re-checks exactly the edges the junior was guessing at, and emits one document that changes the graph.

Pick up the transcript exactly where Chapter 5 left it. The scout has proposed a home for the caching claim, confessed that it couldn't decide between “a claim on an existing page” and “its own concept page,” and flagged a neighbourhood it never walked. Nothing has been written. Now we perform the handoff.

We keep the entire conversation — every page the scout read, the dead end at the cognition-ladder page, the confession — and we change three things at once. Swap in the frontier model. Rewrite the system prompt: a junior has explored this source and prepared a proposal; your job is to finalise it into a single mutation document. Grant the write tool. The transcript is frozen; the brain is new.

The first thing the senior does is the thing the whole design exists to enable: it reads the confession field, and it goes straight to the uncertainty. It doesn't re-read the whole walk — that would be paying frontier prices to re-do the scout's cheap work. It re-reads the two or three edges the junior admitted it was guessing at. In our running example, it pulls the context-engineering neighbours the scout skipped, confirms the caching idea is distinct enough from the supply-chain page's existing claims, and resolves the open question: caching earns its own concept page, with an edge back to the supply-chain framework.

Key Insight

The frontier model earns its price on a handful of tokens — the edges the junior confessed it was unsure about — not on re-reading the whole exploration. It targets its intelligence, because the scout told it where the risk was.

Two roles, one machine

Before the schema, hold the two roles side by side, because everything that differs between them is deliberate and everything that's shared is the point.

  The Scout The Senior
Model Cheapest cached model that walks the map Smartest frontier model available
Tools Read-only toolbelt + search; ends on review_done Same read tools + the write tool
North Star Explore thoroughly; decide nothing; confess uncertainty Finalise this into one governed decision
Output A frozen transcript + a confession One terminal mutation document
Shared The same conversation, the same map, the same read machinery — only the brain, the remit and one tool change at the seam.

The mutation document

The senior's single output describes every change at once, and its schema is where the wiki's whole ontology lives — pages, claims and typed edges, including cross-wiki edges that link a source into the canonical concept graph without minting duplicate pages.

mutation_document.json — the senior resolves the scout's confession
{
  "create_pages": [{ "id": "concept.prefix-caching",
                   "type": "concept", "summary": "…" }],
  "edges": [
    { "from": "concept.prefix-caching",
      "to": "framework.cognition-supply-chain", "type": "evidenced-by" },
    { "from": "concept.prefix-caching",
      "to": "project.wiki-ingest", "type": "implemented-in" }
  ],
  "resolves_confession": ["is caching a claim or its own page? → own page"]
}

One tool call carries all of it. The senior pays for approximately one read of the gathered context and one generation, exactly as Chapter 3 promised — and because it is a single document rather than a stream of live writes, it is inert until we accept it.

Lint, apply, replay

Now the governance boundary does its work. We lint the document before anything touches the graph — do framework.cognition-supply-chain and project.wiki-ingest exist, is evidenced-by a legal edge type, is the new page well-formed? If it fails, it goes back to the senior for one repair round. If it passes, deterministic code applies the whole thing as a single commit.

The write path, end to end

Detect → Escalate → Apply → Record

  • Detect: lint the proposed document against the live schema and graph.
  • Escalate: on failure, reject back to the senior with the specific violation — one repair round.
  • Apply: on success, commit every change atomically — all of it lands, or none does.
  • Record: keep the transcript + document + commit as one replayable package.

This is the propose-then-verify boundary from Chapter 3, running against the system's own memory — the wiki's maintenance is governed by an in-path enforcement layer, for free, because the write path forced it. And it honours the stabilisation gate: the senior placed the change in the map (a new concept page, its edges) before committing the high-resolution claims that hang off it, rather than stacking claims on an unstable placement.

Six months from now, if a janitor consolidation looks wrong, we won't argue about it. We'll pull the attestation package — the scout's exact walk, the senior's exact document, the commit — and replay precisely what the system knew and what it decided. Which brings us to the question every finance person eventually asks: what did all this actually cost?

The scout maps the territory and confesses its doubts; the senior signs only the edges that were in doubt. Cheap attention explores the whole graph so expensive judgement can touch just the parts that matter.
07
Part II · Inside One Build

Three-Hundredths of a Cent

The claim from Chapter 4, now with the arithmetic worked and the cron log open. Why a deep scout walk costs almost nothing, where the one expensive pass actually lands, and how to decompose a task that feels like it wants a mid-tier model.

Open the cost log for the ingestion agent and the striking thing is what doesn't change. A scout walk that touches five pages and one that touches forty cost almost the same per turn — roughly three-hundredths of a cent on the cheap models. Depth is nearly free. That is not a rounding artefact; it is the whole economic argument made visible, and it is worth working the arithmetic so you can predict it rather than just observe it.

Why depth is nearly free

Take an N-turn scout conversation where each turn adds a chunk of new tokens (a page, a grep result) to the transcript. The cost driver is how many tokens the model processes across the whole walk.

The same forty-turn walk, two billing regimes

~820×

Uncached. Turn k re-processes the whole prefix, so total token-passes grow with N². Forty turns is on the order of forty times the cost of one — the loop is quadratic.

~40×

Cached. The already-seen prefix bills at roughly a tenth, so each turn's real cost is close to just its new tokens. Total grows with N — the loop is near-linear.

The exact multiplier depends on your token sizes, but the shape is robust: without caching, the marginal cost of turn forty is forty times the cost of turn one, because the model re-reads everything each time. With prefix caching billing seen tokens at about a tenth,2 turn forty costs about the same as turn one plus a rounding fraction for the re-read. Quadratic collapses to linear, and per-turn cost flattens. That flat line in the cron log is the collapse, printed.

Key Insight

The scout pays the token-heavy exploration bill on a cheap, cached model where it's near-linear and near-free. The senior pays for exactly one expensive read, at the decision point. Neither end ever pays the quadratic tax.

Where the one expensive pass lands

The scout’s handoff to the senior is the one deliberate cache break in the whole system — it swaps the model, so the cheap cached prefix cannot carry over. That means the senior's first read of the transcript is priced at frontier rates, uncached. This is the single most expensive event in an ingestion, and it is worth seeing exactly how small it still is, because it runs once per source, over a transcript a cheap model already assembled, and it produces one document rather than a loop.

Phase Model Passes over the transcript Cost shape
Scout walk (N turns) Cheapest cached N, each mostly cache-read Near-linear, ~0.03¢/turn
The brain swap Frontier 1 uncached read (the cache break) One-off, paid once per source
The decision Frontier 1 generation (one document) Terminal, no loop

Put plainly: you pay for one frontier read and one frontier generation per source, and everything else — the reading, the exploring, the dead ends — runs at commodity, cached rates. The comprehension you bought is paid once and then amortised across every future query that touches the pages it produced, which is why compiled understanding behaves like a capital asset rather than a per-call operating expense.

Decomposing a “mid-tier-shaped” task

The barbell heuristic says mid-tier demand is a smell. Here is what curing it looks like in practice. Suppose you have a task that feels balanced: “read this customer's whole account history and decide whether to auto-approve a refund.” The instinct is to reach for a solid mid-tier model — smart enough to judge, cheap enough to run at volume. Resist it, and split:

Gathering (cheap, cached scout)

Walk the account: prior tickets, refund history, plan, flags. Assemble the whole picture into the transcript. Confess anything ambiguous (“two conflicting notes on eligibility”). Near-linear, pennies at volume.

Judging (frontier senior)

One pass over the assembled history, targeting the confessed conflict. Emit one decision document: approve or escalate, with the reasons attached. Runs once. Best model, small bill.

The mid-tier vanished because the “balance” it was serving was really two different jobs sharing one call. Most enterprise agent work is this task in disguise — a simple decision that requires deep context: triage, routing, drafting, reconciling. The industry keeps misdiagnosing those as capability gaps and reaching for a bigger model; the barbell says the fix is decomposition, and the models to build it with are already on the rankings page.

Choosing the scout by revealed preference

Which cheap model? Don't trust the benchmark leaderboards; they measure peak intelligence, which is the one thing the scout doesn't need. Look instead at the popularity rankings among people paying per token under real agentic load — a revealed preference for the price-competence frontier. Then correct for their hype bias the way only you can: run the same ingestion on two candidates and diff the walks. Did the challenger reach the pages that mattered in fewer turns, or did it wander and re-fetch? The scout tier is a commodity slot by design, so keep re-running that bake-off, and let the cheap end drift cheaper underneath you while your architecture stays put.

Depth is nearly free on the cheap end, and judgement is rare on the expensive end. The flat line in the cron log is a quadratic loop collapsing to linear — you're just watching the architecture pay for itself.
08
Part III · One Machine, Many Jobs

One Toolbelt, Many North Stars

The scout–senior pattern isn't the only thing that repeats. The whole agent repeats. The ingester, the query agent and the maintenance janitor are the same machine pointed at different destinations.

The thing that struck me once the ingestion agent was working was how little new code the next agent needed. The query agent — the one that answers a question by walking the wiki — was almost the same code. And the janitor, the one that tidies and compacts the graph on a schedule, was almost the same code again. That is not a coincidence to be tidied away. It is a design principle worth building on deliberately: one explorer toolbelt, many North Stars.

All three roles share the same navigation kit from Chapter 5 — map, get_page, get_skeleton, grep, read_file, get_session_segment, and the two search tools. What differs between them is small and precise: the North Star they are pointed at, whether they hold the write tool, and the loop that drives them.

Query — read-only, interactive

North Star: “Answer from the map; descend the ladder only where the question warrants it.”

Write access: none. Loop: interactive, one question at a time.

Ingest — writes, batch

North Star: “File this source into the worldview; capture what it means and how it connects.”

Write access: the write tool, via the scout–senior handoff. Loop: batch, one source at a time.

Janitor — writes, cron

North Star: “Compact the graph; merge redundancy, retire cold pages, mend broken edges.”

Write access: the write tool. Loop: cron, over the whole graph on a schedule.

Each write-holding role uses the same scout–senior split inside it — a cheap explorer walks, a frontier finaliser emits one mutation document. The pattern nests: one machine, three North Stars, and the same two-speed engine underneath each.

The reason that matters more than saving code

Sharing the toolbelt saves build effort, obviously. But there is a deeper reason to enforce it, and it is the part most teams miss. The ingestion agent must experience the wiki through exactly the same eyes as the future query agent — or it will file things where no reader will ever look.

Think about what filing actually requires. To place a source well, the ingester has to answer a question about the future: “where would a reader look for this?” If the ingester navigates the wiki through different machinery than the reader will, it is guessing at that answer. But give the ingestion agent the query tools — the same map, the same walker the reader uses — and the question stops being a guess. The ingester answers it empirically, by actually looking, with the reader's own eyes. It files the source where it just confirmed a reader would go.

Key Insight

The writer rehearses the read. Shared navigation code is what guarantees the writer and the reader share a map — the ingester is writing for its own future self, and it can only do that if it walks the way the reader will.

This is why the shared toolbelt is a correctness property, not just an efficiency one. Writer/reader symmetry falls out of the shared code: the agent that files a page and the agent that later retrieves it are, by construction, using the same notion of “where things live.” Break the symmetry — let ingestion use a bespoke filing path — and you reintroduce the classic failure of every knowledge base, where things are technically stored and practically unfindable.

Testing collapses to one harness

There is a bonus, and it is a large one. Because all three roles walk the same machinery, you test them the same way — and the way is not what you'd expect. You don't test the answer; you test the path. Did the agent read the canonical source? Which pages did it actually touch? Did it descend the ladder where the question warranted it, and stop where it didn't? The path is a property of the map and is inspectable; the final prose is a property of the model and is not. One test harness — test-the-path — covers the query agent, the ingester and the janitor, because they all walk the same path machinery.

The platform transition

When every agent is toolbelt + North Star + the same substrate, agents commoditise. The marginal cost of a new agent collapses to writing a North Star and granting the toolbelt — agent number seven costs a paragraph. That's not five agents; it's one platform with five profiles.

That collapse is the real dividend of the shared machinery, and it is what lets a very small team run a fleet. But the query agent has one more trick that the others don't — and it is the one that stops the whole system from slowly decaying. That is the next chapter.

The ingest, the query and the janitor are one explorer with three destinations. Only the loop driver differs — interactive, batch, cron — and the writer literally rehearses the read.
09
Part III · One Machine, Many Jobs

Querying the System Improves the System

The query agent shares its toolbelt with the ingester and the janitor. That sharing buys one more thing the others can't: every query becomes a chance to catch the graph being wrong — and to fix it.

Here is a moment that happens in every knowledge system, and that most of them handle badly. A query agent is walking the wiki to answer a question. It descends the ladder — from the map, to a page, down to the live source the page claims to summarise — and it finds a contradiction. The page says one thing; a grep against ground truth says another. The wiki, at this spot, is simply wrong.

What should the agent do?

The tempting move — and the wrong one — is to answer around it. The agent has the ground truth in front of it; it can just quietly use the correct value, produce a good answer, and move on. The user is happy. But the wiki is still wrong, the next query will hit the same contradiction, and you have taught your system to paper over its own errors at read time, forever. That is patching at the highest resolution, in the one place a patch does no lasting good.

Two responses to a contradiction found at query time

✗ Silently answer around it

  • • Use the correct value, produce a good answer, move on
  • • The page stays wrong; the next query hits it again
  • • The system learns to hide its own errors

Outcome: slow, silent decay

✓ Escalate — log a janitor work item

  • • Answer the user, and record the contradiction
  • • The janitor fixes the page at its own resolution
  • • The next query hits a corrected graph

Outcome: the map improves from being used

The write-back loop

The right move is to escalate. Answer the user, yes — but also log a work item: page X contradicts its source; pages Y and Z appear to overlap. That work item goes into the janitor's queue. The janitor, running on its own cron loop with its own compaction North Star, picks it up later and fixes the graph at the level where the error actually lives — rewriting the claim, mending the edge, merging the duplicates. This is the Exception Protocol applied to a knowledge system: detect the failure, escalate it rather than patch around it, let the maintenance role refactor, and recompile the affected region.

janitor_work_item.json — emitted by the query agent, consumed by the janitor
{
  "kind": "contradiction",
  "page": "framework.cognition-supply-chain",
  "claim_id": 7,
  "evidence": { "source": "ask_cli/provenance.md", "grep": "…" },
  "detail": "page states 0.3c/turn; source shows 0.03c/turn",
  "found_by": "query-agent", "provenance": "session:4c1a#turn-12"
}

Key Insight

Query-time failures become the janitor's queue. Which means querying the system improves the system — the map gets better from being used, not only from being fed. That write-back loop is what makes the whole thing compound instead of decay.

The failure role-symmetry prevents

Now recall the misfiling risk from Chapter 8, and watch how the shared toolbelt earns its keep here too. Suppose ingestion had used a bespoke filing path — a clever category heuristic the reader's machinery doesn't share. A page gets filed under a label that made sense to the ingester's private logic but that no reader would ever navigate to. Now the query agent, walking the reader's map, never reaches the page. It cannot even detect that the page is stale or wrong, because it never lands there. The contradiction goes uncaught, the work item is never emitted, and the page rots in a corner of the graph nobody visits.

Because the ingester filed through the reader's own eyes, that failure doesn't arise. The page sits where the query agent walks, so when it's wrong, the query agent finds it, and the write-back loop kicks in. Role-symmetry isn't just tidy — it is what keeps every page inside the blast radius of the self-correction mechanism. A page filed where no reader looks is a page the janitor can never be told to fix.

Drift, made visible

Step back and the shape is familiar to anyone who has run production software. An unmaintained knowledge graph drifts silently — claims go stale, edges break, duplicates accrete — and you don't notice until an agent gives a confidently wrong answer. The write-back loop turns that silent drift into a visible queue. Every query is a regression test that either passes or files a bug. The janitor is continuous integration for your worldview: it runs on a schedule, works the queue, and makes drift visible before it becomes failure. The index isn't a static artefact you build once; it is a living thing that the act of querying keeps honest.

Don't answer around a wrong page — escalate it. Query-time failures become the janitor's queue, and a system that improves from being used is a system that compounds instead of decays.
10
Part III · One Machine, Many Jobs

A Procurement Decision, Not a Bet

The scout–senior split, the barbell, the shared toolbelt — put them together and model choice stops being a gamble on the next release. It becomes a procurement decision you can re-run in an afternoon.

There is a small irony in the fact that I worked out much of this book by talking to a frontier model, and spent a good part of that conversation explaining to it why the frontier model was the less important release. It took the point in good humour. And it is the right place to end, because it names the shift underneath everything: the capability to be the senior has existed for a while; what arrived recently was the economics of the scout — and the economics were the event.

The same machine, pointed everywhere

Once you have a two-speed engine and a shared toolbelt, new agents stop being builds and start being configurations. Three examples from the same stack, each just a North Star and a terminal tool away from the ingester:

The curator — inbound

A tweet or a headline is just a source. The scout walks your worldview anchored on it; the senior, instead of a mutation document, emits a briefing: here's what this means given what you already argue, here's where it breaks, here's the different spin. Terminal tool: a push, not a write.

The proposal compiler — outbound

Same engine run in reverse: the scout gathers a target company's context, the senior diffs it against your compiled expertise and emits a proposal. Inbound news and outbound proposals are one operation — compute the delta between your worldview and an external entity, then make the delta visceral.

The voice hot path — real time

On a live call the frontier model is banned from the hot path by physics — it's too slow to talk. So the fast lane holds the phone on a cheap, cached model, and the slow lane does the authorised work behind it. A compiled per-caller context plus a utility model isn't a cost optimisation here; it's the only way intelligence gets into the call at all.

That last one is the barbell under the harshest possible constraint. Voice and real-time systems fail when one model is forced to be responsive, deep and correct at once; the fix is to split responsiveness from cognition into parallel lanes. Under latency, the barbell argument doesn't weaken — it becomes mandatory.

Key Insight

Call the strategy what it is: context arbitrage. There's a wide, widening price gap between frontier and utility models, and a compiled context lets you capture that spread on every task whose difficulty was really context-depth wearing an intelligence costume.

Context arbitrage

Most enterprise agent work is exactly that kind of task: a simple decision that requires deep context. Triage, routing, drafting, flagging, reconciling — the industry keeps misreading these failures as capability gaps and responding by waiting for a better model or paying for a bigger one. The scout–senior split says the fix was never a smarter model; it was separating the reading from the deciding, so the reading could run cheap and the deciding could run rare. You capture the frontier-to-utility spread on every one of those tasks, using models already on the rankings page.

Why did this only become viable recently? Because the binding constraint was never the quality of judgement — top models have been smart enough to be the senior for a while. It was the unit cost of comprehension: reading everything was too expensive to do at scale. Cheap, cached models collapsed that cost, and the collapse is what moved these systems from clever-but-extravagant to runs-nightly-for-cents. The capability existed; the economics arrived later.

From bet to procurement

Here is what all of this does to the most stressful decision in the field: which model to use. When intelligence lived inside the model and nowhere else, every model choice was a bet on the next release — and every release forced a re-evaluation of your whole stack. Move the durable intelligence into a compiled context, and the model becomes the interchangeable part. The scout tier is a deliberately commodity slot: swap it in an afternoon, audit the swap by diffing the walks, and let it drift cheaper underneath you. The senior tier you upgrade the day a better frontier model ships, without touching anything else, because it only ever does the one terminal pass.

Bottom Line

Model choice stopped being a bet and became a procurement decision — swappable, auditable, and cheap to re-run whenever the market moves. You built the system where the model market can't touch your asset.

And the whole thing compounds. Every ingestion sharpens the graph; every query catches an error and feeds the janitor; every proposal leaves behind context about a company or a vertical that files back as a new source. The machine that turns the flywheel gets slightly better built with each turn — which is the point where a tool stops being a tool and starts being an asset that appreciates.

Build the two-speed agent

If you're running “the smart model does everything” and the bill is climbing, the fix isn't a bigger model — it's a place to put the swap. Split the agent: a cheap, cached scout explores read-only and freezes the transcript; a frontier senior inherits it and signs one governed decision. Put the one cache break where the value is.

That's the work we do at LeverageAI — and if you'd like a second pair of eyes on where your model swap should sit, that's exactly the conversation to start.

Spend frontier cognition on judgement, commodity cognition on gathering. The cheap model got smart enough to be the scout; the economics arrived later — and the economics were the event.
REF
Sources & Evidence

References & Sources

The evidence base behind every claim — primary research, industry analysis, and technical specifications

Research Methodology

This ebook draws on primary research from standards bodies, independent research firms, enterprise technology vendors, and consulting firms. Statistics cited throughout have been cross-referenced against primary sources.

Frameworks and interpretive analysis developed by Scott Farrell / LeverageAI are listed separately below — these represent the practitioner lens through which external research is interpreted, and are not cited inline to avoid self-promotional appearance.

Industry Analysis & Vendor Research

Anthropic — Building Effective Agents [1]

agents dynamically direct their own tool usage and operate for many turns on open-ended problems

https://www.anthropic.com/research/building-effective-agents

Aider — Aider repository map [3]

parse code with tree-sitter, rank symbols by importance, render signatures and structure not full bodies within a token budget

https://aider.chat/docs/repomap.html

LeverageAI / Scott Farrell — Practitioner Frameworks

The interpretive frameworks, architectural patterns, and practitioner analysis in this ebook were developed through enterprise AI transformation consulting. The articles below are the underlying thinking behind those frameworks. They are listed here for transparency and further exploration — not cited inline, as this is the author's own analytical voice.

Scott Farrell — Discovery Accelerators

rejected paths must be first-class outputs; a system that hides what it didn't recommend is an answer generator, not a reasoning partner

https://leverageai.com.au/discovery-accelerators/

Scott Farrell — Micro-Agents Architecture

replace monolithic multi-tool agents with specialised single-responsibility micro-agents in a Router / Supervisor / Worker hierarchy

https://leverageai.com.au/micro-agents-architecture/

Scott Farrell — Context Engineering

context is attention, not just capacity; load only what the current task demands and treat the window as a memory hierarchy

https://leverageai.com.au/context-engineering/

Scott Farrell — Progressive Resolution

work coarse-to-fine, stabilising each layer before adding resolution; place a claim before committing high-resolution detail on top of it

https://leverageai.com.au/progressive-resolution/

Scott Farrell — Decision Authority Infrastructure

AI systems propose actions while authority, policy, and evidence are verified at an in-path enforcement boundary before any action executes

https://leverageai.com.au/decision-authority-infrastructure/

Scott Farrell — The Cognition Ladder

allocate AI value across time-scale rungs; the compounding opportunity is batch and overnight work, not real-time competition

https://leverageai.com.au/the-cognition-ladder/

Scott Farrell — The Cognition Dimension Ladder

a four-rung map of AI cognitive value; the frontier sits where an engine generates and scores futures and a human sits in the judgement seat

https://leverageai.com.au/cognition-dimension-ladder/

Scott Farrell — The Index Is the Data

pre-process the corpus into a self-maintaining markdown wiki-graph of claims and edges; retrieval becomes a single map lookup, not a query-time crawl

https://leverageai.com.au/the-index-is-the-data/

Scott Farrell — The North Star Prompt

give an agent an orienting purpose rather than a prescriptive checklist, and let its judgement decide how to pursue it

https://leverageai.com.au/north-star-prompt/

Scott Farrell — RAG Was Built for Chatbots

a wiki gives the agent the shape of the world in the first prompt, and walking named edges is in-distribution where RAG query formulation is not

https://leverageai.com.au/rag-was-built-for-chatbots-agents-need-a-wiki/

Scott Farrell — The Model Is Not the Memory

compiled context is durable, personal long-term memory; comprehension is paid once and amortised, while model access is a rented, depreciating input

https://leverageai.com.au/the-model-is-not-the-memory/

Scott Farrell — Team of One

a solo operator running shared agent infrastructure can out-iterate a larger organisation because the marginal cost of a new agent collapses

https://leverageai.com.au/team-of-one/

Scott Farrell — Nightly AI Decision Builds

apply CI/CD discipline to a drifting AI system so that regressions surface as a visible queue rather than a silent confidently-wrong answer

https://leverageai.com.au/nightly-ai-decision-builds/

Scott Farrell — Fast-Slow Split

separate conversational responsiveness from heavy cognition into parallel lanes; the frontier model is banned from the latency-bound hot path

https://leverageai.com.au/fast-slow-split/

Scott Farrell — The Drone Is Not the Weapon

a capability can exist long before its economics arrive; the arrival of the economics, not the capability, is the event that changes the world

https://leverageai.com.au/the-drone-is-not-the-weapon/

Scott Farrell — AI Learning Flywheel

a compounding loop where each pass leaves the system better built, so the artifact carries the compounding rather than the human

https://leverageai.com.au/the-ai-learning-flywheel/

Primary Research & Standards Bodies

Anthropic — Prompt caching [2]

cached prefix tokens are billed at roughly one-tenth of the base input rate, so re-sending an unchanged conversation prefix costs a fraction of processing it fresh

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

About This Reference List

Compiled July 2026. All URLs verified at time of compilation. Regulatory documents and standards specifications are subject to revision — check primary sources for the most current versions.

Some links to academic papers and vendor research may require free registration. Government and standards body publications are freely accessible.

↑ Prev Next ↓