AI Architecture · Field Guide
The Scout and the Senior: Swap the Brain, Keep the Transcript
Frontier-quality agent decisions don’t come from a bigger model. They come from where you put the model swap — and from refusing to summarise the one thing your judgement actually needs.
Scott Farrell · LeverageAI · leverageai.com.au
📚 Read the full field guide
Go deeper — the complete ebook works the full economics, takes apart a production build end to end, and shows how the same machine becomes an ingester, a query agent and a janitor: The Scout and the Senior →
TL;DR
- Split the agent in two: a cheap, read-only scout explores thoroughly and freezes the full conversation; a frontier senior inherits that exact transcript, mid-conversation, and emits one terminal decision.
- Don’t summarise the scout’s walk for the senior. Dead ends are informative — summaries delete them. The standard multi-agent design compresses at exactly the wrong boundary.
- Emit one terminal decision document, not many small mutation calls — and let prefix caching make the whole append-only transcript the cheapest shape, not the most extravagant.
I built two agents that make decisions for a living. One ingests a source into a knowledge graph — reading, cross-referencing, deciding where a new idea belongs and what it connects to. The other maintains that graph, deciding what to merge, split, and retire. Both need genuine judgement. Both also need to read an enormous amount before they can exercise it. For a while I paid for that reading the obvious way: a strong model did the whole job, exploration and decision alike. The bill was awful, and the awful part wasn’t the decision. It was the reading.
That is the observation this piece is built on. In most agent systems that feel expensive, you are paying frontier prices to read, not to decide — and reading is precisely the thing a cheap, cached model now does almost for free. The trick is to separate the two jobs cleanly, hand the whole exploration across a single seam, and place your one expensive model exactly where the value is.
The handoff everyone gets backwards
Start with the mainstream pattern, because it is so close to right that its one flaw is easy to miss. You have a hard task, so you fan out: sub-agents go and explore, each comes back with a tidy summary, and a decider reads the summaries and chooses. Anthropic describes the agentic workload plainly — agents are systems where the model “dynamically direct their own processes and tool usage” and may “operate for many turns” on open-ended problems.1 Fanning out and summarising is the natural way to keep all those turns affordable.
But look at where the compression lands. The sub-agent explored — it checked things and found them irrelevant, it hesitated, it followed an edge and abandoned it. Then it wrote a summary, and every one of those signals evaporated. What reaches the decider is the polished conclusion, stripped of the reasoning texture that made it trustworthy. The lossy step sits at exactly the seam where judgement is about to happen.
Dead ends are informative, and summaries delete them. The decider isn’t just missing what the scout found — it’s missing what the scout ruled out, and why.
So invert it. Don’t summarise the exploration for the decision-maker. Hand over the raw exploration transcript, and swap the brain mid-conversation.
Swap the brain, keep the transcript
Concretely, the pattern is two roles across one shared conversation:
The scout is the cheapest model that can still walk the territory without getting lost. It gets a read-only toolbelt — navigate, fetch a page, pull a code skeleton, grep, read a file — and a single instruction: explore thoroughly, don’t decide anything. Every tool result is written straight back into the conversation, so the transcript accumulates the entire walk. It has no write tool and no decision tool. It ends its turn by calling one thing: review_done.
The handoff is the whole idea. You keep the entire conversation history — every page read, every dead end — and you change three things at once: swap in the frontier model, rewrite the system prompt (“a junior has prepared this exploration; your job is to finalise it”), and grant the write tool. Same transcript, new brain, new remit.
The senior now inherits complete situational awareness for the price of reading it once. It can see not just the conclusions but the shape of the search: where the scout was confident, where it doubled back, which edges it left unexplored. It reads all of that and emits a single terminal decision.
The core move
The conversation is the working set — pre-materialised by a model priced for reading. Swapping model, tools, and framing while keeping the token history is the cheapest possible way to transfer complete situational awareness from a commodity explorer to a frontier judge.
This is not the same as the micro-agent patterns you may already run — a Router handing to Supervisors handing to Workers, or a Director convening a Council of specialists. Those split work across agents that each hold their own context. This splits one task along a time seam, and passes the whole context across it intact. The novelty is the mid-conversation model swap with transcript inheritance — a handoff with zero summarisation loss.
The senior’s blind spot — and the confession field
The pattern has one real failure mode, and it is worth naming before you ship it. The senior’s entire view of the world is framed by the junior’s walk. If the scout never visited a region, the senior cannot see that it’s missing — junior blindness becomes senior blindness. A frontier model reasoning brilliantly over an incomplete map is still reasoning over an incomplete map.
Two cheap mitigations close most of the gap, and they cost almost nothing:
- Inject the top-level map unconditionally. Whatever the scout walked, always give the senior the one-screen overview of the whole territory as a persistent header. Now it can at least spot unexplored ground adjacent to the change it’s about to make.
- Require a confession field. Make
review_donedeclare its own uncertainty: regions I didn’t explore, claims I couldn’t verify, places I was guessing. The junior confessing its blind spots is what lets the senior aim its extra reads — because you gave it read tools too — instead of spot-checking at random.
In practice, the frontier model earns its price exactly here: it re-reads the handful of edges where the junior admitted it was unsure, confirms or corrects them, and only then commits. Exploration stabilises before the decision lands — the same “place it before you claim it” discipline that governs any progressive, coarse-to-fine process.
One terminal decision, not N small ones
Now the economics, because they are the reason the pattern is affordable rather than merely elegant. When the senior finally acts, it is tempting to let it act incrementally: add a source, then add an edge, then update a claim, three tool calls. Don’t. Every tool round-trip re-processes the entire growing conversation as input. Three small mutation calls mean roughly three passes over an ever-larger context — and on a frontier model, that is how a decision that should cost cents ends up costing dollars.
Instead, have the senior emit one big terminal decision document: a single structured output that describes every change at once — pages to create, claims to update, edges to draw. One expensive model, approximately one read of the gathered context, one generation. The scout paid the token-heavy exploration bill on a commodity model; the senior pays once, at the decision point, and no more.
# one terminal mutation document — not N calls { "create_pages": [{ "id": "concept.zero-loss-handoff", "summary": "…" }], "update_claims": [{ "page": "framework.micro-agents", "claim_id": 4, "text": "…" }], "edges": [{ "from": "concept.zero-loss-handoff", "to": "framework.decision-authority", "type": "governed-by" }], "confession": { "unverified": ["…"], "not_explored": ["…"] } }
And that single document turns out to be more than an optimisation. It is a governance artefact. Because the model only proposes and deterministic code applies, you can lint the document before it touches anything real — do the referenced pages exist, are the edge types legal, are the claims being updated actually there? — and reject it back to the senior for one repair round if not. Apply it as a single commit and you get transactionality and replay for free. The scout’s transcript plus the mutation document together are a complete, re-runnable record of what the system knew and what it decided: a decision attestation package for every change the system ever makes. When a choice looks wrong six months later, you can replay exactly what was in front of the model.
Why the cheap, cached model is the one that changed everything
All of this rests on a shift most people under-weight. The frontier model was never the blocker for judgement — top models have been “smart enough to be the senior” for a while. What was uneconomical until recently was reading everything. An exploration-heavy agent spends the overwhelming majority of its tokens on the scout’s walk, and the binding constraint was the unit cost of comprehension, not the quality of the final call.
Two developments removed that constraint. First, cheap models got just smart enough to explore reliably — and past “just enough intelligence,” a smarter scout barely improves the walk while a cheaper scout buys you more walking. Second, and more important, providers started caching the conversation prefix. Without caching, an N-turn agent loop is quadratic: every turn re-processes the whole growing prefix, so cost climbs with the square of the conversation length. With automatic prefix caching, everything the model has already seen is billed at a fraction of the rate — on the order of a tenth2 — so each turn’s marginal cost is roughly just its new tokens. The loop goes from O(N²) to near-linear, and near-constant per turn. In my own production wiki work, that lands at roughly three-hundredths of a cent per turn on the cheap models, regardless of how deep the conversation runs.
Caching is a complexity-class change, not a discount
An append-only transcript is the cache-optimal shape. The moment you reorder tools, edit history, or change the system prompt mid-stream, you invalidate the cache and fall back toward quadratic. Your scout-to-senior handoff does exactly that — once, deliberately, at the point of maximum value. Freeze the prefix for the whole walk; pay for one cache break at the brain swap.
Notice that this is the same insight as the one-terminal-call rule, seen from the other end. Both say: minimise the number of expensive passes over the transcript, and let the cheap, cached passes be as many as they like.
The barbell, and the mid-tier smell
Once you sort your cognition into “exploration” and “judgement,” each pile lands on a natural price point. Exploration is breadth-bound and judgement-light — buy the cheapest model that walks the map. Judgement is quality-bound and volume-light — buy the best model, and the bill stays small because it runs once. The middle gets squeezed from both sides: worse than the cheap model per token at reading, worse than the frontier model per decision at deciding.
Which yields a design heuristic you can run in reverse. If a task seems to want a mid-tier model, that’s a smell. It usually means gathering and judging haven’t been separated yet. Mid-tier demand is architectural debt announcing itself — pull the task apart into a cheap scout and a frontier senior, and the middle disappears.
It also changes how you choose models. When the scout is a deliberately commodity slot, a model swap becomes an auditable bake-off: run the same exploration on two candidates and diff the walks — did the challenger visit the pages that mattered, or wander? The popularity rankings of people paying per token under real agentic load are a better signal for the scout tier than benchmarks measuring peak intelligence, because they reveal the price-competence frontier rather than the ceiling. Model choice stops being a bet and becomes a procurement decision you can re-run whenever the market shifts.
How to build it
What most teams ship
- Sub-agents explore, then summarise for a decider
- One strong model does exploration and decision alike
- Incremental mutation calls, each re-reading the context
- History edited mid-stream, quietly breaking the cache
- A mid-tier model chosen as a “balanced” compromise
The two-speed agent
- Scout explores read-only; the full transcript is kept
- Swap model + tools + system prompt over the frozen transcript
- Senior emits one terminal, lintable decision document
- Append-only prefix, one deliberate cache break at the swap
- Cheapest cached scout + smartest frontier senior, nothing between
A minimal build is honestly small. Give the scout a read-only toolbelt and a frozen system prompt so the whole walk caches. Route its exploration on the cheapest model that passes your path test. Freeze the transcript, swap in the frontier model with a “finalise this” prompt and a write tool, and always inject the top-level map. Require a confession field in review_done. Take the senior’s single mutation document, lint it, apply it deterministically as one commit, and keep both the transcript and the document as your audit trail.
One honesty pass, because the pattern is strong enough to state without swagger. It does not help where the task is closed-world and tool-shaped — an agent whose entire knowledge lives in its APIs doesn’t need a scout. The pattern earns its keep wherever difficulty is context-depth wearing an intelligence costume: the decision is simple once you’ve read enough, and reading enough is the expensive part. That describes an enormous amount of real agent work — triage, routing, drafting, reconciling — and almost none of it needs the frontier model for anything but the final call.
Spend frontier cognition on judgement, commodity cognition on gathering. The cheap model got smart enough to be the scout; the economics arrived later — and the economics were the event.
Go deeper
This article is the spine of a longer field guide — The Scout and the Senior — which works the full economics (the O(N²)-to-linear arithmetic, the mid-tier smell, real per-turn costs), takes apart one production build end-to-end, and shows how the same machine becomes an ingester, a query agent, a janitor and a curator by changing only its North Star. If you’re building multi-stage agents and the “smart model does everything” bill is climbing, read the ebook — the link is in the post above. And if you’d like a second pair of eyes on where your model swap should sit, that’s exactly the work we do at LeverageAI.
References
- [1]Anthropic. “Building Effective Agents.” — “systems where LLMs dynamically direct their own processes and tool usage” and “operate for many turns” on open-ended problems. https://www.anthropic.com/research/building-effective-agents
- [2]Anthropic. “Prompt caching.” Cached prefix tokens are billed at roughly one-tenth of the base input rate, so re-sending an unchanged conversation prefix costs a fraction of processing it fresh — the mechanism that turns an append-only agent loop from quadratic to near-linear per-turn cost. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- [3]Aider. “Repository map.” Battle-tested precedent for the scout’s read layer: parse code with tree-sitter, rank symbols by importance, and render signatures and structure, not full bodies, within a token budget. https://aider.chat/docs/repomap.html
Discover more from Leverage AI for your business
Subscribe to get the latest posts sent to your email.
Previous Post
The North Star Prompt: Stop Writing Specs for Models That Can Think