Part I · Why Retrieval Thinks at the Wrong Time

The Re-Reading Tax

Your retrieval system re-reads your entire world every time you ask it a question. That bill is bigger than you think — and you stopped noticing it a long time ago.

Picture the most over-qualified assistant you have ever hired. They have read everything you have ever written. Every email, every document, every decision. And yet, every single time you walk into a meeting with them, they introduce themselves again. They have no memory of yesterday's conversation. They re-read the relevant files on the spot, in front of you, and then offer a confident answer assembled from whatever they managed to skim in the last ten seconds.

That is what most teams have actually built when they put a large language model on top of their own documents. Ask it something on Monday and it searches. Ask it the same shape of question on Tuesday and it searches again — from scratch, as though it had never seen your corpus before. It is a very well-read stranger who re-introduces themselves at every meeting.

This book is about why that happens, and why the fix is none of the things you would reach for first. Not a bigger model. Not a better vector database. Not a cleverer prompt. The fix is a change in when the thinking happens. But before we can move the thinking, we have to look honestly at what the current approach costs — and where it genuinely earns its keep.

What RAG is, and where it is genuinely good

The default pattern is retrieval-augmented generation — RAG. Spelled out, it is mechanical and sensible: chop your documents into chunks, turn each chunk into a vector, and at the moment a question arrives, fetch the handful of chunks that look most similar to the question and hand them to the model to write an answer.
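A minimal sketch of that loop may make the mechanics concrete. It assumes a toy word-count embedding in place of a real embedding model, and the names `embed`, `cosine`, and `retrieve` are illustrative rather than any library's API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors look most like the question's."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical corpus, already split into chunks.
chunks = [
    "The renewal clause allows either party to extend the contract for one year.",
    "Churn rose in the third quarter because two enterprise accounts left.",
    "Support hours are nine to five on weekdays in every region.",
]
top = retrieve("What does the renewal clause say?", chunks, k=1)
print(top[0])
```

The retrieved chunk would then be pasted into the model's prompt. Note that nothing in this loop persists between questions: every call re-embeds the question and re-ranks the corpus.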

For a whole class of questions, this works beautifully. "Find me the paragraph about the renewal clause." "What did the Q3 report say about churn?" When the answer is a passage that lives in one place and the raw text is what you actually want, vector search is fast, cheap, and good. We should say this plainly, because the rest of the book is a critique, and a critique that pretends RAG is worthless is a critique nobody serious will trust. When the raw data itself is what you need, RAG can perform well.

The trouble starts the moment your questions stop being about a single passage and start being about how things relate.

Failure one: there is no slot for relationships

Vector similarity matches text that looks like your query. That is all it does. It has no representation for "this requirement is defined in document A, constrained by document B, and exceptioned by a clause buried in contract C." The relationship between those three documents is the answer — and the relationship is precisely the thing chunk similarity cannot see.

This is not a fringe complaint. It is the reason Microsoft built GraphRAG in the first place. Their research team states it flatly: baseline RAG "fails on global questions directed at an entire text corpus … since this is inherently a query-focused summarization task, rather than an explicit retrieval task."¹ Elsewhere they put it in plainer words: baseline RAG "struggles to connect the dots … when answering a question requires traversing disparate pieces of information through their shared attributes."²

Chunking makes it worse. Fixed-size splits cut straight through the logic of a document. Microsoft's own architecture guidance warns that fixed-size chunking "isn't recommended for text that requires semantic understanding and precise context," because "relevant context that exists across multiple chunks might not be captured."³ Each chunk becomes an island, divorced from its place in the larger argument. You get half a requirement in one fragment and half in another, and neither makes sense alone.
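To see the island effect in miniature, here is a sketch that splits one hypothetical requirement on a fixed word count. The sentence and the chunk size are invented for illustration; real pipelines usually count tokens or characters, but the failure has the same shape.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split on a fixed word count, ignoring sentence and clause boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical requirement: a rule followed by its exception.
requirement = (
    "Refunds are available within 30 days of purchase, "
    "except for enterprise customers covered by a signed SLA."
)
chunks = fixed_size_chunks(requirement, size=8)
for chunk in chunks:
    print(repr(chunk))
```

The rule lands in the first chunk and its exception in the second, so a query that matches only the first chunk retrieves a requirement that is no longer true as written.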

Name the thing honestly, because we will be fighting it for the rest of the book: the enemy is the query-time discovery model — the unspoken assumption that "make the AI know my domain" means "bolt on a vector store and search it live, on every question."

Failure two: the "retrieve more" reflex backfires

When retrieval misses, the instinct is to retrieve more. Widen the net. Pull the top fifty chunks instead of the top five. Stuff a bigger context window. This is exactly backwards, and the evidence against it is now overwhelming.

>30%: accuracy drop when the relevant fact sits in the middle of a long context rather than at either end.⁴

99.3 → 69.7: even GPT-4o's score collapses as the context grows under the NoLiMa benchmark; at 32K tokens, 11 leading models fall below half their short-context baseline.⁵

18 / 18: frontier models tested by Chroma all degrade as input grows, "often in surprising and non-uniform ways."⁶

The classic study here is "Lost in the Middle": model accuracy is highest when the relevant information sits at the very start or very end of the context and sags when it is buried in between. The NoLiMa benchmark sharpened the point for the long-context era, and Chroma's testing across eighteen frontier models found the same shape everywhere. The conclusion is uncomfortable for anyone whose plan is "use a bigger window": more chunks do not mean better answers. A larger context window is not a substitute for a better-shaped context.

The tax, paid twice

So the query-time discovery model pays a tax, and it pays it twice. First on discovery: many round-trips to assemble one answer, slow and token-heavy⁷,⁸. Then on the payload: even when retrieval "succeeds," the bloated stack of chunks it returns actively dilutes the model's attention, by the very mechanism the studies above describe.

Give an agent search tools, even good ones like RAG, and there are still many circumstances where discovery is slow, token-bloated, and returns things you did not want. It takes many searches just to understand the relationships.

There is a deeper way to say what is going wrong. RAG retrieves text. It does not retrieve understanding. When you ask a relationship-shaped question, the understanding you need is not sitting in any single chunk waiting to be matched — it has to be assembled by reasoning across documents. Knowing is not the same as retrievable. And when the knowing was never built, no amount of searching at query time can conjure it.

A note on two neighbouring problems, so we can set them aside cleanly. If you have also hit context bloat from the tooling side — a stack of MCP servers eating fifty thousand or more tokens before your agent thinks a single thought⁹ — that is a real problem, but a different one; we have written about why code-first agents beat MCP on context elsewhere. And the idea of a "routing index" that gives an agent information scent before it searches is its own subject, which we cover in our work on Context Engineering and the Cognition Supply Chain. This book is about what happens upstream of retrieval entirely.

A scene you will recognise

A support lead asks the company assistant: "How does our refund policy interact with the enterprise SLA exception for region X?" The system returns three chunks: one mentions refunds, one mentions the SLA, one mentions region X. Each is individually relevant. None of them answers the interaction, because the interaction is a relationship and the relationship is not in any chunk. So the support lead runs five more searches and stitches the answer together by hand. The diagnosis is simple: the answer was a relationship, and relationships were never encoded.

That last point is the hinge for everything that follows. The understanding was never built. It is re-derived, badly, under latency pressure, by similarity math that cannot see relationships. So here is the question that opens the next chapter, and the rest of this book: what if we built the understanding before the question instead of after it?

02

The Index Is the Data

Stop searching your data. Pre-digest it. The win was never a better search at query time — it is a map built before the question is ever asked.

In April 2026, Andrej Karpathy gave a name to something a number of us had been quietly doing for years. He called it the "LLM Wiki," and described it like this: rather than retrieving from raw documents at query time, "the LLM incrementally builds and maintains a persistent wiki — a structured, interlinked collection of markdown files," so that "the knowledge is compiled once and then kept current, not re-derived on every query."¹⁰

I had been running a version of this over my own inbox for a couple of years before there was a name for it. When the field's most-watched engineer independently names the thing you have been doing, it is worth stopping to write down precisely why it works. That is this chapter.

Move the work before the question

Recall the diagnosis from the last chapter: RAG re-derives understanding on every question because the understanding was never built. The relationships in your domain are real, but they live nowhere — they get re-inferred, under latency pressure, by similarity math that cannot represent them.

The move is simple to state and consequential to adopt. Do the work off-cycle, once per source, before anyone asks anything — and bake the result into a structure that holds relationships natively. You are not just indexing raw data; you are pre-digesting it into a conceptual worldview. The intelligence gets injected before the user ever asks a thing.¹¹

This is the whole reframe, and it deserves a sentence you can carry around: the index stops being a pointer to the data, and the index becomes the data. Everything else in this book is a consequence of taking that sentence seriously.

The artefact: claims and edges, not chunks

The word "wiki" undersells it, so let us be precise about the artefact, because the precision is the point. A page here is not prose. It is a set of claims — atomic, dated statements — and edges — typed links between them: related, supersedes, a link to [[Project X]]. The claims hold the facts. The edges hold the relationships. And once the relationships live in the structure, the model navigates them instead of re-inferring them from chunk similarity every single time.
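One way to picture the artefact is as a small data structure. The sketch below is an assumption about shape, not the format any particular tool uses; the class names `Claim`, `Edge`, and `Page` and the sample date are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Claim:
    """An atomic, dated statement on a page."""
    on: date
    text: str


@dataclass(frozen=True)
class Edge:
    """A typed link from one page to another."""
    kind: str    # e.g. "related" or "supersedes"
    target: str  # title of the linked page


@dataclass
class Page:
    title: str
    claims: list[Claim] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def neighbours(self, kind: str) -> list[str]:
        """Titles reachable from this page by edges of the given type."""
        return [e.target for e in self.edges if e.kind == kind]


page = Page("Datastore Decision")
page.claims.append(Claim(date(2024, 1, 8), "decided to use Postgres"))
page.edges.append(Edge("related", "Project X"))
print(page.neighbours("related"))
```

The facts live in `claims` and the relationships live in `edges`, so following a relationship is a field lookup rather than a similarity search.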

Call it the map over the swamp. Language models are superb at navigating structured, interlinked text and clumsy at wading through raw piles of it. Transform the messy inputs into a clean map of claims and edges, and the model gets to do the thing it is good at.

RAG vs. the wiki-graph (the reference table for this book)

Dimension	Standard RAG	Wiki-graph
Primary mechanism	Vector similarity on raw text chunks	Parallel map lookups over relationships
Token efficiency	Bloated; noise alongside the signal⁷	Lean; pre-distilled conceptual pages
Relationship-aware	Poor; many sequential searches to link ideas	Native; baked into the edges
Retrieval pattern	Slow, multi-step tool-calling loops	One parallel call to a few precise pages
When the work happens	At query time, every time	Once, off-cycle, then kept current

This is the spine of the argument. Later chapters point back to this table rather than redrawing it.

"The index became the data"

Here is the claim in the most concrete terms I can give it, from the project we will meet properly in Part II: the index became the data. Even with full access to read the original email, search the email database, and run a RAG search, the system basically did not need those tools — it answered a high percentage of questions straight from the wiki.

It is worth being careful about attribution here, because precision builds trust. What Karpathy is owed is the clean public framing of the pattern: a persistent wiki, built incrementally, compiled once and kept current. The added emphasis — that the pre-processing is almost as important as the wiki itself — is my own, drawn from years of watching where the value actually accrued. (For the same reason, a much-quoted line about "Markdown and Git as the only source of truth" lives in the comments under Karpathy's note, not in his own text; we will make the ownership argument on firmer ground in Chapter 4.)

One level up: outputs are cheap, the source is the asset

There is a principle underneath all of this that we have written about as Worldview Recursive Compression: frameworks function as source code, while generated outputs are regenerable artefacts derived from that source. Apply it here and the picture snaps into focus. The wiki-graph is the compiled source code of your corpus, not a cache of it. The raw emails and documents are the raw material; the answers generated from the map are the regenerable artefacts. The map is the durable asset. Lose the map but keep the raw data and you can rebuild it, slowly, by re-running every ingestion and compaction pass; until you do, you are back to being a well-read stranger.

Take the support question from Chapter 1 — "how does our refund policy interact with the enterprise SLA exception for region X?" — and imagine it answered from a wiki-graph instead. There is a Refund Policy page with an edge exception-for → [[Enterprise SLA]] and a region-scope edge to region X. One lookup returns the relationship, intact. Not three disconnected chunks that each mention a piece. The relationship was pre-built, so the answer is already there.
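The lookup described above can be sketched with nothing more than a dictionary. The graph contents are hypothetical and the arrow rendering is an illustrative choice; the point is only that the relationship is read, not inferred.

```python
# Hypothetical wiki-graph: page title -> list of (edge type, target page).
wiki = {
    "Refund Policy": [
        ("exception-for", "Enterprise SLA"),
        ("region-scope", "Region X"),
    ],
    "Enterprise SLA": [("region-scope", "Region X")],
    "Region X": [],
}


def relationships(page: str) -> list[str]:
    """Render every outgoing edge of one page as a readable line."""
    return [f"{page} --{kind}--> {target}" for kind, target in wiki[page]]


answer = relationships("Refund Policy")
for line in answer:
    print(line)
```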

There is one catch the reframe quietly introduces. A map you have to maintain by hand goes stale the day you stop tending it. The whole idea only pays off if the maintenance is automatic — so the next question is the one that defines the rest of Part I: how does the map maintain itself?

03

The Dual-Agent Engine

Two agents, the same tools, opposite jobs — and the one that subtracts is where the intelligence accrues. This is the kernel of the whole doctrine.

A map that maintains itself is the difference between a clever idea and a system you can actually live with. The way you get there is almost suspiciously simple: two agents, given the same toolset and a single shared instruction, doing opposite jobs. One builds. One compacts. The compacting is the part most people skip, and it is the part that does the real work.

This chapter states the engine in the abstract — no inbox, no specific domain — because everything in Parts II and III is this same engine pointed at different corpora. Learn it once here; recognise it everywhere after.

The engine, in one picture

[ New source ]
      |
      v
+--------------------------+    reads current wiki,
|   1. INGESTION AGENT     | -> writes / updates
|        (the builder)     |    claims & edges
+--------------------------+
      |
      |  (page crosses ~12 claims / size threshold)
      v
+--------------------------+    combines, fades stale,
|   2. JANITOR AGENT       | -> converts claims -> edges,
|       (the compactor)    |    spins off new pages
+--------------------------+
      |
      v
[ Leaner, smarter wiki-graph ] -> next source starts from here

The self-maintaining loop. Four moving parts: single-source ingestion, janitor compaction, a North Star, and chronological stacking. Every variant in this book assembles these same four.

Component A — the Ingestion Agent (the builder)

The builder processes inputs one at a time. This constraint matters more than it looks. A single source is small enough to reason about properly; batch ingestion quietly re-introduces the bloat problem we just spent a chapter escaping. The builder reads the current state of the wiki first, then writes — updating an existing claim, drawing a new edge, or spawning a fresh page. It edits the map; it does not append blindly to it.

In practice a single source touches a handful of pages, not the whole graph — Karpathy notes a source "might touch 10–15 wiki pages."¹⁰ That locality is what keeps each ingestion decision high-quality and, crucially, auditable.
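The builder's read-then-write discipline can be sketched as a loop over one source at a time. Everything here is an assumption for illustration: the `decide` callback stands in for the model call, and `toy_decide` is a deliberately naive replacement that files a source under every page it names.

```python
from typing import Callable

Wiki = dict[str, list[str]]  # page title -> claims, oldest first
Edit = tuple[str, str]       # (page title, claim text)


def ingest(wiki: Wiki, source: str,
           decide: Callable[[Wiki, str], list[Edit]]) -> list[str]:
    """Process ONE source and return the pages it touched, for auditing."""
    edits = decide(wiki, source)  # the agent sees the current wiki first
    touched = []
    for page, claim in edits:
        claims = wiki.setdefault(page, [])  # spawn a fresh page if needed
        if claim not in claims:             # edit the map, never append blindly
            claims.append(claim)
            touched.append(page)
    return touched


def toy_decide(wiki: Wiki, source: str) -> list[Edit]:
    """Stand-in for the model: file the source under every page it names."""
    named = [title for title in wiki if title.lower() in source.lower()]
    return [(title, source) for title in named] or [("Inbox", source)]


wiki: Wiki = {"Project X": [], "Vendor Y": []}
email = "Vendor Y confirmed the March scope cut for Project X"
first = ingest(wiki, email, toy_decide)
second = ingest(wiki, email, toy_decide)  # same source again: nothing to add
print(first, second)
```

Returning the list of touched pages is what makes each ingestion auditable: a reviewer can see exactly which part of the map one source changed.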

Component B — the Janitor Agent (the compactor)

Left alone, an append-only wiki degrades into a text dump — exactly the swamp we were trying to avoid. The Janitor is what prevents entropy. It wakes on page bloat: once a page crosses roughly a dozen claims, or a size threshold, it gets handed over for trimming.

Its jobs: combine redundant claims, fade stale ones, spin clusters into their own pages, and — the load-bearing move — convert flat claims into edges. "Met with X on Tuesday" stops being a standalone claim and becomes an edge to [[Project X]]. The whole intent is to reduce the number of outright claims, so that more of the information is held in the edges, in the connections.
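A sketch of the trigger and the claim-to-edge move, under stated assumptions: the threshold of 12 is one reading of "roughly a dozen", and the `is_relational` callback stands in for the judgment the agent actually makes. A real Janitor would also fade stale claims and spin off pages, which this sketch omits.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

CLAIM_THRESHOLD = 12  # "roughly a dozen" in the text; exact value is assumed
WIKILINK = re.compile(r"\[\[(.+?)\]\]")


@dataclass
class Page:
    title: str
    claims: list[str] = field(default_factory=list)
    edges: set[tuple[str, str]] = field(default_factory=set)


def needs_janitor(page: Page) -> bool:
    """The Janitor wakes once a page holds more claims than the threshold."""
    return len(page.claims) > CLAIM_THRESHOLD


def compact(page: Page, is_relational: Callable[[str], bool]) -> None:
    """Combine exact duplicates, then turn relational claims into edges."""
    kept = []
    for claim in dict.fromkeys(page.claims):  # de-duplicate, keep order
        targets = WIKILINK.findall(claim)
        if targets and is_relational(claim):
            page.edges.update(("related", target) for target in targets)
        else:
            kept.append(claim)
    page.claims = kept


page = Page("Alice")
page.claims = [f"note {i}" for i in range(11)]
page.claims += ["Met about [[Project X]] on Tuesday"] * 2
was_bloated = needs_janitor(page)
compact(page, is_relational=lambda claim: claim.startswith("Met"))
print(was_bloated, len(page.claims), page.edges)
```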

The index didn't just get smaller. It got smarter with every janitorial pass.

This is not an exotic idea. It is the oldest idea in cognition, automated. In the "Generative Agents" research, the agents that stayed coherent over long horizons were the ones that periodically generated reflections: higher-order, abstract memories synthesised from scattered raw observations and stored back in the memory stream alongside them. In the paper's ablations, removing the reflection step made the agents' behaviour measurably less believable.¹³ The Janitor is reflection, running on your corpus. It is what stops the system from drowning in its own notes.

Component C — North Star, not rules

The agents are not governed by a long list of prescriptive instructions. They are governed by a single directive — a North Star. I made the agent prompts less prescriptive and more general, and handed them one guiding intent instead of a rulebook. That choice is doing more work than it appears to.

Why does a heuristic beat hard-coded rules here? Because the decision the Janitor faces on every old claim is genuinely a judgment call: is this noise to delete, or context to abstract into an edge? You cannot write that rule in advance for every case. You can give the agent a clear intent — "act as a personal assistant building a consistent worldview; favour recent information over old" — and let it decide. The index does not just get tidier; it gets wiser about what to keep.

Component D — chronological stacking (decay for free)

Append claims in date order, so the oldest float to the top of the page. That is the entire temporal-decay mechanism. No timestamp-delta arithmetic, no time-to-live fields, no expiry jobs.

By putting the older claims at the top, you have handed the model a timeline view. It reads top to bottom, sees how a thought or a project evolved, and can tell at a glance when an older claim has been superseded by a newer one below it — then distil that history into a compressed, relational form. Decay stops being a calculation and becomes something the agent simply reads.
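As a sketch, the whole mechanism is a sort on render. The function name, the line format, and the sample dates are assumptions for illustration.

```python
from datetime import date


def render_page(title: str, claims: list[tuple[date, str]]) -> str:
    """Render claims oldest-first so the page reads as a timeline."""
    lines = [title]
    for on, text in sorted(claims, key=lambda claim: claim[0]):
        lines.append(f"{on.isoformat()}  {text}")
    return "\n".join(lines)


# Hypothetical claims, deliberately supplied out of order.
claims = [
    (date(2024, 6, 3), "moving to Aurora"),
    (date(2024, 1, 8), "decided to use Postgres"),
]
rendered = render_page("Datastore Decision", claims)
print(rendered)
```

There is no expiry logic anywhere in the code. The ordering alone tells a reader, human or model, that the Postgres decision came first and was later overtaken.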

Two agents, one North Star

Component	Job	Trigger
Ingestion Agent	Read wiki, write/update claims & edges	A new source arrives
Janitor Agent	Combine, fade, convert claims → edges, spin off pages	A page crosses ~12 claims

The design choice that makes it self-maintaining: the Janitor gets the exact same tools as the Ingestion Agent — create pages, create edges, combine, delete — and the same North Star. That symmetry is what turns maintenance into genuine executive function rather than a dumb cleanup script.
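Written down as configuration, the symmetry is easy to check. This is a sketch only: the `AgentConfig` shape and the tool identifiers are hypothetical, while the North Star wording and the Janitor's instruction are taken from the text.

```python
from dataclasses import dataclass

NORTH_STAR = (
    "Act as a personal assistant building a consistent worldview; "
    "favour recent information over old."
)
# Hypothetical identifiers for the four capabilities named in the text.
TOOLS = ("create_page", "create_edge", "combine", "delete")


@dataclass(frozen=True)
class AgentConfig:
    name: str
    trigger: str
    task: str
    tools: tuple[str, ...] = TOOLS
    north_star: str = NORTH_STAR


builder = AgentConfig(
    name="ingestion",
    trigger="a new source arrives",
    task="Read the wiki, then update it with this source.",
)
janitor = AgentConfig(
    name="janitor",
    trigger="a page crosses ~12 claims",
    task="Here is a wiki page that needs trimming.",
)
print(builder.tools == janitor.tools, builder.north_star == janitor.north_star)
```

Only the trigger and the one-line task differ. Everything that shapes judgment is shared.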

Watch the four parts move together on a neutral example. A claim is added on day one: "decided to use Postgres." More accumulate over the following weeks until the page crosses the threshold, and a Janitor pass folds the datastore claims into a [[Datastore Decision]] page with an edge superseded-by → [[Move to Aurora]] once a later source arrives. Ingest, accumulate, compact, connect. The raw claims shrink; the relationship survives, sharper than before.

That sharpening is not free, of course. The Janitor burns tokens off-cycle, forever, whether or not anyone asks a question that day. So why is that a better deal than RAG's seductive "free until you ask"? That is the economics — and it is the subject of the next chapter.

04

The Economics of Knowing

Pre-processing looks expensive. RAG looks free. Both impressions are wrong — and the difference is where you decide to spend, and whether the spend leaves an asset behind.

Your competitor can buy your model tomorrow. They can subscribe to the same vector database, copy your embeddings, even hire your engineers. What they cannot buy is two years of compaction — the edges your Janitor drew while they were still re-searching raw documents on every question. That is the whole economic argument in one sentence. The rest of this chapter earns it.

Pre-compute once vs. re-search forever

RAG looks free because it does nothing until asked. That is exactly what makes it expensive. The instant a question arrives, it pays the full discovery tax — and it pays that tax again on the next question, and the next, for every user, forever. The cost is real; it is just spread thinly enough across queries that nobody itemises it.

The wiki-graph pays the cost once per source, off-cycle, and turns retrieval into a single cheap parallel lookup. The cost did not vanish. It moved to where it amortises.

~2.2×: the token cost of advanced, multi-search RAG versus a single-shot baseline, before you count the latency. The lean alternative is one parallel page-pull.⁷

The arithmetic favours the map more than the headline number suggests, because the multi-search loop is not a one-off — it recurs on every interaction, while the pre-processing happens once and then sits there as an asset. Frame it as a balance sheet and the contrast is stark: the index is capital; live search is an operating expense that never stops.

RAG looks free because it does nothing until asked — then it pays full price on every question, forever. The index is an asset. Live search is an expense that recurs.

Freshness versus stability — resolved, not traded

Most "auto-updating" knowledge systems make you choose your failure mode. Freeze the index and it goes stale. Let it rewrite itself freely and it churns into chaos. Pick your poison — unless the maintenance itself is intelligent.

We have already built the resolution, in Chapter 3, so there is no need to re-argue it here: chronological stacking keeps freshness legible, and the Janitor keeps stability by compaction. Old claims float up and get folded into an edge: not frozen, not churned. Because the index improves on each pass, the economics compound rather than decay. A system that got messier with every update would make pre-processing a liability. One that gets leaner and smarter makes it an investment.

Architecture compounds; models do not

A model upgrade is a one-time lift. You move from one version to the next, outputs improve a little, then plateau until the next release. No compounding. Architecture is different: every edge the Janitor draws makes the next retrieval better. The system improves because the map improves, not because the weights changed.

We think the durable asset is the map, not the model. You rent the model; you own the map. This is the same compounding loop we have described as a kernel flywheel — better map leads to better retrieval leads to better outputs leads to patterns worth keeping leads to a richer map. And it is where the moat actually lives. A compacting asset that compounds beats an expense that recurs, every time, on a long enough horizon.

And you own all of it

The artefact is plain markdown. That is not an aesthetic preference; it is the ownership story, and for a business it may be the most important part. A markdown wiki is inspectable by a human, diffable in Git, portable across model providers, and locked inside no vector database. When the AI gets something wrong, you open the file and see why. When the Janitor makes a bad consolidation, you revert the commit.
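Because the pages are plain text, a Janitor pass can be reviewed with ordinary diff tooling. The sketch below uses Python's standard `difflib` to produce the kind of unified diff Git would show; the page contents and file names are hypothetical.

```python
import difflib

# Hypothetical page before and after one Janitor pass.
before = """Project X
2024-03-02: agreed scope cut with vendor
2024-03-09: vendor confirmed scope cut
""".splitlines(keepends=True)

after = """Project X
2024-03-09: scope cut agreed and confirmed by vendor
involves: [[Vendor Y]]
""".splitlines(keepends=True)

diff = list(difflib.unified_diff(
    before, after, fromfile="a/project-x.md", tofile="b/project-x.md"
))
print("".join(diff))
```

Two claims became one claim plus an edge, and the change is legible line by line. If the consolidation is wrong, reverting it is an ordinary version-control operation.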

The wider field is converging on exactly this. Anthropic, writing on context engineering, names "structured note-taking, or agentic memory" — where "the agent regularly writes notes persisted to memory outside of the context window" — as a first-class pattern, and states the trade-off that makes the whole book's case in nine words: "runtime exploration is slower than retrieving pre-computed data."¹⁴ Anthropic now even ships a file-based memory tool so agents can "build up knowledge bases over time … and reference previous learnings without having to keep everything in context."¹⁵ The durable asset is the map; every vector index built on top of it is disposable and rebuildable from the text.

The shape of the spend

Two teams, same model, same corpus. Team A runs live multi-search RAG: every one of ten thousand monthly questions pays the discovery tax in full. Team B pre-processes once per source and answers from roughly six pages per question. The difference is not the technology — it is when the tokens were spent, and whether the spend left an asset behind. Team A's spend evaporates with each query. Team B's spend accumulates into a map that makes the next query cheaper. (An illustration of the economics, not a benchmark.)
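The two-team comparison can be put into a few lines of arithmetic. Only three inputs come from the text (ten thousand monthly questions, six pages per question, the ~2.2× multiplier); the baseline token count, the page size, and the monthly upkeep figure are invented assumptions, so treat the output as a shape, not a measurement.

```python
def live_rag_tokens(questions: int, baseline: int, multiplier: float) -> float:
    """Team A: every question pays the multi-search discovery tax."""
    return questions * baseline * multiplier


def map_tokens(questions: int, pages: int, tokens_per_page: int,
               upkeep: int) -> int:
    """Team B: cheap page pulls plus off-cycle ingestion and Janitor upkeep."""
    return questions * pages * tokens_per_page + upkeep


def break_even_questions(baseline: int, multiplier: float, pages: int,
                         tokens_per_page: int, upkeep: int) -> float:
    """Monthly question volume at which the two approaches cost the same."""
    saving_per_question = baseline * multiplier - pages * tokens_per_page
    if saving_per_question <= 0:
        raise ValueError("the map never pays back under these assumptions")
    return upkeep / saving_per_question


QUESTIONS = 10_000       # from the text
PAGES = 6                # from the text
MULTIPLIER = 2.2         # from the text (multi-search vs single-shot)
BASELINE = 4_000         # ASSUMED tokens for one single-shot RAG answer
TOKENS_PER_PAGE = 500    # ASSUMED size of one distilled wiki page
UPKEEP = 20_000_000      # ASSUMED monthly ingestion + Janitor tokens

team_a = live_rag_tokens(QUESTIONS, BASELINE, MULTIPLIER)
team_b = map_tokens(QUESTIONS, PAGES, TOKENS_PER_PAGE, UPKEEP)
threshold = break_even_questions(
    BASELINE, MULTIPLIER, PAGES, TOKENS_PER_PAGE, UPKEEP
)
print(round(team_a), team_b, round(threshold))
```

Under these assumed inputs the map wins at ten thousand questions a month, but the break-even sits at roughly 3,450 questions. Below that volume the upkeep outweighs the savings, which is the honest reading of "the cost moved to where it amortises".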

That is the doctrine, the worldview, and the economics. Enough theory. The next two chapters show the whole thing running in the wild — a system I fed one email at a time, for two years, and what it could do by the end.

05

Part II · The Two-Year Build: A Worldview From One Email at a Time

The Build

The doctrine, running in the wild. One mundane instruction, repeated for two years, that produced something uncanny.

I gave an AI one job for two years: read one email at a time, read the wiki, update it. Then do it again. Forever.

That is the entire setup. There is no clever trick hiding in it — no proprietary model, no exotic retrieval stack. And yet what it grew into was an assistant that understood what I thought was important, in a way that still slightly surprises me to describe. Part I gave you the pattern in the abstract. This chapter is the pattern lived: the same four components from Chapter 3, running over an inbox, with all the texture the abstract version leaves out.

The North Star that ran it

Everything started with one sentence. I gave it a North Star: it was my personal assistant, and I was building a worldview for that assistant. And one directive about time: favour recent information over old.

I deliberately made the agent prompts less prescriptive and more general. This runs against the instinct, which is to over-specify — to write pages of rules anticipating every situation. The opposite worked better. With a clear North Star, the agent handled the edge cases a rulebook never could have enumerated, because it was reasoning toward an intent rather than matching against a checklist. (We covered why a heuristic beats hard-coded rules in Chapter 3; here you are watching it pay off.)

Ingestion in practice

The builder's loop was exactly the abstract one from Chapter 3, made concrete. One email arrives. The agent reads the current wiki, then decides what to do: add a claim, update an existing one, draw an edge, or spin up a new page. It was free to create new pages and new connections whenever the email warranted it.¹⁰

The discipline of one email at a time is what kept the quality high. Each email was a small, bounded decision — the agent was editing the map, not dumping text into it. A claim looked like a dated, atomic line on a page: on a [[Project X]] page, "2024-03-02 — agreed scope cut with vendor." An edge looked like a link folded in later: [[Project X]] —involves→ [[Vendor Y]]. New claims were appended in date order, which — per Chapter 3's fourth component — meant the oldest naturally floated to the top of each page.

The Janitor in practice

Then there was the second agent. Once a page passed about a dozen claims, I handed it to the Janitor with the simplest possible instruction: here is a wiki page that needs trimming. It had the same toolset as the builder and the same North Star — that symmetry, again, was the point.

What it did with that freedom is the part that surprised me. It combined related claims, faded the old ones, spun clusters into their own pages, and converted claims into edges — so that more of the information ended up held in the connections than in standalone statements.¹³ It autonomously decided whether an old claim was irrelevant noise to be deleted, or historical context that needed to be abstracted into a connection. That is a judgment call, made hundreds of times, without my supervision — and the North Star kept it on track.

One email's journey through the loop

T0 — an email arrives. "Vendor Y confirmed the March scope cut."

Ingest. The builder adds a dated claim to [[Project X]]. It notices Vendor Y has no page, and creates [[Vendor Y]].

Weeks later — the page hits 13 claims. The Janitor wakes. It folds three scope-related claims into a single scope summary, fades a superseded date, and draws [[Project X]] —involves→ [[Vendor Y]].

Result. The raw email is now redundant to the worldview. The map holds the relationship. Retrieval, when it comes, will read the map — not the inbox.

What made it feel different

By the end it was not storage. Storage is dumb; this was not. It had a model of what I considered relevant and important. I will save the payoff metric for the next chapter, because it deserves its own page — but the texture is worth naming now.

It could truth the data it wanted, and it understood the relationships.

That verb — "truth" as a thing you do to data — is the most honest word I have for it. The system did not just retrieve the right facts; it had a stance on which facts were load-bearing and how they connected. Two years of one-email-at-a-time ingestion, and a Janitor that never stopped consolidating, had compiled a worldview.

The obvious question is what all that quiet work actually bought at the moment I finally asked it something. So let us watch a single retrieval, up close.

06

The Payoff and the Fallback

Six pages. One parallel call. No follow-up search. And then the honest part: where this still bites, and where RAG is still the right tool.

Here is the moment two years of compaction paid off. When the retrieval agent went looking for an answer, it requested about half a dozen wiki pages in one parallel call, and it was so accurate that it did not need to follow up with any email searches. It knew straight away which connections it needed. The five-search crawl from Chapter 1 collapsed into a single lookup.

And the part that still lands hardest: even with full access to read the original email, search the email database, and run a RAG search, the agent basically did not need those tools. It answered a high percentage of questions straight from the wiki. The index had become the data. You have, in effect, built a custom semantic layer that mirrors long-term human memory consolidation.¹³ The relationships are pre-resolved, so the lookup is trivial.

6 pages · 1 parallel call · no follow-up search

Its understanding of "what I thought was relevant and important" was, in the only word that fits, crazy accurate.
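A sketch of what "one parallel call" can look like, using Python's standard thread pool. The in-memory `WIKI` dictionary stands in for markdown files on disk, and the page titles and bodies are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical wiki: in practice each entry would be a markdown file.
WIKI = {
    "Project X": "involves [[Vendor Y]]; see [[Scope]]",
    "Vendor Y": "supplier on [[Project X]]",
    "Scope": "cut agreed in March",
    "Budget": "tracked against [[Scope]]",
    "Timeline": "depends on [[Vendor Y]]",
    "Contacts": "leads for [[Project X]]",
}


def read_page(title: str) -> str:
    """Stand-in for reading one page from disk."""
    return WIKI[title]


def fetch_pages(titles: list[str]) -> dict[str, str]:
    """Fetch every requested page in a single parallel batch."""
    with ThreadPoolExecutor(max_workers=len(titles)) as pool:
        bodies = pool.map(read_page, titles)  # results keep the input order
        return dict(zip(titles, bodies))


pages = fetch_pages(list(WIKI))
print(len(pages), "pages in one batch")
```

The retrieval agent names the pages it wants up front and receives them together. There is no loop of search, read, and search again.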

This is exactly where the wider field has landed, and it is worth letting an outside voice say it. Anthropic, writing on context engineering, names the precise trade-off this whole book rests on, in nine words: "runtime exploration is slower than retrieving pre-computed data."¹⁴ Pre-compute the understanding, and retrieval is cheap. The same write-up elevates "structured note-taking, or agentic memory" to a first-class pattern — the agent writing notes outside the context window and pulling them back later. The industry converged on pre-processing because the alternative does not scale.

One deliberate exclusion: keep the numbers out

There was one counter-intuitive rule that made retrieval sharper, not weaker. I told the agents not to record actual numbers and figures in the wiki. So when I genuinely needed a hard metric, I did have to go back to the email for it — but the wiki knew exactly which source to point at.

This is a feature, not a gap, and the reasoning generalises. Relationships and figures have different failure modes. A stale relationship is usually still directionally true; a stale number is simply a wrong answer. So keep the graph for relationships — the thing it is uniquely good at — and let it route you to the authoritative source for the digit. The graph optimises for worldview; it knows precisely which raw document holds the metric, which means no blind semantic search when precision matters.
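In the build described here the rule was given to the agents as an instruction. The sketch below enforces a similar rule mechanically, purely as an illustration: the regular expression, the placeholder text, and the `email-1042` identifier scheme are all assumptions.

```python
import re
from dataclasses import dataclass

# Matches standalone figures such as 4.2%, 1,200 or $350, but not the 3 in "Q3".
FIGURE = re.compile(r"(?<![A-Za-z\d])[$€£]?\d+(?:[.,]\d+)*%?")


@dataclass(frozen=True)
class Claim:
    text: str       # the relationship, with hard figures removed
    source_id: str  # pointer to the authoritative email (hypothetical scheme)


def record(text: str, source_id: str) -> Claim:
    """Store the claim without its digits, but keep the route to the source."""
    return Claim(FIGURE.sub("[see source]", text), source_id)


claim = record("Q3 churn came in at 4.2%, up from 3.1%", "email-1042")
print(claim.text)
print("authoritative figure lives in:", claim.source_id)
```

The wiki keeps the direction of the relationship (churn went up) and an exact pointer to where the digits live, so a stale number can never be served from the map.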

This setup essentially mirrors human memory consolidation during sleep — transferring short-term, messy daily logs into long-term, highly efficient relational networks.

That analogy, flagged back in Chapter 3, is the one I keep returning to. The Janitor is doing what a brain does overnight: taking the day's scattered, noisy observations and consolidating them into compact, connected structure. By morning, you do not remember every individual moment — you remember what they meant, and how they fit. That is what a good index becomes.

The honest part

A doctrine that only tells you where it wins is marketing, not engineering. So here is where this approach is not the answer, and where it can bite. This is the reference point the rest of the book returns to whenever the question of limits comes up.

Hybrid, not holy war

✓ When RAG is still right

Low-latency, single-document, well-formed questions — "find me the clause about X." This is not "RAG is dead." The honest architecture is hybrid: RAG as the broad net for fast candidate-fetching, the wiki-graph as the relationship map. Raw data, vector index, and graph coexist; the graph is just where understanding lives. (See the "When RAG is enough" list in Chapter 1.)

⚠ The open risk: hallucinated consolidation

A Janitor compacting under a loose North Star can merge two genuinely distinct ideas because they happened to be old and adjacent. This is a real failure mode, not a hypothetical — it is the question any honest practitioner asks first.

🛡 The mitigations

The ownership properties from Chapter 4 — chronological legibility, a human-auditable Git diff, instant revert — plus a periodic lint pass. Karpathy's version health-checks the wiki for "contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links."¹⁰

Say the principle plainly, because it is the one that keeps this approach trustworthy: self-maintaining does not mean unsupervised. It means the supervision is cheap and legible. You are not watching every consolidation; you are reviewing a diff when something looks off, and running a lint pass before you trust the map unattended. Pretending the Janitor is risk-free would be the dishonest move — and it would be the move that gets someone burned.
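To make "cheap and legible" concrete, here is a small sketch of one slice of a lint pass. It assumes the wiki is a dictionary of page title to markdown body with `[[Page]]` links, which is my simplification. It covers only the orphan check from the quoted list, plus a dangling-link check that I have added; finding contradictions and stale claims needs a model in the loop and is not attempted here.

```python
import re

LINK = re.compile(r"\[\[([^\]]+)\]\]")


def find_orphans(pages, roots=frozenset()):
    """Pages that no other page links to, ignoring declared entry points."""
    linked = set()
    for name, body in pages.items():
        linked.update(t for t in LINK.findall(body) if t != name)
    return set(pages) - linked - set(roots)


def find_dangling(pages):
    """Link targets that have no page of their own."""
    targets = {t for body in pages.values() for t in LINK.findall(body)}
    return targets - set(pages)


# Page titles and bodies are made up for illustration.
wiki = {
    "Index": "See [[Service A]] and [[Deploy Policy]].",
    "Service A": "Governed by [[Deploy Policy]].",
    "Deploy Policy": "Applies to [[Service A]]; see [[Runbook]].",
    "Old Notes": "Nothing links here.",
}
orphans = find_orphans(wiki, roots={"Index"})
dangling = find_dangling(wiki)
```

Here `orphans` comes back as the one page nothing links to, and `dangling` as the one link with no page behind it. Both are short lists a human can scan before trusting the map.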

This ran on my inbox. But nothing about the loop is personal. Claims, edges, a builder, a janitor, a North Star — point that engine at a team's codebase, a tender response, or a pile of research papers and the same machine runs. Part III is three such variants.

07

Part III · The Pattern Everywhere

The Team Org Brain

The same loop, pointed at a team's collective knowledge. The Janitor becomes the thing every engineering team swears it will do and never does.

Every engineering team has a lesson it re-learns every sprint. The auth-token refresh that is more complicated than it looks. The service with the race condition nobody should build on. The rule that you do not deploy at four o'clock on a Friday. Each one gets rediscovered the hard way by whoever did not happen to be in the room the last time it bit.

The memo that would have prevented it does not exist — not because nobody believes in writing things down, but because nobody compacts the learnings into anything durable. The fix is a Janitor that does it for them. This is the first of three variants, and the rule for all three is the same: it is the same engine from Chapter 3, pointed at a new corpus. No new framework — a new domain.

The four components, mapped to a team

You already know the parts, so this is quick (the mechanism lives in Chapter 3; here we only re-aim it):

Ingestion

Pull requests, tickets, incident post-mortems, and design docs, ingested one at a time into claims and edges.

Janitor

Compacts repeated learnings into edges and canon once a page crosses the threshold.

North Star

"Maintain a faithful, current model of how this system actually behaves and why; favour recent decisions over old."

Chronological stacking

Decisions appended in order, so superseded ones float up and get folded.

The Janitor is an automated Learning Extraction Ritual

We have written before about a discipline we call the Learning Extraction Ritual: at the end of every piece of work, capture three things — the sticking points, the surprises, and the never-agains — and feed them into the team's canon. It is a genuinely good practice. It also has one fatal weakness: humans do not do it consistently. Teams embrace it intellectually, add it to the checklist, and weeks later it has quietly fallen away — precisely when they are busiest, which is precisely when the learnings are most valuable.

The doctrine closes that gap. The Janitor runs the ritual automatically. It reads the conversation history, the PR threads, the incident notes, and extracts the invisible learnings into edges — so the ritual sticks without anyone having to find the discipline to perform it. The machine does the part humans reliably skip.

The durable asset is the map, not the memo. A memo nobody compacts is just entropy with good intentions.

Why chunks fail here, specifically

Team knowledge is dependency-shaped, the way Chapter 1 described — a decision points to its rationale, which points to the constraint that forced it, which points to the outcome that validated or killed it. A chunk store loses the chain.³ Ask it about the decision and it returns the decision, or the rationale, but not the relationship between them. The edges are exactly what make a decision navigable:

[[Move to Aurora]]
   --because--> [[Postgres write-throughput limit]]
   --validated-by--> [[Q3 load test]]

And because the artefact is markdown in Git, a Janitor consolidation is a reviewable change — a pull request a human approves — not a silent mutation of a black-box store. That review gate is the safety the single-user inbox of Part II never needed, and it is what makes the pattern work for a team rather than a person. The map is the team's Org Brain: owned, audited, and portable across whatever tools each engineer happens to use. (Hybrid and the consolidation risk are covered in Chapter 6; nothing changes here except who clicks approve.)
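Because the edges are plain text, they are also trivially machine-readable. The sketch below parses the arrow notation shown above into triples and follows one edge type. It assumes one subject line followed by indented `--relation-->` lines; the function names are hypothetical.

```python
import re

SUBJECT = re.compile(r"^\[\[(.+?)\]\]\s*$")
EDGE = re.compile(r"^\s+--([\w-]+)-->\s*\[\[(.+?)\]\]\s*$")


def parse_edges(text):
    """Turn the indented arrow notation into (subject, relation, target) triples."""
    edges, subject = [], None
    for line in text.splitlines():
        head = SUBJECT.match(line)
        link = EDGE.match(line)
        if head:
            subject = head.group(1)
        elif link and subject is not None:
            edges.append((subject, link.group(1), link.group(2)))
    return edges


def follow(edges, subject, relation):
    """Targets reachable from a page along one edge type."""
    return [t for s, r, t in edges if s == subject and r == relation]


NOTE = """\
[[Move to Aurora]]
   --because--> [[Postgres write-throughput limit]]
   --validated-by--> [[Q3 load test]]
"""
edges = parse_edges(NOTE)
why = follow(edges, "Move to Aurora", "because")
```

Asking "why did we move?" becomes a lookup along the `because` edge rather than a similarity search that may or may not surface the rationale.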

Mini-case: the Friday deploy

An incident post-mortem lands and becomes three claims on the relevant service page: "deployed at 4pm Friday," "cache cold on first request," "alert fired 40 minutes late." A quarter later, a second Friday incident adds two more claims. The page crosses the threshold.

The Janitor pass folds them into a [[Deploy Policy]] edge — [[Service A]] —avoid-deploy→ Friday-afternoon — linked to both incidents as evidence. A reviewer approves the PR. The next engineer who asks "is a Friday deploy safe for Service A?" gets the answer from one edge. The team stopped re-learning it.

A team's stakes are velocity — not re-paying for lessons it already bought. Raise the stakes from internal velocity to external accountability — a tender you can lose, an auditor you must satisfy — and the same loop earns its keep in a different currency. That is the next chapter.

08

Part III · The Pattern Everywhere

Compliance and Proposals

The same loop on the highest-stakes relationship-shaped question there is: "did we actually answer the clause?" Here the payoff is not speed. It is traceability.

Your retrieval returns three paragraphs that mention safety. The evaluator's question was narrower and unforgiving: did the bid address clause 3.2.1(b) — AS/NZS 4801 compliance — with evidence? Mentioning safety is not answering the clause. And that gap, the distance between "retrieved relevant text" and "proved the specific requirement is met," is where tenders are lost and audits go sideways.

This is the second variant, and it is the one where the doctrine's logic is at its most literal. Requirements and evidence are claims and edges. We have proved this in production before; the self-maintaining loop is what makes it durable.

Why chunks fail here, specifically

Chapter 1 made the general case against chunking. Compliance sharpens every edge of it:

Requirements span clauses. A single requirement can run 800 tokens across nested sub-clauses; fixed-size chunks fragment it, leaving half a requirement in one chunk and half in another.³
Keyword mismatch. The tender says "WHS"; your proposal says "Work Health & Safety framework." Similarity catches some of that and misses the regulatory synonyms that matter most.
No requirement-to-evidence mapping. You retrieve similar text but cannot prove it satisfies clause 3.2.1(b). And with no mapping, there is no way to produce a compliance matrix — "127 requirements, 119 addressed, 8 gaps needing input."

The four components, mapped to requirement–evidence

The same engine, re-aimed (mechanism: Chapter 3). Each requirement is parsed as a claim with a stable identifier — RFP_SEC3.2_REQ007. Each proposal capability is evidence. The edges are typed requirement-to-evidence links, forming a bipartite graph: [[REQ007 AS/NZS 4801]] —satisfied-by→ [[Cert: OHS Mgmt System]]. This is the claims-and-edges doctrine in its most auditable possible form. The Janitor keeps the graph current as capabilities and evidence change — superseded certificates faded rather than stranded, unmet requirements surfaced as edgeless nodes. This generalises the compliance-first shaping from our Intelligent RFP work; the loop is the new part.
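A rough sketch of how the compliance matrix falls out of the bipartite graph. It assumes edges are stored as `(requirement, relation, evidence)` triples, which is my simplification. `RFP_SEC3.2_REQ007` and the certificate name come from the text; the other requirement ids are invented for the example.

```python
def compliance_matrix(requirements, edges):
    """Split requirement ids into addressed and gaps from satisfied-by edges."""
    evidence = {}
    for req, relation, target in edges:
        if relation == "satisfied-by":
            evidence.setdefault(req, []).append(target)
    addressed = [r for r in requirements if r in evidence]
    gaps = [r for r in requirements if r not in evidence]  # edgeless nodes
    return {"evidence": evidence, "addressed": addressed, "gaps": gaps}


def summary(requirements, matrix):
    """One-line headline for the matrix."""
    return (f"{len(requirements)} requirements, "
            f"{len(matrix['addressed'])} addressed, "
            f"{len(matrix['gaps'])} gaps needing input")


# REQ007 is from the text; the other two ids are made up for illustration.
requirements = ["RFP_SEC3.2_REQ007", "RFP_SEC3.2_REQ008", "RFP_SEC4.1_REQ012"]
edges = [("RFP_SEC3.2_REQ007", "satisfied-by", "Cert: OHS Mgmt System")]

matrix = compliance_matrix(requirements, edges)
line = summary(requirements, matrix)
```

The gap list is simply every requirement with no outgoing edge. Nothing has to be searched for; the absence of an edge is the finding.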

The self-maintaining angle is the new contribution

Our prior art proved a static graph beats chunks. The doctrine adds the loop, and the loop is what matters in this domain: tenders recur, certificates expire, capabilities evolve. The Janitor keeps the requirement-evidence graph live, so the next bid starts from a current map rather than a fresh re-scrape of last year's documents. A recertification event arrives, and the Janitor invalidates the old evidence edge, links the new certificate, and leaves the requirement satisfied — without a human re-mapping anything.

Traceability is the payoff — not speed

In Part II the headline was a fast lookup. Here the headline is different: every answer traces to a source evidence node, so an auditable compliance matrix falls out for free.² And the numbers-out discipline from Chapter 6 matters doubly in this domain. Keep the exact figures — prices, dates, certificate numbers — at the source; let the graph hold the relationships and route to the authoritative document for the digit. A stale price baked into a graph is a lost tender. A routed price is correct by construction.

Mini-case: the recertification

Requirement REQ007 — "demonstrate AS/NZS 4801 compliance" — is ingested as a claim with its sub-clauses and a stable ID. A current OHS management certificate arrives; the Janitor draws REQ007 —satisfied-by→ [[Cert 2025]].

A 2026 certificate lands at recert time. The Janitor fades the 2025 edge as superseded-by, links [[Cert 2026]], and the requirement stays green with its audit trail intact. The next tender's compliance matrix renders REQ007 as satisfied, with a one-click trace to the live certificate — and nobody re-mapped a thing.
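The recertification step can be sketched in a few lines. This assumes a hypothetical `EvidenceEdge` record with a `superseded_by` field, and it assumes the decision to supersede has already been made; in the real loop the Janitor makes that call against the North Star.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceEdge:
    requirement: str
    evidence: str
    superseded_by: Optional[str] = None  # set when newer evidence arrives

    @property
    def live(self) -> bool:
        return self.superseded_by is None


def recertify(edges, requirement, new_evidence):
    """Fade the live edges for a requirement and link the new evidence."""
    for edge in edges:
        if edge.requirement == requirement and edge.live:
            edge.superseded_by = new_evidence
    edges.append(EvidenceEdge(requirement, new_evidence))


def status(edges, requirement):
    """A requirement is satisfied while at least one live edge points at evidence."""
    live = [e.evidence for e in edges if e.requirement == requirement and e.live]
    return ("satisfied", live) if live else ("gap", [])


edges = [EvidenceEdge("REQ007", "Cert 2025")]
recertify(edges, "REQ007", "Cert 2026")
result = status(edges, "REQ007")
```

The old edge is kept and marked rather than deleted, which is what preserves the audit trail: the requirement stays satisfied, and the history of what satisfied it is still readable.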

Compliance graphs encode what must be true — prescriptive relationships, externally judged. Research is messier: sources disagree, supersede, and contradict each other. Point the same loop at a contested corpus and the Janitor's job changes from compaction to reconciliation. That is the last variant.

09

Part III · The Pattern Everywhere

The Research Desk

The hardest case for the doctrine: a corpus that actively disagrees with itself. Here the Janitor's job is not compaction — it is reconciliation, and contradiction becomes a first-class edge.

Two hundred open tabs and forty papers that all say almost the same thing. Three report a key metric at different values. Two have since been quietly superseded. One contradicts the rest and nobody noticed. Ask your RAG about it and it cheerfully retrieves whichever chunk is most similar to your query — and silently drops the disagreement on the floor.³

This is the third and final variant, and it stretches the doctrine to its limit: a corpus that resists having a single coherent worldview. The same engine still runs, but the Janitor's character changes.

The four components, mapped to a research corpus

The loop is the Chapter 3 loop, re-aimed once more. Ingestion takes sources and papers one at a time and writes claims (findings) and edges (agrees-with, contradicts, supersedes, depends-on). Chronological stacking appends findings by date, oldest on top, so a 2023 result sits above a 2026 one and supersession is legible at a glance. The North Star shifts to fit the domain: "maintain a faithful map of what is known, contested, and superseded; favour newer evidence, but never silently delete a contradiction."

The Janitor's job changes: reconciliation, not just compaction

In the inbox of Part II, the Janitor mostly compacted a single coherent worldview. A research corpus is adversarial, so the Janitor has to do something harder: reconcile. It clusters a contested finding, attaches a supersedes edge to the newer result, and keeps the dissent visible rather than smoothing it away.

This is exactly the "lint" pass Karpathy describes — scanning for "contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links."¹⁰ In the earlier chapters the lint was a safety net (Chapter 6). Here it is the main event. And it has academic pedigree: the reflection pass in "Generative Agents" synthesised higher-order structure from scattered observations¹³ — the research Janitor synthesises the shape of the evidence from scattered findings.

Contradiction-as-edge is the differentiator

A chunk store buries disagreement. Ask it a question and it returns the single most-similar passage, dropping the three that disagree — which means it can be confidently, invisibly wrong.² The graph does the opposite. It makes disagreement a node you can navigate: a [[Contested: Metric M]] cluster with edges to each side and a supersedes link to the current best estimate. You retrieve the state of the debate, not a lucky chunk.

The numbers-out rule from Chapter 6 applies here too, with a research twist. Keep claims directional in the graph; route to the paper for the exact statistic and its confidence interval. A graph that asserts "M = 0.42" is brittle and quietly authoritative about something genuinely uncertain. A graph that says "M is contested; current best estimate per the 2026 study — see source" is both more honest and more useful.

Mini-case: the contested metric

Three papers report metric M at 0.31, 0.42, and 0.55. Each is ingested as a dated finding. On a Janitor pass, they are clustered into [[Contested: Metric M]], linked to one another. When a rigorous 2026 study arrives, the Janitor draws a supersedes edge from it to the two older results — without deleting them.

Now the query "what is M?" returns: "contested historically; current best estimate around 0.5 per the 2026 study; the older 0.31 result is superseded — see sources." Not a single similar-looking chunk presented as settled fact.
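Here is a sketch of that mini-case as data. It assumes each finding carries only a directional reading and a source pointer, in keeping with the numbers-out rule, and it assumes the newest finding supersedes the older ones. That last assumption is a stand-in: in the real loop the Janitor judges rigour, not just date. Source ids and the two earlier years are invented.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Finding:
    source: str   # pointer to the paper; the exact statistic stays there
    year: int
    reading: str  # directional label only, e.g. "low" / "mid" / "high"


def reconcile(findings):
    """Typed edges for one metric: contradicts between disagreeing findings,
    supersedes from the newest finding to every older one."""
    edges = []
    for a, b in combinations(findings, 2):
        if a.reading != b.reading:
            edges.append((a.source, "contradicts", b.source))
    newest = max(findings, key=lambda f: f.year)
    for f in findings:
        if f.year < newest.year:
            edges.append((newest.source, "supersedes", f.source))
    return edges


def state_of_debate(findings, edges):
    """Summarise the cluster without deleting any finding."""
    superseded = {dst for _, kind, dst in edges if kind == "supersedes"}
    return {
        "contested": any(kind == "contradicts" for _, kind, _ in edges),
        "current": [f.source for f in findings if f.source not in superseded],
        "superseded": sorted(superseded),
    }


# Source ids and the two earlier years are made up for illustration.
cluster = [
    Finding("paper-a", 2021, "low"),
    Finding("paper-b", 2023, "mid"),
    Finding("study-2026", 2026, "high"),
]
edges = reconcile(cluster)
debate = state_of_debate(cluster, edges)
```

The answer to "what is M?" is assembled from `debate`: contested, one current source, two superseded ones, each still reachable for the exact statistic.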

Three domains — a personal worldview, a team's institutional memory, a contested literature — and one engine underneath all of them. The only question left is the one you can answer this fortnight: does it beat your RAG on a question that actually matters to you? That is where we finish.

10

Part III · The Pattern Everywhere

Start Here

Don't take my word for it. The smallest honest test that proves — or disproves — the whole doctrine on your own corpus, in two weeks.

Pick the one dependency-shaped question your RAG keeps fumbling — the one where it returns three chunks that each mention a piece and none of them answer the relationship.³ You felt that question in Chapter 1, and you have probably had a specific one in mind ever since. Good. We are going to settle it.

The two-week test

This is a sequence, not a calendar. Do not turn it into a four-week project plan; the whole point is that it is small.

1 · Pick one failing question

Real and dependency-shaped — "how does policy X interact with exception Y given contract Z?" Not a single-paragraph lookup. This is the question type RAG structurally misses (Chapter 1).

2 · Hand-build the five or six pages it needs

Write them as claims and edges, with links between them. This is pre-processing, done manually, once (Chapters 2–3).

3 · Ask both

Put the question to your existing RAG and to the model with those pages in front of it. Compare answer quality, token cost, and the number of search round-trips.

4 · If the gap is real, automate the loop

Wire an Ingestion Agent over a slice of the corpus, add a Janitor at a claims threshold, give both one North Star, and watch retrieval collapse to a single parallel page-pull (Chapter 3).
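For a sense of what "a single parallel page-pull" looks like in code, here is a minimal sketch. It assumes the agent has already decided which page titles it needs, and it uses an in-memory dictionary as a stand-in for markdown files on disk. The page bodies are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for markdown files on disk; bodies are made up.
WIKI = {
    "Policy X": "Interacts with [[Exception Y]].",
    "Exception Y": "Applies under [[Contract Z]].",
    "Contract Z": "Signed terms; see source for figures.",
}


def read_page(title):
    """Stand-in for reading one markdown file."""
    return WIKI[title]


def pull_pages(titles):
    """Fetch every requested page in one parallel batch, preserving order."""
    with ThreadPoolExecutor(max_workers=max(1, len(titles))) as pool:
        bodies = list(pool.map(read_page, titles))
    return dict(zip(titles, bodies))


context = pull_pages(["Policy X", "Exception Y", "Contract Z"])
```

There is no second round of searching in this shape. The pages arrive together and the model reads them once.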

The minimal loop — a starter checklist

When you are ready to automate, you are assembling the same four components from Chapter 3, plus two disciplines from Chapter 6. Keep it this short:

Source intake: one source at a time; read-then-write (edit the map, don't append blindly).
Claim/edge schema: atomic dated claims; typed edges (related, supersedes, [[Entity]]).
Janitor trigger: a page-size or roughly twelve-claim threshold; jobs are combine, fade, convert-to-edge, spin-off.
North Star sentence: one directive encoding purpose plus a recency preference.
Chronological append: date-ordered, oldest on top — decay for free.
Keep numbers out of the graph; route to source for hard figures. Add a periodic lint pass before you trust it unattended.¹⁰
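The checklist can be read as a small data model. The sketch below covers three of its items: atomic dated claims, date-ordered append with the oldest on top, and a claim-count trigger. The class and field names are hypothetical, and the threshold of twelve is the "roughly twelve" from the list, not a tuned value. The Janitor's actual jobs (combine, fade, convert-to-edge, spin-off) are model work and are not shown.

```python
from dataclasses import dataclass, field
from datetime import date

CLAIM_THRESHOLD = 12  # "roughly twelve" in the checklist; tune for your corpus


@dataclass(frozen=True)
class Claim:
    when: date
    text: str
    source: str


@dataclass
class Page:
    title: str
    claims: list = field(default_factory=list)

    def append(self, claim):
        """Keep claims date-ordered with the oldest on top."""
        self.claims.append(claim)
        self.claims.sort(key=lambda c: c.when)

    def needs_janitor(self):
        """True once the page reaches the claim threshold."""
        return len(self.claims) >= CLAIM_THRESHOLD

    def fold_candidates(self, n=3):
        """The oldest claims, which sit on top and are first in line to be folded."""
        return self.claims[:n]


page = Page("Service A")
for day in (9, 3, 12, 1, 7, 5, 11, 2, 8, 4, 10, 6):  # claims arrive out of order
    page.append(Claim(date(2026, 1, day), "observation", f"pr:{day}"))
```

Once `needs_janitor()` turns true, `fold_candidates()` hands the Janitor the oldest material first, which is the "decay for free" behaviour in miniature.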

What success feels like

If it works the way it worked for me, the change you notice is not "better search results." It is that the system stops re-introducing itself. It starts the next question already knowing your world — because the knowing was compiled in, ahead of time, and kept current by a janitor that never sleeps.¹⁰ The well-read stranger from Chapter 1 finally becomes a colleague.

For a team, this is how you build an Org Brain that compounds — not through model access, and not through a fine-tuning run you cannot inspect, but through a self-maintaining map your team owns, audits, and carries across tools. You rent the model. You own the map. Build the thing that compounds.

Your move

Tell me the one dependency-shaped question your RAG keeps fumbling, and I'll show you where a wiki-graph would change the answer.

Or save this for the next time an "AI that knows our domain" turns out to be a well-read stranger.

RAG searches your data. A self-cleaning wiki-graph already knows it. The difference is just when you decided to think.

REF

Sources & Evidence

References & Sources

The evidence base behind every claim — primary research, industry analysis, and technical specifications

Research Methodology

This ebook draws on primary research from standards bodies, independent research firms, enterprise technology vendors, and consulting firms. Statistics cited throughout have been cross-referenced against primary sources.

Frameworks and interpretive analysis developed by Scott Farrell / LeverageAI are listed separately below — these represent the practitioner lens through which external research is interpreted, and are not cited inline to avoid self-promotional appearance.

Primary Research & Standards Bodies

Edge et al., Microsoft Research (arXiv:2404.16130) — From Local to Global: A Graph RAG Approach to Query-Focused Summarization [1]

Baseline RAG fails on global questions across a corpus

https://arxiv.org/abs/2404.16130

Microsoft Research Blog — GraphRAG: Unlocking LLM discovery on narrative private data [2]

Baseline RAG struggles to connect the dots across shared attributes

https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

Microsoft Azure Architecture Center — Develop a RAG solution: Chunking phase [3]

Fixed-size chunking not recommended where semantic understanding matters

https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase

Liu et al. 2023 (arXiv:2307.03172) — Lost in the Middle: How Language Models Use Long Contexts [4]

U-shaped accuracy; degradation when relevant info is mid-context

https://arxiv.org/abs/2307.03172

Modarressi et al., ICML 2025 (arXiv:2502.05167) — NoLiMa: Long-Context Evaluation Beyond Literal Matching [5]

11 models below 50% at 32K; GPT-4o 99.3 to 69.7

https://arxiv.org/abs/2502.05167

Chroma Research 2025 — Context Rot: How Increasing Input Tokens Impacts LLM Performance [6]

All 18 frontier models degrade as input length grows

https://www.trychroma.com/research/context-rot

Andrej Karpathy (GitHub Gist, April 2026) — LLM Wiki [10]

LLM incrementally builds a persistent wiki; knowledge compiled once, kept current, not re-derived per query

https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

Guo et al. (arXiv:2410.05779, EMNLP 2025) — LightRAG: Simple and Fast Retrieval-Augmented Generation [12]

Dual-level retrieval; considerable accuracy and efficiency gains

https://arxiv.org/abs/2410.05779

Park et al., UIST 2023 (arXiv:2304.03442) — Generative Agents: Interactive Simulacra of Human Behavior [13]

Reflection synthesises higher-order memories; removing it degrades long-horizon coherence

https://arxiv.org/abs/2304.03442

Anthropic Engineering — Effective Context Engineering for AI Agents [14]

Agentic memory / structured note-taking; runtime exploration slower than pre-computed retrieval

https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

Anthropic — Managing context on the Claude Developer Platform [15]

File-based memory persists knowledge across sessions

https://www.anthropic.com/news/context-management

Industry Analysis & Vendor Research

Morphik — RAG in 2025: 7 Proven Strategies to Deploy RAG at Scale [7]

Advanced RAG ~2.2x token cost and added latency vs standard; Top-50 retrieval uses 13.5k tokens and 32.8s latency

https://www.morphik.ai/blog/retrieval-augmented-generation-strategies

jduncan.io (2025) — MCP Context Bloat Analysis [9]

50,000+ token baseline before agent interaction in multi-server MCP

https://jduncan.io/blog/2025-11-07-mcp-context-bloat/

Primary Research & Standards Bodies

Yang et al. (arXiv:2602.05728) — CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering [8]

Multi-hop RAG alternates retrieval and reasoning at each step, causing repeated LLM calls and high token consumption; CompactRAG instead reads the corpus once into an atomic QA knowledge base

https://arxiv.org/abs/2602.05728

Bao & Shi (arXiv:2603.16415) — IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time [11]

Shifts cross-document reasoning from online inference to offline indexing; +4.6 F1 over Naive RAG with single-pass retrieval and a single LLM call, beating graph baselines including HippoRAG

https://arxiv.org/abs/2603.16415

LeverageAI / Scott Farrell — Practitioner Frameworks

The interpretive frameworks, architectural patterns, and practitioner analysis in this ebook were developed through enterprise AI transformation consulting. The articles below are the underlying thinking behind those frameworks. They are listed here for transparency and further exploration — not cited inline, as this is the author's own analytical voice.

Scott Farrell, LeverageAI — Why Code-First Agents Beat MCP

Context-bloat lineage; code beats MCP

https://leverageai.com.au/why-code-first-agents-beat-mcp-by-98-7/

Scott Farrell, LeverageAI — Context Engineering / The Cognition Supply Chain

Routing index and information scent

https://leverageai.com.au/context-engineering-why-building-ai-agents-feels-like-programming-on-a-vic-20-again/

Scott Farrell, LeverageAI — Worldview Recursive Compression

Frameworks as source code; outputs are regenerable artefacts

https://leverageai.com.au/worldview-recursive-compression-how-to-better-encompass-your-worldview-with-ai/

Scott Farrell, LeverageAI — The Intelligent RFP

Compliance-first, requirement-evidence graph beats generic chunking

https://leverageai.com.au/the-intelligent-rfp-proposals-that-show-their-work/

Scott Farrell, LeverageAI — The Cognition Supply Chain

Architecture compounds; kernel flywheel

https://leverageai.com.au/the-cognition-supply-chain-from-search-to-compounding-agentic-cognition/

Scott Farrell, LeverageAI — Markdown OS

An agent is a folder of plain-text files; inspectable, diffable, portable

https://leverageai.com.au/markdown-as-an-operating-system/

Scott Farrell, LeverageAI — A Blueprint for Future Software Teams (Org Brain)

Org Brain: learning lives in the repo as soft weights, not model weights

https://leverageai.com.au/a-blueprint-for-future-software-teams/

About This Reference List

Compiled June 2026. All URLs verified at time of compilation. Regulatory documents and standards specifications are subject to revision — check primary sources for the most current versions.

Some links to academic papers and vendor research may require free registration. Government and standards body publications are freely accessible.
