The Index Is the Data: How a Self-Cleaning Wiki-Graph Out-Thinks RAG

Stop searching your data. Pre-digest it. Why the win isn’t a better search at query time — it’s moving the thinking to a self-maintaining map that builds itself before the question.

Scott Farrell · LeverageAI

TL;DR

RAG does its thinking at the wrong time. Searching raw chunks at query time can’t represent the relationships between things — and stuffing more chunks into context makes answers worse, not better.
Move the intelligence off-cycle. Pre-process unstructured sources into a self-maintaining wiki-graph of claims and edges. Two agents do it: an Ingestion Agent that builds, and a Janitor Agent that compacts claims into edges so the index gets smaller as it gets smarter.
Then retrieval is a lookup, not a crawl. In a multi-year build, answering a question meant pulling ~6 wiki pages in one parallel call — accurate enough that the raw email and RAG tools went mostly unused. The index became the data.

Your RAG system re-reads your entire world every time you ask it a question. Ask it something on Monday, it searches. Ask the same shape of question on Tuesday, it searches again — from scratch, as if it had never seen your corpus before. It is a very well-read stranger who re-introduces themselves at every meeting.

A system that actually knew your world wouldn’t do that. It would pull half a dozen pages it already understood, see how they connect, and answer. That’s not a hypothetical. I ran exactly that system for a couple of years over my own Gmail, and on most questions it never needed to touch the raw email at all.

This article is about why that works — and why it’s not a better retriever, a bigger model, or a fancier vector database. It’s a different idea about when the thinking happens.

Part 1 — The re-reading tax

Retrieval-Augmented Generation is the default way teams put an LLM on top of their own documents: chunk the corpus, embed the chunks, and at query time fetch the top-k most similar pieces and hand them to the model. For “find me the paragraph about X,” it’s genuinely good. The trouble starts the moment your questions stop being about a single paragraph and start being about how things relate.

Chunking is the first crack. Fixed-size splits cut straight through the logic of a document. Microsoft’s own Azure architecture guidance is blunt about it: fixed-size chunking “isn’t recommended for text that requires semantic understanding and precise context,” because “relevant context that exists across multiple chunks might not be captured.”¹ Each chunk becomes an island, divorced from its place in the larger argument.
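To see the island effect concretely, here is a minimal sketch of fixed-size chunking. It assumes a naive character-count splitter; the `fixed_size_chunks` helper and the refund-policy text are hypothetical illustrations, not taken from any particular RAG library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical policy text: the rule and its exception sit side by side.
doc = "Refunds are allowed within 30 days. Exception: clearance items are final sale."

chunks = fixed_size_chunks(doc, 40)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

With a 40-character window the word "Exception" is cut in half, so the rule lands in one chunk and its exception in the next. A retriever that returns only the first chunk hands the model a rule with no exception attached.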

The second crack is deeper and structural. Vector similarity matches text that looks like your query. It has no representation for “X is defined in document A, constrained by document B, and subject to an exception in document C.” That’s why Microsoft built GraphRAG in the first place: their research team states plainly that baseline RAG “fails on global questions directed at an entire text corpus,” because that “is inherently a query-focused summarization task, rather than an explicit retrieval task.”² Baseline RAG, in their words, “struggles to connect the dots… when answering a question requires traversing disparate pieces of information through their shared attributes.”³

“Graph-type systems are much better when the relationship matters more than the data. When the raw data itself is what you need, RAG can perform well. But generally, anything that gives the AI a map — a framework, a graph — helps it understand where to find information, and how, all at once.”
— the working principle behind the build

The instinct, when retrieval misses, is to retrieve more — widen top-k, stuff more context in. This is exactly backwards. Long context degrades in well-documented ways. The classic “Lost in the Middle” study showed model accuracy is highest when the relevant information sits at the very start or end of the context and sags when it’s buried in the middle.⁴ It has only gotten more measurable since: the NoLiMa benchmark (ICML 2025) found that at 32K tokens, 11 leading models dropped below 50% of their own short-context baseline — even GPT-4o fell from an almost-perfect 99.3% to 69.7%.⁵ Chroma’s testing across 18 frontier models found the same shape: “model performance degrades as input length increases, often in surprising and non-uniform ways.”⁶

So the multi-search RAG loop pays the tax twice. Discovery is slow and token-heavy: it takes many round-trips to assemble one answer. And the payload it assembles actively dilutes the model’s attention. As one of my own notes put it: give an agent search tools, even good ones like RAG, and in many situations discovery is slow, token-bloated, and returns material you didn’t want, and it still takes many searches just to understand the relationships.

(If you’ve also hit context bloat from the tooling side — a stack of MCP servers eating 50,000+ tokens before the agent thinks a single thought¹² — that’s a related but separate problem. We’ve written about why code-first agents beat MCP on context and about routing indexes and “information scent” elsewhere. This piece is about what happens upstream of retrieval entirely.)

Part 2 — The shift: the index is the data

Here’s the reframe. The reason RAG re-derives understanding on every question is that the understanding was never built. The work of seeing how your domain connects is deferred to query time and done badly, under latency pressure, by similarity math that can’t see relationships.

So move the work. Do it before the question, off-cycle, once per source — and bake the result into a structure that holds relationships natively. You aren’t just indexing raw data; you’re pre-digesting it into a conceptual worldview. The intelligence gets injected before the user ever asks anything.

Pre-processing is pre-thinking. You turn unstructured data into connections ahead of time, and you bake that into the edges of a map. The index stops being a pointer to the data. The index becomes the data.

This is the same idea Andrej Karpathy named publicly in April 2026 with his “LLM Wiki” pattern. His framing: “the LLM incrementally builds and maintains a persistent wiki — a structured, interlinked collection of markdown files,” so that, in contrast to retrieving from raw documents at query time, “the knowledge is compiled once and then kept current, not re-derived on every query.”⁷ I’d been running a version of this for years before there was a name for it, and the thing I’d add is this: the pre-processing step is almost as important as the wiki itself.

It’s worth being precise about the artefact, because “wiki” undersells it. A wiki page is prose. What makes this work is that the page is a set of claims and edges — atomic statements and the typed links between them. The edges are where the relationships live. And once relationships live in the structure, the model navigates them instead of re-inferring them from chunk similarity every time.
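As a rough illustration of what "claims and edges" can look like as a data structure, here is a minimal sketch. The class names, the `source_id` field, the relation labels, and the markdown layout are all my assumptions for the example; the only detail taken from the build itself is the `[[Page]]` link style.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Claim:
    """One atomic statement, with a pointer back to the raw source."""
    text: str
    recorded: date
    source_id: str  # hypothetical identifier, e.g. an email message id


@dataclass(frozen=True)
class Edge:
    """A typed link from this page to another page."""
    relation: str   # hypothetical labels such as "works_on" or "blocked_by"
    target: str     # title of the linked page


@dataclass
class Page:
    title: str
    claims: list[Claim] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the page as plain markdown with [[wiki-links]] for edges."""
        lines = [f"# {self.title}", "", "## Claims"]
        lines += [f"- {c.recorded.isoformat()}: {c.text} (source: {c.source_id})"
                  for c in self.claims]
        lines += ["", "## Edges"]
        lines += [f"- {e.relation} [[{e.target}]]" for e in self.edges]
        return "\n".join(lines)


page = Page(
    title="Alice",
    claims=[Claim("Prefers written briefs over calls", date(2024, 3, 5), "msg-001")],
    edges=[Edge("works_on", "Project X")],
)
print(page.to_markdown())
```

The point of the split is that an edge carries a relation type and a target, so the model can follow it directly instead of inferring the connection from prose.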

If you’ve read our work on Worldview Recursive Compression, this is the same principle one level down: frameworks function as source code, and generated outputs are regenerable artefacts derived from that source. The wiki-graph is the compiled source-code of your corpus — not a cache of it. The raw emails are the regenerable raw material; the map is the asset.

Why a graph, concretely

This isn’t only a personal-notes trick. Microsoft’s GraphRAG showed “substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers” on sense-making questions over million-token corpora.² Hybrid graph-plus-vector systems like LightRAG report “considerable improvements in retrieval accuracy and efficiency” by integrating graph structure into indexing and using dual-level retrieval.¹⁰ We saw the same thing in our own production work on AI-assisted RFPs: structuring requirements and evidence as a graph — rather than chunking the source documents — was the difference between near-miss retrieval and traceable, relationship-correct answers. A self-maintaining wiki-graph generalises that insight from one workflow to your whole domain.

RAG vs. the wiki-graph

Dimension	Standard RAG	Wiki-graph
Primary mechanism	Vector similarity on raw text chunks	Parallel map lookups over semantic relationships
Token efficiency	Bloated; returns noise alongside signal	Lean; reads pre-distilled conceptual pages
Relationship-aware	Poor; needs many sequential searches to link ideas	Native; connections baked into the edges
Retrieval pattern	Slow, multi-step tool-calling loops	One parallel call to a handful of precise pages
When the work happens	At query time, every time	Once, off-cycle, then kept current

Part 3 — The dual-agent engine

A map that you have to maintain by hand is a liability — it goes stale the day you stop tending it. The trick is to make the maintenance itself automatic, and to let it run on two agents with the same toolset but opposite jobs.

[ New source ]
      │
      ▼
┌──────────────────────┐     reads current wiki,
│  1. INGESTION AGENT  │ ──► writes/updates claims
│      (the builder)   │     and edges
└──────────────────────┘
      │
      │   (page crosses ~12 claims / size threshold)
      ▼
┌──────────────────────┐     combines claims,
│  2. JANITOR AGENT    │ ──► fades stale ones,
│    (the compactor)   │     converts claims → edges,
└──────────────────────┘     spins off new pages
      │
      ▼
[ Leaner, smarter wiki-graph ] ──► next source starts from here

The Ingestion Agent is the builder. It processes inputs one at a time — in my case, one email at a time — reads the current state of the wiki, and then updates or spawns claims, edges, and pages. Single-source ingestion matters: each source is small enough to reason about properly, and a single source typically touches a handful of pages, not the whole graph. (Karpathy’s version notes a single source “might touch 10–15 wiki pages.”⁷)
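The shape of that loop can be sketched in a few lines. This is a simplified stand-in, not the actual build: the wiki is reduced to a dict of page titles to claim lists, and `stub_propose` is a hypothetical placeholder for the LLM call that decides which pages a source touches.

```python
from typing import Callable

Wiki = dict[str, list[str]]   # page title -> claims on that page
Update = tuple[str, str]      # (page title, claim text)


def ingest(sources: list[str], wiki: Wiki,
           propose: Callable[[str, Wiki], list[Update]]) -> Wiki:
    """Process sources one at a time; each pass sees the wiki the last one left."""
    for source in sources:
        for title, claim in propose(source, wiki):
            wiki.setdefault(title, []).append(claim)  # update or spawn a page
    return wiki


def stub_propose(source: str, wiki: Wiki) -> list[Update]:
    """Placeholder for the LLM: file the source under every existing page it
    mentions by name, or under 'Inbox' if it mentions none."""
    hits = [title for title in wiki if title.lower() in source.lower()]
    return [(title, source) for title in hits] or [("Inbox", source)]


wiki: Wiki = {"Project X": []}
emails = ["Kickoff for Project X moved to May", "Dentist reminder for Friday"]
ingest(emails, wiki, stub_propose)
print(wiki)
```

The design choice worth noticing is that `propose` receives the current wiki on every call, so the second source is interpreted in light of whatever the first one changed.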

The Janitor Agent is the compactor. Left alone, an append-only wiki degrades into a text dump. So a second agent watches for page bloat — once a page crosses about a dozen claims — and does the unglamorous work: it combines redundant claims, fades out stale ones, spins clusters off into their own pages, and, most importantly, converts flat claims into edges. “Met with X on Tuesday” stops being a standalone claim and becomes a link to [[Project X]]. The whole point is to reduce the number of outright claims, so that more of the information is held in the edges — in the connections.

That last move is the one people miss, so it’s worth saying flatly:

The index didn’t just get smaller. It got smarter with every janitorial pass.
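A toy version of that compaction pass, under loud assumptions: the threshold value, the dict-based page shape, and the string matching are all mine. In the real build an LLM makes these judgment calls; here a plain substring match stands in for it so the mechanics are visible.

```python
CLAIM_THRESHOLD = 12  # "about a dozen" in the text; the exact trigger is an assumption


def compact(page: dict, known_titles: set[str],
            threshold: int = CLAIM_THRESHOLD) -> dict:
    """Return a compacted copy of `page` once it exceeds `threshold` claims.

    Exact-duplicate claims are merged, and any claim that names another
    known page is replaced by an edge to that page.
    """
    if len(page["claims"]) <= threshold:
        return page  # under the threshold: leave the page alone

    kept: list[str] = []
    edges: list[str] = list(page["edges"])
    seen: set[str] = set()
    for claim in page["claims"]:
        if claim in seen:
            continue  # combine redundant claims
        seen.add(claim)
        targets = [t for t in sorted(known_titles)
                   if t != page["title"] and t in claim]
        if not targets:
            kept.append(claim)  # nothing to link to: it stays a claim
            continue
        for target in targets:
            link = f"[[{target}]]"
            if link not in edges:
                edges.append(link)  # the information now lives in the edge
    return {"title": page["title"], "claims": kept, "edges": edges}


page = {
    "title": "Alice",
    "claims": [f"Met about Project X, round {i}" for i in range(6)]
              + ["Prefers morning calls"] * 3
              + [f"Misc note {i}" for i in range(4)],
    "edges": [],
}
smaller = compact(page, known_titles={"Alice", "Project X"})
print(len(page["claims"]), "->", len(smaller["claims"]), smaller["edges"])
```

Six meeting claims collapse into one edge and three duplicates into one claim, which is the "smaller and smarter" effect in miniature.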

This is not an exotic idea — it’s the oldest idea in cognition, automated. In the “Generative Agents” research, the agents that stayed coherent over long horizons were the ones that periodically reflected: clustering scattered observations into higher-order, abstract memories and storing those back in place of the raw log.⁹ Remove the reflection step and the agents degenerated. The Janitor is that reflection pass, running on your corpus. It mirrors what human memory does during sleep — transferring messy short-term logs into compact long-term relational structure.

Part 4 — Decay for free, and governance by North Star

Two design choices make the loop stable without a single line of brittle rule-code.

Chronological stacking

Claims are appended in date order, so the oldest float to the top of the page. That’s it — that’s the whole temporal-decay mechanism. There are no timestamp-delta calculations, no TTL fields, no expiry jobs. By putting older claims at the top, you’ve handed the model a timeline view: it reads top-to-bottom, sees how a thought or a project evolved, and can tell at a glance when an older claim has been superseded by a newer one below it. The Janitor uses that ordering to decide what to compress into history and what to keep live.
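A minimal sketch of how little machinery that takes, assuming claims are stored as hypothetical `(date, text)` pairs and rendered as markdown bullets:

```python
from datetime import date


def render_timeline(claims: list[tuple[date, str]]) -> str:
    """Render claims oldest-first so the page reads as a timeline."""
    ordered = sorted(claims, key=lambda c: c[0])  # stable sort keeps same-day order
    return "\n".join(f"- {d.isoformat()}: {text}" for d, text in ordered)


claims = [
    (date(2024, 6, 1), "Launch moved to September"),
    (date(2024, 2, 10), "Launch planned for June"),
]
print(render_timeline(claims))
```

There is no expiry logic anywhere in it. The ordering is the only temporal signal, and deciding that the June plan has been superseded is left to the model reading the page.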

North Star, not rules

The agents aren’t governed by a long list of prescriptive instructions. They’re governed by a single directive — a North Star. Mine was: act as my personal assistant building a consistent worldview; favour recent information over old. That one sentence does more than a page of rules, because it lets the agent exercise judgment on the edge cases a rulebook can’t enumerate: is this old claim noise to delete, or context to abstract into an edge?

Using the LLM’s latent reasoning to handle data decay — instead of writing brittle, deterministic code to calculate timestamp deltas — is exactly how we should be building agentic systems.
— the heuristic-over-rule principle

Give the Janitor the same tools as the Ingestion Agent (create pages, create edges, combine, delete) and the same North Star, and it performs genuine executive function: it autonomously decides whether an old claim is irrelevant noise to be deleted, or historical context that should be abstracted into a connection. You’re not maintaining the map. The map maintains itself, and gets sharper each pass.
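One way to wire up "same tools, same North Star, opposite jobs" is to build both system prompts from shared constants. This is a sketch of the idea only: the tool names, the job descriptions, and the prompt layout are hypothetical, and just the North Star sentence comes from my own setup.

```python
NORTH_STAR = (
    "Act as my personal assistant building a consistent worldview; "
    "favour recent information over old."
)

# Hypothetical tool names standing in for: create pages, create edges, combine, delete.
TOOLS = ("create_page", "create_edge", "combine_claims", "delete_claim")

# Only the job description differs between the two agents.
JOBS = {
    "ingestion": "Read one new source and the current wiki, then add or "
                 "update claims, edges, and pages.",
    "janitor": "Review one bloated page: combine claims, fade stale ones, "
               "convert claims into edges, spin off new pages.",
}


def build_system_prompt(role: str) -> str:
    """Assemble the prompt for `role` from the shared North Star and toolset."""
    return "\n".join([
        f"North Star: {NORTH_STAR}",
        f"Tools: {', '.join(TOOLS)}",
        f"Job: {JOBS[role]}",
    ])


for role in JOBS:
    print(build_system_prompt(role), end="\n\n")
```

Keeping the directive in one constant means a change to the North Star reaches both agents at once, so they cannot drift apart on what "consistent worldview" means.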

Part 5 — The payoff

Here’s what all of that buys you at the moment of the question. When the retrieval agent went looking, it requested about half a dozen wiki pages in one parallel call, and the result was accurate enough that it didn’t need to follow up with any email searches. It knew straight away which connections it needed. A five-search crawl collapsed into a single lookup.

And the kicker: even though I gave that agent full access to read the original email, search the email database, and run a RAG search, it basically didn’t need those tools. It answered a high percentage of questions straight from the wiki. The index had become the data. In effect, I had built a compiled semantic layer that works more like long-term memory consolidation than like a cache.
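The single parallel page-pull is simple to sketch if the wiki is a folder of markdown files. The `fetch_pages` helper, the one-file-per-title layout, and the thread pool are my assumptions for illustration; an agent framework would normally issue the batch as parallel tool calls instead.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def fetch_pages(wiki_dir: Path, titles: list[str]) -> dict[str, str]:
    """Read every requested page concurrently and return {title: markdown}."""
    def read(title: str) -> str:
        return (wiki_dir / f"{title}.md").read_text(encoding="utf-8")

    # One worker per page: the whole batch is issued at once.
    with ThreadPoolExecutor(max_workers=max(1, len(titles))) as pool:
        return dict(zip(titles, pool.map(read, titles)))


# Demo against a throwaway wiki directory with two hypothetical pages.
with tempfile.TemporaryDirectory() as tmp:
    wiki_dir = Path(tmp)
    for title in ("Alice", "Project X"):
        (wiki_dir / f"{title}.md").write_text(f"# {title}\n", encoding="utf-8")
    pages = fetch_pages(wiki_dir, ["Alice", "Project X"])

print(sorted(pages))
```

The contrast with a search loop is that there is no second round: the caller names the pages up front, and everything comes back in one batch.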

This tracks with where the broader field has landed. Anthropic, writing on context engineering, describes “structured note-taking, or agentic memory,” where “the agent regularly writes notes persisted to memory outside of the context window,” as a first-class pattern, and is explicit about the trade-off that makes the whole argument: “runtime exploration is slower than retrieving pre-computed data.”⁸ That sentence is the entire thesis in eight words. Pre-compute the understanding, and retrieval is cheap.

One deliberate exclusion: keep the numbers out

One counter-intuitive rule made retrieval sharper: I told the agents not to record actual numbers and figures in the wiki. So when I genuinely needed a hard metric, I did have to go back to the email for it — but the wiki knew exactly which source to point at. The graph optimises for worldview and relationships; precise figures are a different kind of data with a different failure mode (a stale number is a wrong answer; a stale relationship is usually still directionally true). Keep the graph for relationships, and let it route you to the source for the digits.
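In my build this rule was an instruction to the agents, not code. If you wanted a cheap guard behind it, something like the following would do as a starting point. The regex is a deliberately crude assumption (currency amounts and percentages only), and `admit_claim` and the `source_id` format are hypothetical.

```python
import re
from typing import Optional

# Crude heuristic (an assumption): a currency amount or a percentage
# counts as a hard figure. Dates and other digits are allowed through.
HARD_FIGURE = re.compile(r"[$€£]\s?\d|\d\s?%")


def has_hard_figure(text: str) -> bool:
    """True if the text contains a currency amount or a percentage."""
    return HARD_FIGURE.search(text) is not None


def admit_claim(text: str, source_id: str) -> Optional[str]:
    """Return a wiki-ready claim line, or None if it carries a hard figure."""
    if has_hard_figure(text):
        return None  # figures stay in the source; the agent should rephrase
    return f"{text} (source: {source_id})"


print(admit_claim("Vendor quote for phase two is under negotiation", "msg-042"))
print(admit_claim("Vendor quoted $48,000 for phase two", "msg-042"))
```

The admitted claim keeps its source pointer, which is what lets the wiki route you back to the right email when you do need the digits.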

Part 6 — And you own all of it

The artefact is plain markdown. That’s not an aesthetic preference; it’s the ownership story. A markdown wiki is inspectable by a human, diffable in Git, portable across model providers, and not locked inside any vector database. When the AI gets something wrong, you can open the file and see why. When the Janitor makes a bad consolidation, you can revert the commit. It behaves like soft weights — it conditions how the AI reasons about your domain, the way fine-tuning would, but you can read it, edit it, and review the last pass in a pull request.

This is the Markdown OS pattern we’ve written about — an agent is a folder of plain-text files — and the wiki-graph is the single most valuable file in that folder. Anthropic now ships a file-based memory tool on exactly this premise: a directory the model can build up and “reference previous learnings without having to keep everything in context.”¹¹ The durable asset is the map. Every vector index built on top of it is disposable and rebuildable from the text.

When RAG is still the right call (and where this can bite)

This is not “RAG is dead.” For low-latency, single-document, well-formed questions — “find me the clause about X” — vector search is faster and perfectly good. The honest architecture is hybrid: keep RAG as the broad net for fast candidate-fetching, and use the wiki-graph as the relationship map. The raw data, the vector index, and the graph can all coexist; the graph is just where the understanding lives.

And the open risk worth naming: a Janitor compacting under a loose directive can make a hallucinated consolidation — merging two genuinely distinct ideas because they happened to be old and adjacent. The mitigations are exactly the ownership properties above (chronological legibility, a human-auditable diff, instant revert) plus a periodic health-check pass — what Karpathy calls a “lint”: scan for “contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links.”⁷ Self-maintaining doesn’t mean unsupervised. It means the supervision is cheap and legible.
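Of the three lint checks, the orphan scan is the one that needs no model at all. A minimal sketch, assuming pages are held as a hypothetical `{title: markdown}` dict and edges use the `[[Page]]` link style:

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")


def find_orphans(pages: dict[str, str]) -> list[str]:
    """Return titles of pages that no other page links to."""
    linked: set[str] = set()
    for title, body in pages.items():
        # Self-links do not count as inbound links.
        linked.update(t for t in WIKI_LINK.findall(body) if t != title)
    return sorted(t for t in pages if t not in linked)


pages = {
    "Alice": "Works on [[Project X]].",
    "Project X": "Led by [[Alice]].",
    "Old Vendor": "Supplied parts for [[Project X]].",
}
print(find_orphans(pages))
```

An intentional entry-point page would also show up here, so treat the output as a review list and not a delete list. Contradictions and superseded claims still need a model, or a human, to judge.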

Start here: a two-week test

You don’t need a multi-year project to find out whether this fits your corpus. Run the smallest honest version:

Pick one question your RAG keeps missing — a real, dependency-shaped one (“how does policy X interact with exception Y given contract Z?”), not a single-paragraph lookup.
Hand-build the 5–6 pages it needs. Write them as claims and edges, with links between them. This is the pre-processing, done manually, once.
Ask both. Put the question to your existing RAG and to the model with those pages in front of it. Compare answer quality, token cost, and the number of search round-trips.
If the gap is real, automate the loop. Wire up an Ingestion Agent over a slice of the corpus, add a Janitor at a claims threshold, give both a single North Star, and watch whether retrieval collapses to one parallel page-pull.

If it works the way it worked for me, the change you’ll feel isn’t “better search results.” It’s that the system stops re-introducing itself. It starts the next question already knowing your world — because the knowing was compiled in, ahead of time, and kept current by a janitor that never sleeps. For a team, that’s how you build an org brain that compounds: not through model access or a fine-tuning run, but through a self-maintaining map your AI continuously sharpens.

RAG searches your data. A self-cleaning wiki-graph already knows it. The difference is just when you decided to think.

If you’re running into RAG’s relationship ceiling: tell me the one dependency-shaped question your retrieval keeps fumbling, and I’ll show you where a wiki-graph would change the answer. Drop it in the comments, or save this for the next time your “AI that knows our domain” turns out to be a well-read stranger.

References

1. Microsoft Azure Architecture Center. “Develop a RAG solution: Chunking phase.” learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase. “Fixed-size chunking… isn’t recommended for text that requires semantic understanding and precise context”; “relevant context that exists across multiple chunks might not be captured.”
2. Edge, D. et al. (Microsoft Research). “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv:2404.16130. arxiv.org/abs/2404.16130. RAG “fails on global questions directed at an entire text corpus… inherently a query-focused summarization task, rather than an explicit retrieval task”; “substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers.”
3. Microsoft Research Blog. “GraphRAG: Unlocking LLM discovery on narrative private data.” microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/. “Baseline RAG struggles to connect the dots… when answering a question requires traversing disparate pieces of information through their shared attributes.”
4. Liu, N. et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172. arxiv.org/abs/2307.03172. Accuracy is highest when relevant information is at the beginning or end of the context and degrades when it is positioned in the middle; replicated across multiple model families.
5. Modarressi, A. et al. (2025). “NoLiMa: Long-Context Evaluation Beyond Literal Matching.” arXiv:2502.05167 (ICML 2025). arxiv.org/abs/2502.05167. “At 32K… 11 models drop below 50% of their strong short-length baselines”; “Even GPT-4o… experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.”
6. Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” trychroma.com/research/context-rot. Across 18 frontier LLMs, “model performance degrades as input length increases, often in surprising and non-uniform ways.”
7. Karpathy, A. (April 2026). “LLM Wiki.” GitHub Gist. gist.github.com/karpathy/442a6bf555914893e9891c11519de94f. “the LLM incrementally builds and maintains a persistent wiki”; “The knowledge is compiled once and then kept current, not re-derived on every query”; lint pass checks for “contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links.”
8. Anthropic (2025). “Effective Context Engineering for AI Agents.” Anthropic Engineering. anthropic.com/engineering/effective-context-engineering-for-ai-agents. “Structured note-taking, or agentic memory, is a technique where the agent regularly writes notes persisted to memory outside of the context window”; “runtime exploration is slower than retrieving pre-computed data.”
9. Park, J. S. et al. (2023). “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv:2304.03442 (UIST 2023). arxiv.org/abs/2304.03442. Agents periodically generate “reflections,” higher-level abstract memories synthesized from observations and stored back into the memory stream; removing reflection degrades long-horizon coherence.
10. Guo, Z. et al. (2024). “LightRAG: Simple and Fast Retrieval-Augmented Generation.” arXiv:2410.05779 (EMNLP 2025). arxiv.org/abs/2410.05779. Integrates graph structures into indexing with “a dual-level retrieval system,” reporting “considerable improvements in retrieval accuracy and efficiency compared to existing approaches.”
11. Anthropic (2025). “Managing context on the Claude Developer Platform.” Anthropic News. anthropic.com/news/context-management. A file-based memory tool lets the model “build up knowledge bases over time… and reference previous learnings without having to keep everything in context.”
12. Duncan, J. (2025). “MCP Context Bloat Analysis.” jduncan.io/blog/2025-11-07-mcp-context-bloat/. Multi-server MCP deployments carry a “50,000+ token baseline before agent interaction”; “GitHub MCP server alone consumes 55,000 tokens for 93 tools.” (Cited for context-bloat calibration only.)

Scott Farrell writes about AI architecture and governance at LeverageAI. The Gmail wiki-graph described here is a personal, multi-year build; the retrieval results are from that project, not a published benchmark. Frameworks referenced — Worldview Recursive Compression, the Cognition Supply Chain, Markdown OS — are at leverageai.com.au.
