AI Architecture · LeverageAI · Model Mechanics

Why LLMs Can Walk a Wiki but Can’t Drive a RAG

Your agent explores a wiki cleanly and thrashes doing RAG searches — repeating itself, missing the obvious, unsure when to stop. It isn’t the prompt and it isn’t the model. Following a named link is the most in-distribution thing an LLM ever does; querying an embedding space it can’t see is the opposite. And the map holds the one thing the model is worst at holding: where it has and hasn’t been.

Scott Farrell · LeverageAI · A field note for people building agents

The short version

The observation: point the same cheap model at a wiki and at a RAG index. On the wiki it never asks for the same thing twice. On RAG it loops — three queries, the same chunk back twice, and it still can’t tell whether it’s done.
The mechanism: link-following is in-distribution (LLMs were pretrained on the hyperlinked web), and the map externalises the navigation — visited pages sit in the transcript, unvisited territory is spelled out as named edges. RAG forces the model to guess query strings against a space it can’t see, and to keep “what have I already seen” in its head.
The consequence: a low-thinking model walks a good map accurately, because the navigation-thinking was moved onto the artifact. Redundancy in the RAG loop is structural — the top-k nearest chunks are, by construction, near each other.

I run an agent that reads my email on a cheap, fast model — a mini, on low reasoning effort — and it does a genuinely good job, because most of the intelligence lives in a wiki I built, not in the model. The same model, pointed at a plain vector search over the same material, falls apart in a way that’s almost comic to watch: it fires off a query, gets a fistful of chunks, fires off a slightly reworded query, gets most of the same chunks back, reasons about whether to try again, tries again, and eventually stops — not because it’s confident it’s covered the ground, but because it’s run out of ideas for what to type. On the wiki it behaves like someone who knows the building. On RAG it behaves like someone shouting into a warehouse and grabbing whatever echoes back. Same model. Same knowledge. The difference is entirely in the shape of the retrieval — and once you see why, you stop reaching for a smarter model or a better prompt to fix it.

The thing you keep noticing

Anyone who has watched an agent do both will have seen the asymmetry, even if they haven’t named it. Give the agent a wiki — a set of pages with named, typed links between them — and its exploration is clean. It reads a page, follows a link to the next, follows another, and stops. It doesn’t re-open a page it just read. It doesn’t circle. The path looks like a person browsing.

Give the same agent a RAG index over the same corpus and the behaviour changes character. It issues a query, gets back the top handful of chunks, then has to decide what to do next with no map of what it just saw or what it missed. So it guesses another query. Much of what comes back is what it already had. It has no reliable way to know whether the important thing is in the pile or three queries away, so it either quits early and misses it, or keeps grinding and repeats itself. The redundancy is the tell, and it’s the same redundancy every time.

The lazy explanation is “the model isn’t smart enough” or “the prompt needs work.” But the cheap model was smart enough for the wiki, with the same prompt. Something about RAG specifically is fighting the model, and something about walking a wiki specifically is with the grain. Both halves have a mechanical cause.

The same question, two ways

Here is a real question I ask my own system, because I write job applications with it: “Where have I built voice AI, and how does it connect to my agent frameworks?” Watch the same model answer it against a wiki and against a RAG index over the identical set of project notes and articles.

Walking the wiki

Reads the map that’s injected at the start — root pages plus one-line descriptions. It can already see there’s a voice-ai page and a lane-doctrine framework page.
Opens voice-ai. The page is a summary written to be read, and it carries typed edges: → twilio-ivr project, → openai-realtime project, → lane-doctrine (fast–slow split), → context-arbitrage.
Follows the two project edges. Both pages are new. Now it has the implementations.
Follows the edge to lane-doctrine. That’s the other half of the question — how voice connects to the frameworks — handed to it as a named link, not something it had to think to look for.
Stops. The map shows nothing unvisited still carrying scent. Five reads, five unique pages, zero repeats — and both halves of the question answered.
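
The walk above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system described here: the `WIKI` dictionary, the edge-type labels, and the `WORTH_READING` set are hypothetical. `WORTH_READING` is a hard-coded stand-in for the model's judgement about which named links carry scent; only the page names come from the walkthrough.

```python
from collections import deque

# Hypothetical wiki: page -> list of (edge_type, target_page).
# Page names come from the walkthrough; the structure is illustrative.
WIKI = {
    "map": [("root", "voice-ai"), ("root", "lane-doctrine")],
    "voice-ai": [
        ("project", "twilio-ivr"),
        ("project", "openai-realtime"),
        ("framework", "lane-doctrine"),
        ("framework", "context-arbitrage"),
    ],
    "twilio-ivr": [("topic", "voice-ai")],
    "openai-realtime": [("topic", "voice-ai")],
    "lane-doctrine": [("topic", "voice-ai")],
    "context-arbitrage": [],
}

# Stand-in for the model's choice of which links are worth following.
WORTH_READING = {"voice-ai", "twilio-ivr", "openai-realtime", "lane-doctrine"}


def walk(start, worth_reading):
    """Follow named edges until no unvisited, worthwhile edge remains."""
    visited = []               # pages already read, in order (the transcript)
    frontier = deque([start])  # named edges seen but not yet followed
    while frontier:
        page = frontier.popleft()
        visited.append(page)
        for _edge_type, target in WIKI[page]:
            already_known = target in visited or target in frontier
            if not already_known and target in worth_reading:
                frontier.append(target)
    return visited


path = walk("map", WORTH_READING)
print(path)
print(f"{len(path)} reads, {len(set(path))} unique pages")
```

The sketch reads breadth-first, so the order differs slightly from the walkthrough, but the counts match: five reads, five unique pages, no repeats. The stopping condition is nothing more than an empty frontier.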

Driving the RAG loop

Query "voice AI implementation" → top-8 chunks. Five are near-duplicate passages from the same Twilio README; three are generic mentions of “voice.” Call it three distinct ideas out of eight.
It can’t see what it missed, so it rewords: "realtime voice agent" → top-8. Six overlap the first pull; two are new.
Still unsure, it tries "voice bot code" → top-8. Seven overlap; one new.
The framework connection (the fast–slow split) never surfaces. No chunk about it sits near any of the three queries in embedding space, so half the question is silently dropped, and RAG has no way to report that it dropped it.
Stops — not because it’s covered the ground, but because it’s out of phrasings. Twenty-four chunks retrieved, nine distinct, fifteen redundant (≈63% waste), and the question half-answered.

Line the two up and the difference isn’t quality of reasoning; it’s where the reasoning had to happen.

Same model, same corpus	Wiki walk	RAG loop
Retrievals	5 page reads	3 queries → 24 chunks
Distinct content	5 unique	9 unique
Redundant pulls	0	15 (≈63%)
Coverage of the question	Both halves	Frameworks half dropped, silently
Why it stopped	No unvisited edge left with scent	Ran out of phrasings to guess
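
The redundancy row is plain arithmetic on the totals in the table. A small sketch makes the calculation explicit; the function name `redundancy` is hypothetical, and the inputs are the figures quoted above.

```python
def redundancy(retrieved, distinct):
    """Return (redundant_count, redundant_fraction) for one retrieval run."""
    if retrieved <= 0 or not 0 <= distinct <= retrieved:
        raise ValueError("need retrieved > 0 and 0 <= distinct <= retrieved")
    redundant = retrieved - distinct
    return redundant, redundant / retrieved


# RAG loop from the table: 3 queries x top-8 = 24 chunks, 9 distinct.
rag_redundant, rag_fraction = redundancy(3 * 8, 9)

# Wiki walk from the table: 5 page reads, 5 unique pages.
wiki_redundant, wiki_fraction = redundancy(5, 5)

print(f"RAG:  {rag_redundant} redundant ({rag_fraction:.1%})")
print(f"Wiki: {wiki_redundant} redundant ({wiki_fraction:.1%})")
```

The RAG figure comes out at 15 of 24, or 62.5%, which the table rounds to 63%.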

Two mechanics produce that whole table. The first is what the model was trained to do. The second is where the “what have I seen” bookkeeping lives.

Following a link is in-distribution. Guessing a query is not.

Start with pretraining, because that’s where an LLM’s instincts come from. These models were built by ingesting an enormous quantity of the web — and the web is a hyperlinked document. Some of the most influential training corpora were literally assembled by following links: the crawl starts from a page and walks its outbound links to the next page. A language model spends its formative existence reading documents that were written for a human to read, threaded together by named links that a human was expected to follow. So when you hand a running model a page written to be read and a set of named links to choose from, you are asking it to do the single most rehearsed action in its entire training history. There is no genre it has seen more of.

Now look at what RAG asks instead. To retrieve, the model has to emit a query string that will land near the right chunks in a high-dimensional embedding space — a space it cannot see. It has no view of the vectors, no view of what’s stored, no feedback on why a query returned what it did. It is guessing the magic words for a lock whose mechanism is invisible to it. Nothing in pretraining rehearses that; people don’t write “here is the query that would retrieve me” on their documents. The task is out-of-distribution, so the model does what models do off-distribution: it flails, plausibly.

Following a named link is reading a signpost. Writing a query is guessing a password — for a lock you’re not allowed to look at.

You don’t have to take this as theory; the industry has already voted with its coding agents. Claude Code deliberately ships with no vector index over your repo: it navigates the file tree agentically, reading and following references, and accepts spending more tokens to do it (a reported ~40% more), on the bet that walking beats querying for a model.⁴ Aider builds a “repo map,” a structural graph of your code ranked by importance with no embeddings at all, and hands the model that map to navigate.¹ The index-first camp, Cursor being the clearest example, chunks the code, embeds it, and stores the vectors for similarity search.² The benchmarks on that camp are a quiet confession: Recall@5 swings from about 42% to about 70% depending on how you chunk, quality peaks at five to ten retrieved chunks and degrades once chunks grow past roughly 2,500 tokens, and hybrid search improves recall by another 15–30%.³ Those are all knobs the model can’t see and didn’t set. Meanwhile Cognition’s DeepWiki and Karpathy’s LLM-wiki both point the other way entirely: pre-build a navigable, cross-linked wiki and let the agent (and the human) explore it by following links.⁵ ⁶ Much of the field is rediscovering that the retrieval the model handles best is the one that looks like the web it was trained on.

The map holds the state the model is worst at holding

In-distribution explains why each step is easier. It doesn’t yet explain the redundancy — the asking-twice. That comes from a second, separate thing: where the navigation state lives.

Any multi-step search has to track two running facts: what have I already seen, and what’s still out there unseen. Those are exactly the two things a language model is worst at keeping straight across a long context — it has no reliable internal ledger, and things in the middle of a growing transcript blur. A wiki walk never asks the model to hold either one in its head. The pages it has visited are simply in the transcript — already read, already cached, visibly behind it. The territory it hasn’t visited is enumerated on each page as named, typed edges: not “there might be more somewhere,” but “here are the specific unread pages this one points to, and what each is about.” The bookkeeping is on the page, not in the head. The model is left with the one job it’s genuinely excellent at: reading what’s in front of it and deciding which named thing to read next.

RAG — navigation runs in the model’s head

What did I already retrieve? Held in-context, blurrily.
What haven’t I seen? Unknowable — RAG can’t report what it didn’t return.
Am I done? A guess, so it over- or under-searches.
Next move? Invent another query string, blind.

Wiki — navigation runs on the page

Visited = the transcript. Already there, already cached.
Unvisited = named, typed edges carrying scent.
Done = no unvisited edge left worth following.
Next move = pick a signpost that’s already written down.
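
The contrast between the two lists can be made concrete by asking what state is recoverable from the transcript alone. The sketch below is an assumption-laden illustration: the record shapes (`{"page": ..., "edges": [...]}` for a wiki read, a list of chunk ids per RAG query) and both function names are hypothetical, and "done" is simplified to "no unvisited edge at all" rather than "no unvisited edge worth following".

```python
def wiki_state(transcript):
    """Derive navigation state purely from pages already in the transcript."""
    visited = [entry["page"] for entry in transcript]
    unvisited = []
    for entry in transcript:
        for target in entry["edges"]:
            if target not in visited and target not in unvisited:
                unvisited.append(target)
    return {"visited": visited, "unvisited": unvisited, "done": not unvisited}


def rag_state(pulls):
    """Same question for RAG: `pulls` holds one list of chunk ids per query."""
    seen = []
    for pull in pulls:
        for chunk_id in pull:
            if chunk_id not in seen:
                seen.append(chunk_id)
    # Nothing in the results enumerates what was NOT returned,
    # so "unseen" and "done" cannot be computed from the transcript.
    return {"seen": seen, "unseen": None, "done": None}


wiki = wiki_state([
    {"page": "voice-ai", "edges": ["twilio-ivr", "lane-doctrine"]},
    {"page": "twilio-ivr", "edges": ["voice-ai"]},
])
rag = rag_state([["c1", "c2"], ["c2", "c3"]])

print(wiki)
print(rag)
```

For the wiki, all three fields fall out of what has already been read. For RAG, only the "seen" list does; the other two have to be guessed.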

This is why the redundancy is structural and not a tuning failure you can re-rank your way out of. RAG returns the k nearest chunks to your query. The k nearest chunks to a query are, by construction, also near each other — that’s what “nearest” means. So your second, slightly-reworded query lands in the same neighbourhood and hands back much of the same pile. Top-k similarity is a photocopier with a threshold: ask twice in the same region and you get two copies. The wiki has no equivalent failure because an edge is followed once — the relationship is stored, not recomputed from similarity every time the agent wonders about it.
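
A toy example shows the overlap mechanically. Everything here is hypothetical: the two-dimensional vectors stand in for real embeddings, the chunk names are invented, and `top_k` is a brute-force cosine ranking rather than any particular vector database's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query, corpus, k):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(corpus, key=lambda cid: cosine(query, corpus[cid]), reverse=True)
    return ranked[:k]


# Toy corpus: three near-duplicate README chunks, one loosely related
# chunk, and two chunks that live in other regions of the space.
corpus = {
    "readme-1": (1.00, 0.10),
    "readme-2": (1.00, 0.12),
    "readme-3": (1.00, 0.15),
    "voice-misc": (0.90, 0.40),
    "framework": (0.10, 1.00),
    "billing": (-1.00, 0.20),
}

first_query = (1.0, 0.0)      # the original phrasing
reworded_query = (1.0, 0.2)   # a slight rewording lands nearby

first = top_k(first_query, corpus, k=3)
second = top_k(reworded_query, corpus, k=3)
overlap = set(first) & set(second)

print(first)
print(second)
print(f"{len(overlap)} of 3 results repeated")
```

Both queries return the same three README chunks, in a different order, and the `framework` chunk never appears because it sits in a different region. Rewording moves the query a little; it does not move it out of the neighbourhood.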

Why the cheap model stays accurate

Put the two mechanics together and the economics fall out. If every step is in-distribution and the navigation state lives on the artifact, then what’s actually left for the model to do is small and squarely in its wheelhouse: read this page, pick the next link. That is a task a mini model on low reasoning effort can do all day without getting lost. The hard part — deciding what relates to what, and what’s worth reading next — was pre-computed once, by whatever built the map, and frozen into the edges. The model isn’t re-deriving the structure of the domain on every query; it’s walking structure that’s already there.

So the floor on model quality drops. The question stops being “is this model smart enough to reason its way to the right search terms and stitch fragments together” and becomes “can it still walk the map without getting lost” — a far lower bar. That’s the same move I’ve called context arbitrage elsewhere: pay once to compile a good map, then capture the price gap between frontier and utility models on every task afterwards. RAG points the incentive the other way — because the navigation is the model’s problem, a harder corpus quietly demands a smarter (pricier) model to keep the guessing accurate. Walk a good map and a dumb model suffices; drive a RAG and you keep buying intelligence to compensate for a missing map.

The rule this leaves you

You don’t need a smarter model to walk a good map. You need a good map so a dumber model can walk it. Externalise the navigation and the model’s job shrinks to the thing it’s best at — reading.

This is “code beats MCP,” one level up

If this feels familiar, it should. I’ve argued before that code-first agents beat MCP: letting a model write code to call your tools outperforms handing it a wall of tool-schemas, because writing code is something the model saw endlessly in training and calling a bespoke tool protocol is not. This is the same argument, moved from action to retrieval. Meet the model where its training distribution already lives. Writing code > invoking a tool protocol, for acting; following a named link > formulating an embedding query, for retrieving. Same principle, one level up: stop asking the model to operate an interface it never learned, when an interface it knows by heart does the same job.

Where this stops

Keep the claim honest, because it’s easy to over-read. This is not “wikis beat RAG.” It’s a statement about model mechanics (training distribution and where state lives), and it explains a specific, observable pain: multi-step exploration that thrashes and repeats. RAG is genuinely excellent at the job it was built for: fast, single-hop, reflex lookups where one query is the whole interaction and there’s no navigation to track. In the hot path of a live voice call, that speed isn’t a nice-to-have, it’s the only thing that fits the latency budget. The point is not that similarity search is bad; it’s that driving a multi-hop retrieval loop is out-of-distribution work that you can either leave in the model’s head or externalise onto a map, and the redundancy tells you which one you chose.

There’s a real and separate question about when RAG is the right substrate to build on in the first place — the cases where a maintained map isn’t worth its upkeep and pure retrieval is the correct call. That’s a boundary worth drawing carefully, and it’s a different article. And the substrate case for a wiki — traversal, write-back, held state, compounding — I’ve made next door, in RAG Was Built for Chatbots. This piece is the complement to that one: not what the substrate can do, but why the model finds one shape native and the other alien. Two different whys for the same instinct. If your agent thrashes on RAG and glides on a wiki, you now have the mechanical reason — and the reason isn’t going to be fixed by a bigger model. It’s going to be fixed by moving the navigation off the model and onto the page.

Building agents that thrash on retrieval?

If your agent loops, repeats itself, and can’t tell when it’s done, that’s rarely a prompt problem and almost never a model problem — it’s a retrieval-shape problem. At LeverageAI we compile a body of work into a navigable wiki-graph first, then point cheap, fast agents at it: in-distribution retrieval, state on the page, and a model floor low enough to run on utility-class tokens. Talk to us about giving your agents a map to walk.

References

[1] Aider, “Building a better repository map with tree-sitter.” Aider ranks code by importance using a graph and PageRank, with no embeddings, and hands the model a structural map to navigate rather than a vector index to query. aider.chat/docs/repomap.html
[2] Engineer’s Codex, “How Cursor indexes codebases fast.” The index-first approach: tree-sitter chunking, embeddings computed and stored in a remote vector database (Turbopuffer), and Merkle-tree change detection; retrieval as similarity search over an opaque space. read.engineerscodex.com/p/how-cursor-indexes-codebases-fast
[3] Codebase-retrieval benchmarks (2026 roundup). AST-aware chunking achieves 70.1% Recall@5 versus 42.4% for fixed-size chunking; retrieval quality peaks at 5–10 chunks and degrades above ~2,500 tokens per chunk; hybrid search improves recall 15–30% over single-method retrieval. All knobs invisible to the model at query time. zylos.ai/research/2026-04-19-codebase-intelligence-repository-understanding-ai-agents
[4] Codebase-intelligence research (2026). Claude Code deliberately ships without RAG, compensating with agentic file exploration (reported ~40% more tokens): the frontier coding agent walks the tree and follows references rather than querying an index. zylos.ai/research/2026-04-19-codebase-intelligence-repository-understanding-ai-agents
[5] Cognition AI, DeepWiki. Auto-generates a navigable, cross-linked wiki over any repository so developers and agents can understand it by following links and reading pages, rather than querying source. cognition.com/blog/deepwiki
[6] Andrej Karpathy, “LLM wiki” note (April 2026). An agent reads each source and integrates it into an interlinked collection of markdown files; the human browses the results by following [[wikilinks]] and the graph view, with link-following as the native interaction with a knowledge base. gist.github.com/karpathy
[7] LeverageAI, related canon (named for context, not statistics): RAG Was Built for Chatbots — Agents Need a Wiki (the substrate-properties sibling to this mechanics argument); Why Code-First Agents Beat MCP (the same meet-the-distribution argument, applied to action); The Index Is the Data (the self-cleaning wiki-graph the map is compiled into); Context Arbitrage (why the compiled map lets a utility-class model do the work). leverageai.com.au
