RAG Was Built for Chatbots — Agents Need a Wiki

RAG isn’t failing you. It was engineered for the one-shot chatbot turn. Your agents have a different job — they must traverse, write, hold state, and compound — and on every one of those axes a wiki-graph is native and RAG is a mismatch.

Scott Farrell · LeverageAI · A field guide for AI architects & platform engineers

I dropped my car at Tesla for a failing heated seat. The service rep frowned at his screen: the ticket said driver-occupancy sensor, not heated seat. He looked at it, looked at it again, and said, “yeah, I think the AI got that one wrong.” The triage AI had read my complaint and pre-ordered a plausible part. Here’s the thing — the occupancy sensor genuinely is a common upstream cause of heated-seat faults, so the AI wasn’t even being stupid. It just had no memory of the thousands of nearly identical cases that would have told it to stage the heating element too. It had retrieval. It did not have institutional memory. And that gap — not the model, not the prompt — is the thing most teams are about to slam into as they move from chatbots to agents.

For two years, “give the AI memory” has meant one thing: stand up a vector store and retrieve. Retrieval-augmented generation became the default substrate for AI memory so quickly that we stopped asking what it was actually designed to do. So let’s ask. And the honest answer reframes the entire debate — not as RAG versus graphs, but as a tool quietly being run outside the job it was built for.

RAG solved a different problem

Go back to the founding paper. When Lewis and colleagues introduced retrieval-augmented generation in 2020, they framed it precisely: a way to combine “pre-trained parametric and non-parametric memory for language generation,” grounding a single generation in a dense vector index so models “generate more specific, diverse and factual language” on knowledge-intensive tasks.¹ Read that as an engineer. The unit of work is one generation. The shape is retrieve → generate → done. There is no notion of traversing relationships across steps, writing anything back, holding state between turns, or getting smarter over successive runs. There didn’t need to be — because the job was the chatbot turn.

RAG is genuinely excellent at that job. Fast, cheap, deterministic; one embedding call and a similarity search and you have grounded context in milliseconds. If your workload is single-shot question answering — “what does our refund policy say?” — RAG is the right tool and this article is not arguing otherwise.
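The retrieve → generate → done shape is small enough to sketch in full. This is a toy illustration under stated assumptions, not any library's API: the bag-of-words `embed`, the three-chunk `CORPUS` and the `generate` stub are all hypothetical stand-ins for a real embedding model, vector store and LLM call.

```python
import math
import re
from collections import Counter

# Hypothetical corpus; a real system would keep these chunks in a vector store.
CORPUS = [
    "Our refund policy allows returns within 30 days.",
    "Heated seat faults are often traced to the occupancy sensor.",
    "Shipping takes five business days.",
]

def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase words (stand-in for a dense vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(CORPUS, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    """Stub for the LLM call: it only shows what the model would be grounded on."""
    return f"Q: {query}\nContext: {' '.join(context)}"

def rag_turn(query: str) -> str:
    # One embedding, one search, one generation. Nothing is kept afterwards.
    return generate(query, retrieve(query))

print(rag_turn("What does our refund policy say?"))
```

Note what `rag_turn` does not have: no argument carrying prior state in, and no side effect carrying anything out. That is the right design for a single turn.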

But the workload is moving. Anthropic draws the line cleanly: agents are systems where “LLMs dynamically direct their own processes and tool usage,” used for “open-ended problems where it’s difficult or impossible to predict the required number of steps” — and where “the LLM will potentially operate for many turns.”² That isn’t the chatbot turn; it’s the opposite of it. And it’s already here: LangChain’s platform telemetry shows the average number of steps per trace more than doubling in a single year — from 2.8 to 7.7 — as tool-using traces went from a rounding error to a fifth of all activity.³ Their survey of 1,300+ practitioners found about 51% already running agents in production, with performance quality — not cost, not safety — the number-one concern.⁴ Gartner expects roughly a third of enterprise software to embed agentic AI by 2028, up from less than 1% in 2024 — while also predicting that over 40% of agentic projects will be cancelled by end of 2027, citing inadequate foundations among the causes.⁵

The reframe

RAG is chat-native; the wiki is agent-native. The chatbot era assumed one turn, stateless, read-only. The agent era inverts all three — so the retrieval layer itself has to change.

The four things an agent needs from its memory

Strip away the vendor noise and an agentic workload makes four demands on its memory that a single chatbot turn never did. On each one, a wiki-graph — a maintained map of atomic claims and typed edges rather than a pile of text chunks — is native, and RAG was simply never built to deliver. This is the whole argument in one frame:

Need 1

Traverse, don’t look up

Multi-step agents reason across hops. A wiki’s typed edges are navigable paths; the relationship is stored once and walked. RAG returns disconnected chunks and re-derives every relationship from similarity at query time.

Need 2

Write, don’t just read

Agents close loops; they must update memory during and after a run. A wiki is writable — an ingestion agent adds claims, a janitor compacts them. RAG is read-only in the loop; re-embedding is an offline batch, not an in-loop action.

Need 3

Hold state, don’t re-fetch

Long runs need durable state outside the model’s head. The wiki is that external state. RAG holds nothing between steps, and a bigger context window is lossy working memory, not addressable state.

Need 4

Compound, don’t repeat

Each loop should leave the next one smarter. Wiki + janitor compounds — yesterday’s understanding becomes today’s map. RAG returns the same chunks forever, no matter how many times the agent has seen the case.

None of these is a quirk of one product; the agent-memory literature documents each as a real gap and reaches for a graph or a stateful store to fix it.

Traverse. Even a vendor of RAG tooling concedes the point: Microsoft’s GraphRAG team writes that “baseline RAG struggles to connect the dots… when answering a question requires traversing disparate pieces of information through their shared attributes.”⁶ Academically, HippoRAG shows graph-structured memory beating vector RAG on multi-hop question answering “by up to 20%,” and doing the multi-hop in a single retrieval step instead of re-deriving the relationships query by query.⁷ The relationship an agent needs to walk is precomputed once in the graph, but rediscovered, expensively, on every RAG query.

Write & compound. Every serious agent-memory system adds a writable, consolidating layer precisely because retrieval alone can’t update or accumulate. Stanford’s Generative Agents store “a complete record of the agent’s experiences… synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan.”⁸ Reflexion agents “maintain their own reflective text in an episodic memory buffer” and measurably improve across trials.⁹ Mem0 reports a writable, optionally graph-structured memory delivering “26% relative improvements” in quality over an OpenAI baseline, while cutting token cost by over 90% relative to a full-context approach.¹⁰ The reflex objection, “just put it all in a giant context window,” fails on its own terms: “Lost in the Middle” shows accuracy sagging when the relevant fact sits mid-context,¹¹ and Chroma’s testing of 18 frontier models found they all degrade as input grows, “often in surprising and non-uniform ways,” well within their advertised limits.¹²

Hold state. Production agent frameworks already put state in an external, durable store and resume from it — LangGraph’s persistence layer exists so an application can “keep useful information beyond a single graph run.”¹³ MemGPT made the analogy explicit, giving the agent an operating-system-style memory hierarchy outside the context window.¹⁴ And on compounding, a16z says it most bluntly: “retrieval is not learning. A system that can look up any fact has not been forced to find structure.”¹⁵ A read-only index cannot compound, because nothing about the act of using it makes the next use better.
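To make "state outside the model's head" concrete, here is a minimal sketch using only the standard library. It is not LangGraph's or MemGPT's API: the `RunState` class, the JSON file layout and the ticket id are hypothetical, chosen only to show a run checkpointing after each step and a later process resuming from disk.

```python
import json
import tempfile
from pathlib import Path

class RunState:
    """Durable agent state kept in a JSON file rather than in the prompt."""

    def __init__(self, path: Path):
        self.path = path
        if path.exists():
            self.data = json.loads(path.read_text())
        else:
            self.data = {"step": 0, "candidates": []}

    def record(self, candidate: str) -> None:
        self.data["step"] += 1
        self.data["candidates"].append(candidate)
        self.path.write_text(json.dumps(self.data))  # checkpoint after every step

# Hypothetical ticket id; the temp directory stands in for a real durable store.
state_file = Path(tempfile.mkdtemp()) / "ticket-1234.json"

first = RunState(state_file)
first.record("occupancy sensor")

# A restarted agent resumes from the file, not from whatever is left in context.
resumed = RunState(state_file)
resumed.record("heating element")
print(resumed.data)
```

The state here is addressable: any later step can ask for `candidates` by name, which is the property a long context window does not give you.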

Notice what just happened. Four independent research threads, each chasing a different agentic failure, all converge on the same prescription: give the agent something traversable, writable, stateful, and compounding. That something is a graph you maintain — a wiki.

Same loop, two substrates

Abstractions are cheap, so run my Tesla repair as an actual agent loop, twice, and watch the four needs bite.

On RAG. The ticket arrives. The agent embeds “heated seat not working” and searches the service history. It gets back chunks — some manuals, some closed tickets that mention heated seats — and generates a guess. It cannot walk from heated-seat complaint to occupancy-sensor fault to 2023 seat-module change, because no edge exists; each of those is a separate chunk, related only if the model happens to infer it. When the proposal fails a customer-trust check, the agent re-searches the same corpus and gets the same chunks. Nothing is written back. The thousandth identical heated-seat case is handled exactly like the first. The agent is fast, and it is amnesiac.

On a wiki. The agent lands on [[Heated Seat Complaints]] and traverses a typed edge — possible-upstream-cause → [[Driver Occupancy Sensor]] — then to [[Model 3 Seat Module]] and [[First Visit Resolution]]. The relationships were compiled from thousands of closed cases before this ticket arrived. It holds its candidate proposals as state across repair attempts. And when the case closes, the outcome — which part was actually used, whether there was a return visit — writes back. The next ticket starts from a better map.
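The wiki run can be sketched as a tiny typed-edge graph. This is an assumption-laden illustration, not the production structure: the `WikiGraph` class is hypothetical, only the `possible-upstream-cause` edge is named in the walkthrough above (the `part-of` and `tracked-by` edge types are invented for the example), and the written-back outcome is made-up data.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class WikiGraph:
    # edges[page] -> list of (edge_type, target_page)
    edges: dict = field(default_factory=lambda: defaultdict(list))
    # outcomes[page] -> case outcomes written back after a ticket closes
    outcomes: dict = field(default_factory=lambda: defaultdict(list))

    def link(self, src: str, edge_type: str, dst: str) -> None:
        self.edges[src].append((edge_type, dst))

    def walk(self, start: str, max_hops: int = 3) -> list:
        """Breadth-first walk returning (hop, source, edge_type, target) tuples."""
        seen, frontier, path = {start}, [start], []
        for hop in range(1, max_hops + 1):
            next_frontier = []
            for page in frontier:
                for edge_type, dst in self.edges.get(page, []):
                    if dst not in seen:
                        seen.add(dst)
                        next_frontier.append(dst)
                        path.append((hop, page, edge_type, dst))
            frontier = next_frontier
        return path

    def write_back(self, page: str, outcome: dict) -> None:
        self.outcomes[page].append(outcome)

wiki = WikiGraph()
wiki.link("Heated Seat Complaints", "possible-upstream-cause", "Driver Occupancy Sensor")
wiki.link("Driver Occupancy Sensor", "part-of", "Model 3 Seat Module")      # hypothetical edge type
wiki.link("Model 3 Seat Module", "tracked-by", "First Visit Resolution")    # hypothetical edge type

path = wiki.walk("Heated Seat Complaints")
for hop, src, edge_type, dst in path:
    print(f"hop {hop}: [[{src}]] --{edge_type}--> [[{dst}]]")

# When the case closes, the result becomes part of the map (illustrative values).
wiki.write_back("Heated Seat Complaints", {"part_used": "heating element", "return_visit": False})
```

The relationships are stored once by `link` and merely read by `walk`, while `write_back` is the in-loop write that the RAG version of this loop has no place for.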

Same loop. Same model. Two substrates, and only one of them is doing the job the agent actually has. The difference isn’t intelligence; it’s where the relationships live. Which is why the deepest line from this whole investigation is about knowing versus searching:

A naked LLM doesn’t understand the domain. RAG can search the domain. A wiki-graph can begin to know it.

A workshop manual tells you how the car is supposed to work. The wiki captures how it actually fails, how customers describe it, how often the first diagnosis is wrong, which model years changed the failure mode, and the senior technician’s tacit rule — “when the customer says X but the code says Y, don’t trust the obvious part.” That is dependency-shaped, compiled operational experience that plain chunk retrieval structurally cannot see — the problem I’ve called the Cognition Supply Chain: the model is rarely the bottleneck; the supply chain feeding it is.

Why the embedding caps your agent

Here is the part that took building both to see. In RAG, the embedding plays the role of intelligence: turn the query into a vector, compare it against stored vectors, return the nearest blobs. It is brilliant for speed, and it is thin. So however smart your agent gets, it is capped by the RAG search in front of it: it can only act on the nearest few chunks returned by a couple of semi-random queries. The wiki is the opposite. It compounds on the agent’s intelligence, because a smarter agent extracts more from the map, the edges and the source anchors, and leaves the map better for the next one.

That difference shows up in three concrete ways. RAG confuses relevance with authority: a second-hand mention can sit closer in embedding space than the canonical source, and the agent can’t tell. Nearest is not canonical. RAG also can’t tell you what it missed — it has returned chunks, not a map — so it has no real stopping condition. And RAG gets more complex, not less, as you try to make it smarter: bolt on rerankers, metadata, freshness, source-hierarchy, synthetic chunks, and re-index every time. Which produces the punchline of the whole exercise:

The complexity inversion

The more you ask RAG to understand — canonicality, multi-hop, freshness, state, write-back — the more it starts wanting to become a wiki. You end up rebuilding the wiki badly, inside the retrieval layer.
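"Nearest is not canonical" is easy to reproduce with a toy. The sketch below assumes a bag-of-words similarity in place of a real embedding model, and both chunks, their sources and the `canonical` flag are invented for the illustration. The second-hand mention echoes the query's wording, so it scores higher than the policy text that actually holds the answer.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    canonical: bool  # hypothetical metadata; plain similarity search never sees it

def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented example chunks: one authoritative, one second-hand.
CHUNKS = [
    Chunk("Warranty policy: seat heating elements are covered for four years.", "policy-manual", True),
    Chunk("A customer asked if the heated seat is covered by warranty.", "forum-thread", False),
]

query = embed("Is the heated seat covered by warranty?")

# Pure similarity: the chunk that repeats the question wins.
by_similarity = sorted(CHUNKS, key=lambda c: cosine(query, embed(c.text)), reverse=True)

# Authority-aware ordering needs structure the embedding does not carry.
by_authority = sorted(CHUNKS, key=lambda c: (c.canonical, cosine(query, embed(c.text))), reverse=True)

print("nearest:  ", by_similarity[0].source)
print("canonical:", by_authority[0].source)
```

The fix in `by_authority` is a single metadata flag here, which is exactly the first bolt-on in the list above. Each further requirement adds another field the similarity math cannot supply on its own.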

RAG hides its intelligence inside similarity math you can’t inspect. The wiki externalises intelligence as structure you can read: claims, typed edges, canonical references, the reasons an edge exists. I’ve made the architecture-and-economics case for that structure in full elsewhere — The Index Is the Data — so I won’t re-derive the engine here. The point for this argument is narrower: structure you can read is structure a stronger agent can compound on.

From searcher to navigator

The most convincing evidence I have isn’t a benchmark; it’s a behaviour I watched. I built a reviewer agent that gardens a wiki-graph. Early on, when the graph was sparse, it leaned hard on broad search — it didn’t know the terrain, so it had to hunt. Then I changed the first prompt: instead of just handing it a search tool, I injected the map — the root pages, a short description of each, and the list of available ebooks. From that point its behaviour changed. As the graph connected up over successive runs, the broad-search calls fell away almost entirely. It did a few targeted page reads, anchored back to the source ebooks, and navigated by edges. It had become a navigator instead of a searcher.

That phase transition is the thesis in miniature, and it taught me three things worth keeping. The map matters more than the index — a wiki gives the agent a map before the question; RAG gives it fragments after the question. The semantics live in the descriptions: IDs locate, titles name, descriptions orient — and a tool the agent has but can’t inventory is not really available to it. And falling search is a quality signal, but only the right kind: broad search should decline as the graph matures while source-anchoring stays high; search that disappears because the model got overconfident is a different, worse thing.
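The "right kind" of falling search can be checked from tool-call traces. This is a rough sketch, assuming each run is logged as a flat list of tool names. The names `search`, `read_page` and `read_source`, the `min_anchor` threshold and the sample traces are all hypothetical.

```python
from collections import Counter

def tool_shares(calls: list) -> dict:
    """Fraction of a run's tool calls that went to each tool."""
    total = len(calls)
    counts = Counter(calls)
    return {tool: n / total for tool, n in counts.items()} if total else {}

def navigator_signal(early: list, late: list, min_anchor: float = 0.2) -> str:
    """Compare an early run with a later one on the same graph."""
    e, l = tool_shares(early), tool_shares(late)
    search_fell = l.get("search", 0.0) < e.get("search", 0.0)
    still_anchored = l.get("read_source", 0.0) >= min_anchor
    if search_fell and still_anchored:
        return "navigating"       # broad search declined, sources still consulted
    if search_fell:
        return "overconfident"    # search vanished, and so did source anchoring
    return "still searching"

# Made-up traces for illustration only.
early_run = ["search"] * 6 + ["read_page"] * 2 + ["read_source"] * 2
mature_run = ["search"] * 1 + ["read_page"] * 5 + ["read_source"] * 4
drifting_run = ["read_page"] * 9 + ["read_source"] * 1

print(navigator_signal(early_run, mature_run))
print(navigator_signal(early_run, drifting_run))
```

Tracking the two shares separately is the design choice that matters: a single "search calls per run" number cannot tell a navigator from an agent that has simply stopped checking.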

Blob — documents exist but aren’t mapped.
Catalogue — pages and sources are listed.
Description — each page has a short semantic role (the real payload).
Edge — pages are connected by typed claims and relationships.
Navigator — agents use the map and edges instead of broad search.
Reviewer — agents improve the map for the agents that come next.
Governance — traces show which map, edges and sources were used.

Read those seven lines as a ladder, from Blob at the bottom to Governance at the top. RAG mostly lives on the bottom rung and stays there. Every query starts cold, assembles context around the question, and leaves the corpus no better organised than it found it. The wiki climbs, and the climbing is the compounding.

Keep RAG for what it’s best at

This is the part the “graphs win” crowd skips, and it’s the part that makes the rest credible. The wiki is not free and it is not always right. It needs an LLM as its intelligence layer, which makes it slower and more expensive per query than a vector search. It has a higher setup cost. And self-maintaining is not unsupervised: a janitor agent can over-compress two distinct failure modes or calcify bad lore into a confident claim, so consolidations need human-auditable diffs and reverts. A wiki you don’t tend will mislead you with more authority than a pile of chunks ever could.
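One way to get human-auditable diffs and reverts is to keep every revision of a page and hand the reviewer a unified diff. The sketch below uses Python's standard `difflib`. The `ClaimPage` class and the sample claims are hypothetical, and the "consolidation" shown is deliberately the bad kind, merging two distinct failure modes into one overconfident claim.

```python
import difflib

class ClaimPage:
    """A wiki page that keeps every revision so a consolidation can be audited and undone."""

    def __init__(self, title: str, claims: list):
        self.title = title
        self.history = [list(claims)]

    @property
    def claims(self) -> list:
        return self.history[-1]

    def consolidate(self, new_claims: list) -> str:
        """Apply a janitor rewrite and return a unified diff for human review."""
        old = self.claims
        self.history.append(list(new_claims))
        return "\n".join(
            difflib.unified_diff(old, new_claims, "before", "after", lineterm="")
        )

    def revert(self) -> None:
        """Drop the latest revision, keeping at least the original."""
        if len(self.history) > 1:
            self.history.pop()

page = ClaimPage(
    "Heated Seat Complaints",
    [
        "Occupancy sensor is a common upstream cause.",
        "Heating element failure presents with the same complaint.",
    ],
)

# An over-compressing janitor collapses two failure modes into one claim.
diff = page.consolidate(["Heated seat faults are caused by the occupancy sensor."])
print(diff)

# The reviewer rejects it, and the page returns to its prior state.
page.revert()
print(page.claims)
```

The diff makes the lost claim visible as a removed line, which is what lets a human catch the over-compression before it hardens into lore.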

So this was never “wiki beats RAG.” It’s a question of fit. RAG is a fast recall layer; the wiki is a slow cognition layer. Keep RAG for exactly what it’s brilliant at: single-shot lookups, fuzzy recall of a forgotten mention, freshness, exact figures, and the long tail of documents not worth pre-digesting. The mature architecture isn’t a winner; it’s an assignment of each tool to the job it was built for:

RAG is the index. The wiki is the map. The LLM is the navigator. The DAG is the law. The receipt is the proof.

Even the figures should split along the same seam: keep durable relationships in the graph (“occupancy sensor is a common upstream cause under these conditions”) and route to live source systems for the exact current percentage, because stale numbers are dangerous while relationships are directionally durable. RAG for reflex; the wiki for reasoning; the source of record for the number.
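That three-way split can be written as a small router. This is a sketch under assumptions: the `kind` label is something a hypothetical upstream classifier would assign, `WIKI_EDGES` holds one illustrative relationship from the Tesla example, and `fake_live_lookup` is a placeholder whose return value is not real data.

```python
from typing import Callable

# Durable relationships compiled into the wiki (one illustrative entry).
WIKI_EDGES = {
    ("heated seat complaint", "occupancy sensor"): "common upstream cause",
}

def route(kind: str, key, live_lookup: Callable[[object], float]) -> tuple:
    """Send each question to the layer built for it."""
    if kind == "relationship":
        return ("wiki", WIKI_EDGES.get(key))             # reasoning: walk the map
    if kind == "figure":
        return ("source_of_record", live_lookup(key))    # exact number: ask the live system
    return ("rag", None)                                 # reflex: fall through to vector search

def fake_live_lookup(metric) -> float:
    # Placeholder for a query against the system of record; 0.0 is not a real figure.
    return 0.0

print(route("relationship", ("heated seat complaint", "occupancy sensor"), fake_live_lookup))
print(route("figure", "first_visit_resolution_rate", fake_live_lookup))
print(route("recall", "forgotten mention of seat module", fake_live_lookup))
```

The number is never cached in the graph, so a stale percentage cannot ride along with a relationship that is still directionally true.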

This also reframes what “AI learning” even means. People imagine learning as something baked into a bigger model. But you cannot fine-tune inside a loop on every closed case — you can write to a wiki, instantly, cheaply, inspectably, reversibly. The neighbouring doctrine fills in the rest: the loop is robust because the durable state lives outside the agent (Designing Loops, Not Prompts), and once the knowledge is an external, versioned artefact you can also govern it — replay exactly what the agent knew when it acted (The Model Is Not the Memory). The line that ties it together:

The takeaway

Closed-loop AI doesn’t mean the model learns — it means the organisation remembers. So as your AI goes agentic, change the memory, not just the model: stand up a wiki-graph as the substrate your agents traverse, write, hold and compound — and keep RAG for the single-shot lookups it’s still best at.

RAG isn’t failing you. It did the job it was built for — the chatbot turn — about as well as that job can be done. Your agents just have a different job. Give them the memory it actually requires.

References

The origin & the shifting workload

[1] Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020 (arXiv:2005.11401). RAG combines “pre-trained parametric and non-parametric memory for language generation” and models “generate more specific, diverse and factual language.” https://arxiv.org/abs/2005.11401
[2] Anthropic. “Building Effective Agents” (19 Dec 2024). Agents are used for “open-ended problems where it’s difficult or impossible to predict the required number of steps” and “the LLM will potentially operate for many turns.” https://www.anthropic.com/research/building-effective-agents
[3] LangChain. “State of AI 2024” (LangSmith telemetry, 19 Dec 2024). Average steps per trace rose from 2.8 (2023) to 7.7 (2024); ~21.9% of traces now involve tool calls, up from 0.5%. https://www.langchain.com/blog/langchain-state-of-ai-2024
[4] LangChain. “State of AI Agents” (survey of 1,300+ professionals). “About 51% of respondents are using agents in production today”; performance quality is the top concern. https://www.langchain.com/stateofaiagents
[5] Gartner. “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027” (press release, 25 Jun 2025); related forecast: ~33% of enterprise software applications will include agentic AI by 2028, up from less than 1% in 2024. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

Traverse — multi-hop / relational reasoning

[6] Larson, J. & Truitt, S. “GraphRAG: Unlocking LLM discovery on narrative private data.” Microsoft Research Blog (13 Feb 2024). “Baseline RAG struggles to connect the dots… when answering a question requires traversing disparate pieces of information through their shared attributes.” https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
[7] Jiménez Gutiérrez, B. et al. “HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs.” NeurIPS 2024 (arXiv:2405.14831). Knowledge-graph retrieval “outperforms the state-of-the-art methods remarkably, by up to 20%” on multi-hop QA, in a single retrieval step. https://arxiv.org/abs/2405.14831

Write & compound — writable, consolidating memory

[8] Park, J.S. et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv:2304.03442. Stores “a complete record of the agent’s experiences… synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior.” https://arxiv.org/abs/2304.03442
[9] Shinn, N. et al. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv:2303.11366. Agents “maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials.” https://arxiv.org/abs/2303.11366
[10] Chhikara, P. et al. “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.” ECAI 2025 (arXiv:2504.19413). Dynamically “extracting, consolidating, and retrieving salient information”; graph variant; “26% relative improvements” over OpenAI baseline with >90% token-cost saving. https://arxiv.org/abs/2504.19413
[15] Aubakirova, M. & Bornstein, M. “Why We Need Continual Learning.” Andreessen Horowitz (a16z), 22 Apr 2026. “Retrieval is not learning. A system that can look up any fact has not been forced to find structure.” https://a16z.com/why-we-need-continual-learning/

Hold state — long context is not durable memory

[11] Liu, N.F. et al. “Lost in the Middle: How Language Models Use Long Contexts.” TACL Vol. 12, 2024 (arXiv:2307.03172). Performance “significantly degrades when models must access relevant information in the middle of long contexts” (a U-shaped curve). https://aclanthology.org/2024.tacl-1.9/
[12] Hong, K., Troynikov, A. & Huber, J. “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” Chroma Research (14 Jul 2025). Across 18 frontier models, “performance grows increasingly unreliable as input length grows,” often within declared context windows. https://www.trychroma.com/research/context-rot
[13] LangChain. “LangGraph: Persistence & durable execution” (documentation). “Persistence lets LangGraph applications keep useful information beyond a single graph run,” via checkpointers (short-term) and stores (long-term, cross-thread). https://docs.langchain.com/oss/python/langgraph/persistence
[14] Packer, C. et al. “MemGPT: Towards LLMs as Operating Systems.” arXiv:2310.08560. “Virtual context management… drawing inspiration from hierarchical memory systems in traditional operating systems,” paging durable memory in and out of the context window. https://arxiv.org/abs/2310.08560

LeverageAI — prior work (the author’s own frameworks; ideas, not statistics)

Farrell, S. “The Index Is the Data: How a Self-Cleaning Wiki-Graph Out-Thinks RAG.” LeverageAI. https://leverageai.com.au/the-index-is-the-data-how-a-self-cleaning-wiki-graph-out-thinks-rag/
Farrell, S. “Designing Loops, Not Prompts: A Field Guide to Agentic Loops and Who Holds the State Machine.” LeverageAI. https://leverageai.com.au/designing-loops-not-prompts-a-field-guide-to-agentic-loops-and-who-holds-the-state-machine/
Farrell, S. “The Model Is Not the Memory: Why Governable AI Needs a Wiki, Not Just RAG.” LeverageAI. https://leverageai.com.au/the-model-is-not-the-memory-why-governable-ai-needs-a-wiki-not-just-rag/
Farrell, S. “The Cognition Supply Chain: From Search to Compounding Agentic Cognition.” LeverageAI. https://leverageai.com.au/the-cognition-supply-chain-from-search-to-compounding-agentic-cognition/

External statistics and quotations are drawn from the primary sources above (each directly retrieved); framework ideas are the author’s own prior work and are cited for further reading rather than as evidence. URLs are plain text for verification.
