Knowledge is a Tool: RAG for Agentic Systems

Scott Farrell · November 4, 2025

Your AI agent just hallucinated. Again.

You asked: “What’s our refund policy for enterprise customers?”

It replied: “Based on common industry practices, enterprise customers typically have a 30-day refund window…”

Wrong. Your actual policy is 60 days, documented in your internal knowledge base. The agent guessed. The customer is angry. You have a problem.

The core issue: agents rely on parametric memory (training data) when they should be reaching for verified, current knowledge.

RAG (Retrieval-Augmented Generation) transforms agents from hallucination-prone generalists into knowledge-grounded specialists. Instead of guessing from training data, they pull facts from an internal knowledge base—your API docs, internal processes, codebase, policies—and cite their sources.

In 2025, RAG isn’t optional infrastructure. It’s a tool your agent reaches for, like Playwright for browser testing or HTTP for API calls. The agent decides: “I need grounded facts” → reaches for the knowledge base → returns cited, accurate answers.

What RAG Actually Does

Traditional agent flow:

User: “What’s the rate limit for the /api/users endpoint?”
Agent: “Typically REST APIs limit to 100 requests/minute…” [Guess]

RAG-enabled agent flow:

User: “What’s the rate limit for the /api/users endpoint?”
Agent: [Searches knowledge base] → Finds: api_docs.md, section “Rate Limits”
Agent: “The /api/users endpoint has a rate limit of 1000 requests/hour per API key. (Source: api_docs.md, Rate Limits section)”

Same question. One guesses. One knows.

The RAG Pipeline: Ingest → Query → Cite → Generate


Key technical decisions in the pipeline:

Chunking strategy (makes or breaks accuracy):

  • Dynamic chunking: Adjust size based on query intent (short for facts, long for context)

  • Hierarchical chunking: Section → paragraph → sentence (preserves document structure)

  • Embedding-aware chunking: Use embeddings to detect topic boundaries

  • Metadata-driven chunking: Tag with author, date, confidence, version

Typical starting point: 512 tokens per chunk (small enough for precise retrieval, large enough to preserve local context, and comfortably within embedding model input limits).
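For the fixed-size starting point, the logic is only a few lines. A minimal sketch, assuming the tiktoken tokenizer (any tokenizer your stack uses works the same way):

```python
# Fixed-size chunking: 512-token windows with 50-token overlap.
# Assumes the tiktoken package; swap in whatever tokenizer you already use.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```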

Embedding models:

  • OpenAI text-embedding-ada-002 (1536 dimensions, strong general-purpose)

  • Cohere embeddings (multilingual, domain-specific fine-tuning)

  • Open-source alternatives (sentence-transformers, cheaper but less accurate)

Vector databases:

  • Pinecone: managed, scales to billions of vectors

  • Chroma: open-source, embedded, great for prototyping

  • FAISS: Meta’s library, fast similarity search, self-hosted

  • Neon (Postgres + pgvector): SQL + vector search in one DB

Retrieval mechanisms:

  • Semantic search: Pure vector similarity (cosine distance)

  • Hybrid search: Dense (embeddings) + sparse (BM25 keyword) fusion (see the sketch after this list)

  • MMR (Maximal Marginal Relevance): Maximize diversity, avoid duplicate answers

  • Reranking: Score candidates with cross-encoder for final relevance
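As a concrete example of the hybrid approach, reciprocal rank fusion (RRF) merges the dense and sparse result lists without any tuning. A minimal sketch; the doc IDs are whatever your indexes return, and k=60 is the commonly used default constant:

```python
# Reciprocal rank fusion: combine a dense (vector) ranking and a sparse (BM25)
# ranking into a single hybrid ranking. Higher fused score = better.
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# doc_7 ranks well in both lists, so it tops the fused ranking.
print(rrf_fuse(["doc_7", "doc_2", "doc_9"], ["doc_4", "doc_7", "doc_2"]))
```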

Real-World Use Cases for Agentic RAG

1. API/Specification Knowledge

Agent needs to call an internal API but doesn’t know the schema.

Without RAG: Guesses parameter names, gets 400 errors, retries blindly.
With RAG: Queries knowledge base for “/api/users schema” → retrieves OpenAPI spec → constructs correct request on first try.

Knowledge base contains: OpenAPI/Swagger specs, GraphQL schemas, REST endpoint docs, authentication flows.

2. Internal Business Processes

Agent automates customer onboarding but needs to follow company-specific workflows.

Without RAG: Follows generic onboarding (sends welcome email, done).
With RAG: Queries “enterprise onboarding checklist” → retrieves internal SOP → assigns account manager, provisions sandbox, schedules kickoff call, creates Slack channel.

Knowledge base contains: SOPs, runbooks, compliance policies, approval workflows, escalation paths.

3. Codebase Search

Agent needs to fix a bug but doesn’t know where the authentication logic lives.

Without RAG: Searches file names, opens random files, wastes 10 minutes.
With RAG: Queries “JWT token validation implementation” → retrieves auth/jwt.py:validate_token() → reads function, identifies bug, writes fix.

Knowledge base contains: Indexed codebase (functions, classes, docstrings), README files, architecture docs, git commit messages.

4. Customer Support

Agent handles live chat: “Can I upgrade my plan mid-billing cycle?”

Without RAG: “You can typically upgrade anytime, but billing may vary…”
With RAG: “Yes. Upgrades are prorated based on remaining days in your cycle. You’ll be charged the difference immediately. (Source: billing_faq.md, Upgrades section)”

Knowledge base contains: FAQ, help center articles, product docs, known issues, troubleshooting guides.

Agentic RAG: The 2025 Pattern

Monolithic RAG: one agent, one retrieval call, hope for the best.

Agentic RAG (2025 standard): multi-agent workflow with specialized roles.

Router Agent
Input: User query
Decision: Does this need retrieval? (Factual question = yes; creative task = no)
Output: Route to Retrieval Agent or skip to Generator

Retrieval Agent
Input: Query + knowledge base
Action: Embed query → semantic search → fetch top-k chunks
Output: 3-5 relevant chunks with metadata (source, section, score)

Validation Agent
Input: Query + retrieved chunks
Action: Check relevance, filter noise, detect contradictions
Output: Validated chunks (or “retrieval failed, escalate”)

Generator Agent
Input: Query + validated chunks + citation template
Action: Synthesize answer, cite sources
Output: “According to [source], the answer is X.”

Orchestrator
Coordinates the pipeline: Router → Retrieval → Validation → Generator
Logs: query, retrieval results, validation outcomes, final answer
Handles errors: retry retrieval, escalate to human, return “I don’t know”
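Wired together, the pipeline is a short control loop. A minimal sketch, with each agent represented by a stand-in callable (in practice each would be its own focused LLM call, or a LangGraph/CrewAI node):

```python
# Orchestrator sketch: route → retrieve → validate → generate, with an audit log.
# The four callables are hypothetical stand-ins for the specialized agents.
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str
    section: str
    text: str
    score: float

def answer(query, needs_retrieval, retrieve, validate, generate):
    log = {"query": query}                          # audit trail: query → chunks → decision → answer
    if not needs_retrieval(query):                  # Router: creative task → skip retrieval
        return generate(query, context=[]), log
    chunks = retrieve(query, top_k=5)               # Retrieval Agent: embed + search + top-k
    log["retrieved"] = [(c.source, round(c.score, 2)) for c in chunks]
    valid = validate(query, chunks)                 # Validation Agent: drop irrelevant/contradictory chunks
    if not valid:
        log["outcome"] = "retrieval failed"
        return "I don't know: no relevant sources found.", log
    log["outcome"] = "answered with citations"
    return generate(query, context=valid), log      # Generator Agent: synthesize + cite
```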

Why this beats monolithic RAG:

  • Focused prompts: Each agent has 1 job, 100-line prompt (not 1000)

  • Parallel execution: Retrieval + validation run concurrently

  • Error handling: Validation agent catches bad retrievals before generation

  • Audit trail: Each step logged (query → chunks → decision → answer)

  • Composability: Swap retrieval agent (e.g., upgrade to knowledge graph) without touching generator

Context Economics: Why RAG Wins

Token limits are the bottleneck. GPT-4 Turbo = 128k context, but:

  • Loading entire codebase: 500k tokens (doesn’t fit)

  • Loading entire API docs: 200k tokens (fits, but slow + expensive)

  • RAG: retrieve 3 chunks (1500 tokens), answer question (fast + cheap)

Cost comparison (1000 queries):

Full-context approach (load 50k tokens/query): 50,000 tokens × 1000 queries × $0.01/1k tokens = $500

RAG approach (retrieve 1.5k tokens/query): 1,500 tokens × 1000 queries × $0.01/1k tokens = $15
Embedding cost: 1000 queries × $0.0001 = $0.10
Total: $15.10 (97% cheaper)

Plus: RAG is faster (smaller context = faster inference) and more accurate (focused retrieval beats noisy full-context).

Accuracy improvement: 2025 surveys of agentic RAG report accuracy gains of 20-40% over pure LLM baselines on knowledge-intensive tasks (see the Agentic RAG Survey in the references).

Advanced Pattern: Knowledge Graphs + RAG (2025 Cutting Edge)

Vector search finds similar content. Knowledge graphs find related content.

Example query: “Who approved the Q3 budget?”

Vector-only RAG: Retrieves “Q3 budget document”
Knowledge graph + RAG: Retrieves “Q3 budget document” + follows relationship → “Approved by: Jane Doe, CFO, 2025-07-15”

Implementation:

  • Chunk documents → embed → store in vector DB (normal RAG)

  • Extract entities/relationships → store in graph DB (Neo4j, Amazon Neptune)

  • Query: vector search for relevant docs + graph traversal for relationships

  • Combine: “The Q3 budget (doc_id: 1247) was approved by Jane Doe (person_id: 89) on July 15, 2025.”
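A minimal sketch of the combined lookup, with an in-memory dict standing in for the graph store (a real system would hold these triples in Neo4j or Neptune) and the vector hit hard-coded:

```python
# Vector search finds the relevant document; the graph answers the relationship question.
# In production the triples are produced by an entity/relationship extraction step.
graph = {
    ("doc:q3_budget", "approved_by"): "Jane Doe (CFO)",
    ("doc:q3_budget", "approved_on"): "2025-07-15",
}

def relation(doc_id: str, rel: str) -> str | None:
    return graph.get((doc_id, rel))

top_hit = "doc:q3_budget"  # stand-in for the top vector-search result
print(f"Approved by {relation(top_hit, 'approved_by')} on {relation(top_hit, 'approved_on')}")
```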

Use cases: org charts, project dependencies, regulatory compliance (track who approved what when).

The Cite-Back Prompt Template

Without RAG (generic agent):
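An illustrative version (your actual system prompt will differ):

```text
You are a helpful assistant. Answer the user's question as accurately as you can.

Question: {question}
```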


With RAG (knowledge-grounded agent):
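An illustrative version. The key differences: the answer must come from the retrieved context, carry citations, and fall back to “I don’t know” when the context is silent:

```text
You are a knowledge-grounded assistant. Answer ONLY from the context below.

Context:
{retrieved_chunks}   <- each chunk includes source file and section metadata

Rules:
- Cite every claim as (Source: filename, section).
- If the context does not answer the question, reply "I don't know."

Question: {question}
```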


Implementation: Tools for 2025

Frameworks (agentic orchestration):

  • LangChain: Mature, large ecosystem, good for prototyping (https://python.langchain.com/)

  • LlamaIndex: RAG-first framework, strong indexing/querying abstractions (https://www.llamaindex.ai/)

  • LangGraph: Stateful multi-agent workflows, built on LangChain (https://langchain-ai.github.io/langgraph/)

  • CrewAI: Role-based multi-agent systems, good for agentic RAG (https://www.crewai.com/)

Vector databases:

  • Pinecone: Managed, production-grade, scales to billions (https://www.pinecone.io/)

  • Chroma: Open-source, embedded, easy local dev (https://www.trychroma.com/)

  • FAISS: Meta’s library, fast, self-hosted (https://github.com/facebookresearch/faiss)

  • Neon (pgvector): Postgres + vectors, SQL + semantic search (https://neon.tech/)

Embedding models:

  • OpenAI text-embedding-ada-002: $0.0001/1k tokens, 1536-dim, strong general-purpose

  • Cohere Embed v3: Multilingual, domain fine-tuning available

  • sentence-transformers: Open-source, self-hosted, free (https://www.sbert.net/)

Search infrastructure:

  • Azure AI Search: Hybrid search (vector + keyword), integrated RAG (https://learn.microsoft.com/en-us/azure/search/)

  • Elasticsearch: BM25 keyword search + vector plugin (https://www.elastic.co/)

  • OpenSearch: Open-source Elasticsearch fork, vector support (https://opensearch.org/)

Building Your First RAG System: 5-Step Checklist

1. Collect your knowledge

  • API docs (OpenAPI, Swagger, GraphQL schemas)

  • Internal processes (SOPs, runbooks, policies)

  • Codebase (index with docstrings, comments, READMEs)

  • Product docs (user guides, FAQs, troubleshooting)

Format: Markdown, plain text, or structured JSON (easiest to chunk).

2. Chunk intelligently

  • Start simple: fixed 512-token chunks with 50-token overlap

  • Add metadata: source file, section, author, date, version

  • Iterate: try semantic chunking (break on topic shifts) if fixed-size fails

Code example (LlamaIndex):
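A minimal sketch using LlamaIndex’s SentenceSplitter (import paths vary across llama_index versions):

```python
# Load markdown/text docs and split into 512-token chunks with 50-token overlap.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./knowledge").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)   # chunks, each carrying source metadata
print(f"{len(nodes)} chunks ready to embed")
```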


3. Embed and index

  • Choose embedding model (OpenAI ada-002 for ease, sentence-transformers for cost)

  • Choose vector DB (Chroma for local dev, Pinecone for production)

  • Index all chunks

Code example (LlamaIndex + Chroma):
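A minimal sketch that embeds the chunks from step 2 into a local Chroma collection (assumes the chromadb and llama-index-vector-stores-chroma packages, plus an OPENAI_API_KEY for the default embedding model):

```python
# Persist embeddings in a local Chroma collection via LlamaIndex.
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("knowledge_base")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# `nodes` is the chunk list produced in step 2.
index = VectorStoreIndex(nodes, storage_context=storage_context)   # embeds + stores every chunk
```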


4. Build the query pipeline

  • Embed user query

  • Retrieve top-k chunks (start with k=3-5)

  • Rerank by relevance (optional but recommended)

  • Return chunks + metadata

Code example:
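Continuing from the index built in step 3, a minimal sketch of the retrieval call:

```python
# Embed the query, pull the top-3 chunks, and surface score + source + text.
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What's the rate limit for the /api/users endpoint?")

for r in results:
    print(round(r.score or 0, 3), r.node.metadata.get("file_name"), r.node.get_content()[:80])
```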


5. Add citation to generation

  • Pass retrieved chunks to LLM with citation template

  • Instruct: “Cite sources in format (Source: filename, section)”

  • Validate: ensure response includes citations

Prompt template:
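An illustrative template (tune the wording to your domain):

```text
Answer the question using ONLY the context below.
After each claim, cite its source in the format (Source: filename, section).
If the context does not contain the answer, say "I don't know."

Context:
{retrieved_chunks}

Question: {question}
```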


Common Pitfalls (and Fixes)

Pitfall 1: Chunks too large → retrieval finds irrelevant sections.
Fix: Reduce chunk size to 256-512 tokens, increase overlap to 50-100 tokens.

Pitfall 2: Chunks too small → loses context (e.g., “it” refers to an unknown entity).
Fix: Use hierarchical chunking (parent-child): retrieve the child, but return the parent for context (sketched below).
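A minimal sketch of the parent-child idea, with plain dicts standing in for the child index and parent store:

```python
# Index small child chunks for precise matching, but return the larger parent
# section so references like "it" still resolve.
parents = {"api_docs#rate-limits": "Full text of the 'Rate Limits' section ..."}
children = [
    {"parent": "api_docs#rate-limits", "text": "/api/users: 1000 requests/hour per key"},
]

def retrieve_with_context(query: str) -> str:
    child = children[0]                 # stand-in for the vector search over child chunks
    return parents[child["parent"]]     # hand the LLM the parent for full context

print(retrieve_with_context("rate limit for /api/users"))
```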

Pitfall 3: Poor retrieval (top-k chunks not relevant).
Fix: Switch to hybrid search (vector + BM25), add metadata filters (e.g., “only API docs”), rerank with a cross-encoder.

Pitfall 4: Agent ignores retrieved chunks, hallucinates anyway.
Fix: Stronger prompt: “Answer ONLY from context. If no relevant context, say ‘I don’t know.’”

Pitfall 5: No citations (can’t verify answers).
Fix: Require citations in the prompt, validate output format, reject answers without sources.

Pitfall 6: Stale knowledge base (docs updated, index not refreshed).
Fix: Automate reindexing (daily/weekly cron), version chunks (track when indexed), add TTL metadata.

Business Impact: Why RAG Wins in 2025

  • 30-50% lower LLM costs (retrieve only what’s needed, not full corpus)

  • 20-40% accuracy improvement (grounded in facts, not guesses)

  • Zero retraining required (update knowledge base, not model weights)

  • Auditability (citations enable compliance, debugging, trust)

  • Scalability (add more docs without hitting context limits)

Gartner: RAG is foundational for 75% of enterprise AI by 2026. The teams that win build knowledge bases first, then connect agents to them.

Knowledge is a Tool

Your agent has tools for shell commands, web browsing, API calls. Now give it a tool for knowledge.

The agent decides: “I need grounded facts” → reaches for the knowledge base → cites sources → returns accurate answers.

Stop accepting hallucinations. Build a knowledge base. Give your agent RAG. Watch it become a specialist.

References:

  • Agentic RAG Survey (2025): https://arxiv.org/abs/2501.09136

  • RAG at the Crossroads (Mid-2025 Reflections): https://ragflow.io/blog/rag-at-the-crossroads-mid-2025-reflections-on-ai-evolution

  • NVIDIA RAG Agent Guide: https://developer.nvidia.com/blog/build-a-rag-agent-with-nvidia-nemotron

  • AWS RAG Overview: https://aws.amazon.com/what-is/retrieval-augmented-generation/

  • Azure RAG Architecture: https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview

  • Squirro State of RAG 2025: https://squirro.com/squirro-blog/state-of-rag-genai

  • Knowledge-Based Chatbots 2025 Guide: https://medium.com/@mr.vora212/knowledge-based-chatbots-in-2025-your-ultimate-guide-to-rag-agents-and-beyond-bdd35cffb16c

  • RAG Tools 2025 (Kanerika): https://kanerika.com/blogs/rag-tools/

  • Chunking Strategies for RAG: https://medium.com/aimpact-all-things-ai/data-chunking-strategies-for-rag-in-2025-acfec4707eaf

  • Xenoss Enterprise RAG Guide: https://xenoss.io/blog/enterprise-knowledge-base-llm-rag-architecture

Want the RAG pipeline diagram + cite-back prompt template? DM “KNOWLEDGE” for the full implementation guide (chunking strategy + vector DB setup + prompt templates) to build your first knowledge-grounded agent system.
