The RAG Problem Nobody Talks About#
By mid-2026, nearly every serious AI application uses some form of Retrieval-Augmented Generation. But if you’ve deployed RAG in production, you know the dirty secret: most RAG pipelines fail silently. They retrieve documents that look relevant but don’t actually answer the question. They hallucinate citations. They return stale data from indices that haven’t been updated.
The 2024-era “embed → retrieve → stuff into prompt” pipeline is dead. What’s replaced it is Agentic RAG — a paradigm where the retrieval process itself is driven by an AI agent that reasons about what information it needs, evaluates the quality of what it finds, and iterates until it has a trustworthy answer.
This article walks through building a production-grade Agentic RAG system using the tools available in 2026: MCP for tool orchestration, knowledge graphs for structured reasoning, and modern LLM APIs for self-correction loops.
1. Why Traditional RAG Falls Apart#
Consider a real-world scenario: a developer asks your AI assistant, “How do I configure rate limiting for the XiDao API Gateway when using the Claude 4.7 API with streaming responses?”
A traditional RAG pipeline would (see the code sketch after this list):
- Embed the query
- Perform vector similarity search against your documentation index
- Retrieve the top-k chunks
- Stuff them into the context window
- Generate an answer
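For contrast, here is roughly what that pipeline looks like in code. This is a minimal sketch assuming a chromadb collection and an OpenAI-compatible client; the gateway URL and model name mirror the examples later in this article and are illustrative:
# naive_rag.py: the 2024-era pipeline, for contrast (illustrative sketch)
import chromadb
from openai import OpenAI

client = OpenAI(base_url="https://api.xidao.online/v1")
collection = chromadb.PersistentClient(path="./embeddings_db").get_collection("documentation")

def naive_rag(query: str, top_k: int = 5) -> str:
    # One shot: embed the query, retrieve the top-k chunks by similarity
    results = collection.query(query_texts=[query], n_results=top_k)
    context = "\n\n".join(results["documents"][0])
    # Stuff the chunks into the prompt and generate; no evaluation, no retry
    response = client.chat.completions.create(
        model="claude-4.7-sonnet",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content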
The failure modes are predictable:
- Semantic mismatch: The embedding captures “rate limiting” but misses that the question is specifically about streaming + Claude 4.7 + XiDao Gateway — three intersecting concerns.
- Chunk fragmentation: The answer spans three different documentation pages. Each chunk individually scores below the similarity threshold.
- Stale retrieval: The index contains docs from 2025, but Claude 4.7’s streaming API has different rate limit headers.
The result: a confident, well-formatted, completely wrong answer.
2. The Agentic RAG Architecture#
Agentic RAG treats retrieval as a planning problem. Instead of a single retrieve-and-generate step, an agent:
- Decomposes the query into sub-questions
- Plans which retrieval tools to use (vector search, knowledge graph, web search, API docs lookup)
- Executes retrieval iteratively
- Evaluates the quality and completeness of retrieved information
- Self-corrects by generating new retrieval queries when gaps are found
- Synthesizes a final answer with verified citations
Here’s the high-level architecture:
┌────────────────────────────────────────────────────┐
│                     User Query                     │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│             Query Decomposition Agent              │
│          "Break this into sub-questions"           │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│              Retrieval Planning Agent              │
│        "Which tools for each sub-question?"        │
└────┬────────────┬──────────┬─────────────┬─────────┘
     ▼            ▼          ▼             ▼
┌──────────┐ ┌─────────┐ ┌────────┐ ┌──────────────┐
│  Vector  │ │Knowledge│ │  Web   │ │ Code Search  │
│  Search  │ │  Graph  │ │ Search │ │  (grep/ast)  │
└────┬─────┘ └────┬────┘ └───┬────┘ └──────┬───────┘
     └────────────┴───────┬──┴─────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│              Quality Evaluation Agent              │
│ "Is this sufficient? Any gaps or contradictions?"  │
└─────────────────────────┬──────────────────────────┘
                          ▼
                  ┌────────────────┐
                  │ Gap detected?  │
                  └───┬────────┬───┘
                Yes   ▼        ▼   No
                 ┌──────────┐ ┌──────────────────┐
                 │ Re-query │ │ Final Synthesis  │
                 └──────────┘ └──────────────────┘

3. Implementation: MCP-Powered Agentic RAG#
In 2026, the cleanest way to build this is using MCP (Model Context Protocol) to expose your retrieval tools, and letting an LLM agent orchestrate them.
3.1 Setting Up MCP Retrieval Tools#
First, define your retrieval tools as MCP servers:
# mcp_retrieval_server.py
from mcp.server.fastmcp import FastMCP
from mcp.types import TextContent
import chromadb
from neo4j import GraphDatabase

# FastMCP provides the @server.tool() decorator used below
server = FastMCP("agentic-rag-tools")
# Vector store client
chroma = chromadb.PersistentClient(path="./embeddings_db")
doc_collection = chroma.get_collection("documentation")
# Knowledge graph client
neo4j_driver = GraphDatabase.driver(
"bolt://localhost:7687",
auth=("neo4j", "password")
)
@server.tool()
async def vector_search(query: str, top_k: int = 5) -> list[TextContent]:
"""Semantic search across documentation embeddings.
Best for: finding conceptually related content, natural language queries."""
results = doc_collection.query(
query_texts=[query],
n_results=top_k,
include=["documents", "metadatas", "distances"]
)
formatted = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
formatted.append(TextContent(
type="text",
text=f"[Score: {1-dist:.3f}] {meta['source']}\n{doc}\n"
))
return formatted
@server.tool()
async def knowledge_graph_query(cypher: str) -> list[TextContent]:
"""Query the knowledge graph for structured relationships.
Best for: entity relationships, hierarchies, 'what connects X to Y' queries.
Uses Cypher query language against a Neo4j graph."""
with neo4j_driver.session() as session:
result = session.run(cypher)
records = [dict(r) for r in result]
return [TextContent(
type="text",
text=f"Graph query returned {len(records)} results:\n"
+ "\n".join(str(r) for r in records[:20])
)]
@server.tool()
async def hybrid_search(
    query: str,
    entity_filter: str = "",
    top_k: int = 5
) -> list[TextContent]:
    """Combined vector + graph search.
    Finds documents that mention the filter entity in the graph,
    then uses that context to boost and re-rank vector results."""
    # Step 1: Graph traversal: collect sources that mention the entity
    boosted: set[str] = set()
    if entity_filter:
        with neo4j_driver.session() as session:
            rows = session.run(
                "MATCH (e:Entity {name: $name})<-[:MENTIONS]-(d:Document) "
                "RETURN d.source AS source", name=entity_filter)
            boosted = {r["source"] for r in rows}
    # Step 2: Vector search, over-fetching so re-ranking has candidates
    results = doc_collection.query(query_texts=[query], n_results=top_k * 3,
                                   include=["documents", "metadatas", "distances"])
    # Step 3: Boost results whose source appears in the graph neighborhood
    scored = [((1 - dist) + (0.2 if meta["source"] in boosted else 0.0), meta["source"], doc)
              for doc, meta, dist in zip(results["documents"][0],
                                         results["metadatas"][0], results["distances"][0])]
    scored.sort(reverse=True)
    return [TextContent(type="text", text=f"[Score: {score:.3f}] {src}\n{doc}\n")
            for score, src, doc in scored[:top_k]]
@server.tool()
async def check_document_freshness(
    source_path: str
) -> list[TextContent]:
    """Check when a document was last updated.
    Falls back to filesystem mtime; swap in your document
    management system's API for real version info."""
    from datetime import datetime, timezone
    from pathlib import Path

    mtime = Path(source_path).stat().st_mtime
    updated = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    return [TextContent(
        type="text",
        text=f"{source_path} last updated: {updated}"
    )]

3.2 The Agentic Orchestrator#
Now build the agent that uses these tools. The key insight: use structured reasoning, not free-form generation.
# agentic_rag.py
import asyncio
import json
from dataclasses import dataclass

from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="https://api.xidao.online/v1")
RETRIEVAL_AGENT_SYSTEM_PROMPT = """You are a retrieval planning agent.
Given a user query, you must:
1. DECOMPOSE the query into specific sub-questions
2. For each sub-question, SELECT the best retrieval tool:
- vector_search: for conceptual/natural language queries
- knowledge_graph_query: for entity relationships, structured data
- hybrid_search: for complex queries needing both
- check_document_freshness: for time-sensitive information
3. GENERATE optimized retrieval queries (not just echoing the user's words)
4. EXECUTE tools and EVALUATE results
5. If results are insufficient, GENERATE new queries and retry (max 3 rounds)
Output format (JSON):
{
"sub_questions": ["..."],
"retrieval_plan": [
{"tool": "...", "query": "...", "rationale": "..."}
],
"evaluation": {
"sufficient": true/false,
"gaps": ["..."],
"next_action": "..."
}
}"""
@dataclass
class AgenticRAG:
max_iterations: int = 3
min_confidence: float = 0.7
async def query(self, user_query: str) -> str:
"""Execute the full agentic RAG pipeline."""
# Phase 1: Decompose and plan
decomposition = await self._decompose_query(user_query)
# Phase 2: Iterative retrieval
all_context = []
for iteration in range(self.max_iterations):
# Execute retrieval plan
results = await self._execute_retrieval(
decomposition["retrieval_plan"]
)
all_context.extend(results)
# Phase 3: Evaluate quality
evaluation = await self._evaluate_results(
user_query, all_context
)
if evaluation["sufficient"]:
break
# Phase 4: Self-correct — generate new queries for gaps
decomposition["retrieval_plan"] = (
await self._generate_correction_queries(
evaluation["gaps"]
)
)
# Phase 5: Synthesize final answer
return await self._synthesize(user_query, all_context)
async def _decompose_query(self, query: str) -> dict:
response = await client.chat.completions.create(
model="claude-4.7-sonnet",
messages=[
{"role": "system", "content": RETRIEVAL_AGENT_SYSTEM_PROMPT},
{"role": "user", "content": f"Decompose and plan retrieval for: {query}"}
],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)
async def _evaluate_results(
self, query: str, context: list
) -> dict:
"""Use an LLM to judge retrieval quality."""
eval_prompt = f"""Evaluate if the retrieved context is sufficient
to answer this query comprehensively and accurately.
Query: {query}
Retrieved Context:
{self._format_context(context)}
Rate:
1. Coverage: Does the context address ALL aspects of the query?
2. Relevance: Is the context actually about what was asked?
3. Freshness: Is the information current (2026)?
4. Consistency: Are there contradictions?
Output JSON: {{"sufficient": bool, "confidence": float, "gaps": [...]}}"""
response = await client.chat.completions.create(
model="claude-4.7-sonnet",
messages=[{"role": "user", "content": eval_prompt}],
response_format={"type": "json_object"}
)
        return json.loads(response.choices[0].message.content)

    # _execute_retrieval, _generate_correction_queries, _synthesize, and
    # _format_context follow the same patterns as the methods above and
    # are omitted for brevity.

4. Knowledge Graphs: The Missing Layer#
Vector search alone isn’t enough. In 2026, the most effective RAG systems combine vectors with knowledge graphs for structured reasoning.
4.1 Building Your Knowledge Graph#
Extract entities and relationships from your documents:
# kg_builder.py
import spacy
from neo4j import GraphDatabase

nlp = spacy.load("en_core_web_trf")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_entities(text: str) -> list[dict]:
    """Named-entity extraction via spaCy (an LLM pass gives richer types)."""
    return [{"name": ent.text, "type": ent.label_} for ent in nlp(text).ents]

def extract_relations(text: str, entities: list[dict]) -> list[dict]:
    """Relation extraction stub: in production, prompt an LLM to emit
    {source, relation, target, confidence} dicts for the given entities."""
    return []
def build_knowledge_graph(documents: list[dict]):
"""Extract entities and relationships from documents,
then store them in Neo4j."""
with driver.session() as session:
for doc in documents:
text = doc["content"]
source = doc["source"]
# Use spaCy NER + LLM-enhanced relation extraction
entities = extract_entities(text)
relations = extract_relations(text, entities)
# Create document node
session.run("""
MERGE (d:Document {source: $source})
SET d.updated = datetime()
""", source=source)
# Create entity nodes and relationships
for entity in entities:
session.run("""
MERGE (e:Entity {name: $name, type: $type})
WITH e
MATCH (d:Document {source: $source})
MERGE (d)-[:MENTIONS]->(e)
""", name=entity["name"],
type=entity["type"],
source=source)
for rel in relations:
session.run("""
MATCH (a:Entity {name: $source_name})
MATCH (b:Entity {name: $target_name})
MERGE (a)-[r:RELATES_TO {type: $rel_type}]->(b)
SET r.confidence = $confidence
""", source_name=rel["source"],
target_name=rel["target"],
rel_type=rel["relation"],
                    confidence=rel["confidence"])

4.2 Graph-Enhanced Retrieval#
When a query arrives, use the graph to find context that vector search would miss:
def graph_enhanced_retrieval(query_entities: list[str]) -> list[dict]:
"""Given entities found in a query, traverse the graph
to find related context."""
with driver.session() as session:
# Find 2-hop neighbors of query entities
result = session.run("""
UNWIND $entities AS entity_name
MATCH (e:Entity {name: entity_name})
MATCH (e)-[:RELATES_TO*1..2]-(related:Entity)
MATCH (related)<-[:MENTIONS]-(d:Document)
RETURN DISTINCT d.source AS source,
collect(DISTINCT related.name) AS related_entities,
count(*) AS relevance
ORDER BY relevance DESC
LIMIT 10
""", entities=query_entities)
        return [dict(r) for r in result]

5. Production Considerations#
5.1 API Gateway Integration#
In production, your Agentic RAG system sits behind an API gateway. The gateway handles:
- Rate limiting: Per-user, per-token limits for retrieval operations
- Caching: Cache frequent retrieval results at the gateway layer
- Fallback routing: If the primary LLM is slow, route evaluation queries to a faster model
- Observability: Track retrieval latency, hit rates, and correction loop counts
# api-gateway config (XiDao Gateway example)
routes:
  - path: /api/rag/query
    methods: [POST]
    plugins:
      - rate_limiting:
          second: 10
          policy: token_bucket
      - response_cache:
          ttl: 300
          vary_by:
            - body.query_hash
      - observability:
          trace_retrieval: true
          log_corrections: true
    upstream:
      targets:
        - url: http://rag-service:8080
          weight: 90
        - url: http://rag-fallback:8080
          weight: 10

5.2 Monitoring Self-Correction Loops#
The self-correction loop is where Agentic RAG shines — but also where costs can spiral. Track these metrics:
# metrics.py: key metrics to monitor (prometheus_client assumed here)
from prometheus_client import Counter, Histogram

metrics = {
    "retrieval_iterations": Histogram(
        "rag_retrieval_iterations",
        "Number of retrieval rounds before convergence",
        buckets=[1, 2, 3, 5, 8]
    ),
    "correction_rate": Counter(
        "rag_corrections_total",
        "Number of self-corrections triggered"
    ),
    "confidence_distribution": Histogram(
        "rag_confidence_score",
        "Final confidence score distribution",
        buckets=[0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95]
    ),
    "tool_usage": Counter(
        "rag_tool_invocations_total",
        "Retrieval tool usage",
        ["tool_name", "iteration"]
    )
}

5.3 Cost Optimization#
Agentic RAG uses significantly more tokens than traditional RAG. Strategies to manage cost (a caching sketch follows the table):
| Strategy | Token Savings | Trade-off |
|---|---|---|
| Use fast models for evaluation | ~40% | Slightly lower evaluation quality |
| Cache retrieval results | ~60% | Potential staleness |
| Limit max iterations to 3 | ~25% | May miss edge cases |
| Use embeddings for initial filtering | ~30% | Additional infra complexity |
| Batch sub-question queries | ~20% | Increased latency |
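To make the caching row concrete, here is a minimal sketch of a TTL cache keyed on a hash of the tool name and query. It assumes a single-process deployment; swap the dict for Redis when running multiple instances. All names here are illustrative:
# retrieval_cache.py: minimal TTL cache for retrieval results (illustrative)
import hashlib
import time

_cache: dict[str, tuple[float, list]] = {}
TTL_SECONDS = 300  # matches the gateway response_cache ttl above

def _cache_key(tool: str, query: str) -> str:
    return hashlib.sha256(f"{tool}:{query}".encode()).hexdigest()

def cached_retrieve(tool: str, query: str, retrieve_fn) -> list:
    """Return cached results while fresh; otherwise retrieve and store."""
    key = _cache_key(tool, query)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # trade-off: possible staleness inside the TTL window
    results = retrieve_fn(query)
    _cache[key] = (time.time(), results)
    return results
This complements the gateway-level response_cache above: the gateway deduplicates identical user queries, while an in-process cache deduplicates repeated sub-question retrievals inside a correction loop.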
6. Complete Working Example#
Here’s a minimal but complete Agentic RAG system you can run today:
# complete_example.py
"""
Agentic RAG with MCP — 2026 Edition
Requires: pip install openai mcp neo4j chromadb
"""
import asyncio
import json
from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="https://api.xidao.online/v1",
api_key="your-api-key"
)
async def agentic_rag(user_query: str) -> str:
"""Complete agentic RAG pipeline."""
messages = [
{"role": "system", "content": """You are an Agentic RAG system.
You have access to these tools:
- vector_search(query, top_k): Search document embeddings
- knowledge_graph_query(cypher): Query the knowledge graph
Process:
1. Break the query into sub-questions
2. For each, choose the best tool and search query
3. Evaluate if you have enough info
4. If not, search again with refined queries
5. Synthesize a comprehensive answer with citations
Be thorough. Always verify facts. Cite sources."""},
{"role": "user", "content": user_query}
]
# Define MCP-style tools
tools = [
{
"type": "function",
"function": {
"name": "vector_search",
"description": "Semantic search across documentation",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"top_k": {"type": "integer", "default": 5}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "knowledge_graph_query",
"description": "Query knowledge graph with Cypher",
"parameters": {
"type": "object",
"properties": {
"cypher": {"type": "string"}
},
"required": ["cypher"]
}
}
}
]
# Agentic loop with max 5 tool-use rounds
for round_num in range(5):
response = await client.chat.completions.create(
model="claude-4.7-sonnet",
messages=messages,
tools=tools,
tool_choice="auto"
)
assistant_msg = response.choices[0].message
messages.append(assistant_msg)
# If no tool calls, we have our final answer
if not assistant_msg.tool_calls:
return assistant_msg.content
# Execute tool calls
for tool_call in assistant_msg.tool_calls:
result = await execute_tool(
tool_call.function.name,
json.loads(tool_call.function.arguments)
)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
    # Fallback: synthesize from what we have
    final = await client.chat.completions.create(
        model="claude-4.7-sonnet",
        messages=messages + [
            {"role": "user", "content":
                "Please synthesize your best answer from the information gathered."}
        ]
    )
    return final.choices[0].message.content
async def execute_tool(name: str, args: dict) -> dict:
"""Execute a retrieval tool and return results."""
if name == "vector_search":
# Call your vector store
return {"results": ["..."]}
    elif name == "knowledge_graph_query":
        # Call Neo4j
        return {"results": ["..."]}
    return {"error": f"unknown tool: {name}"}
# Run it
if __name__ == "__main__":
answer = asyncio.run(agentic_rag(
"How does XiDao API Gateway handle streaming "
"rate limits for Claude 4.7?"
))
    print(answer)

7. What's Next: The Convergence of RAG, Agents, and MCP#
The trajectory is clear. By late 2026, the boundaries between RAG systems, AI agents, and tool orchestration protocols will blur completely. The pattern we’re seeing:
- RAG becomes agentic: Static pipelines give way to reasoning-driven retrieval
- Agents become RAG-aware: Every agent call implicitly includes retrieval
- MCP unifies the tool layer: A single protocol for retrieval, execution, and communication
- Knowledge graphs become standard: Not replacing vectors, but complementing them
The teams building this convergence today will have a significant advantage when the next generation of AI-native applications arrives.
Key Takeaways#
- Traditional RAG fails silently — confident but wrong answers are the norm, not the exception
- Agentic RAG treats retrieval as planning — decompose, retrieve, evaluate, self-correct, synthesize
- Knowledge graphs add structured reasoning — vectors find similarities, graphs find relationships
- MCP provides the tool layer — one protocol for all retrieval and tool operations
- Monitor correction loops — this is where quality improves but costs can escalate
- Start simple, iterate — a 2-round agentic RAG beats a 5-round one that’s never shipped
Build the pipeline. Ship it. Measure it. Improve it. That’s the 2026 way.