TL;DR: Semantic RAG assumes the query embedding lands near the answer embedding. For multi-step questions — comparisons, computations, cross-document analysis — that assumption fails. Here are five architectural patterns that fix this: embrace agents over pipelines, separate storage by data type, route deterministic operations to deterministic tools, show your work, and build systems that know when they don't know.


Semantic RAG rests on a core assumption: the vector embedding of your query will be close to the vector embedding of the answer. For simple questions — "what does this document say about X?" — that holds.

"Compare NVIDIA and AMD's gross margins for FY2024" isn't answered by any single passage. It requires retrieving NVIDIA's financials, retrieving AMD's financials separately, extracting specific numbers from each, computing margins, and comparing them. That's five steps. No single embedding will land near all of them.

This is the fundamental limitation of semantic retrieval: it maps one query to one region of vector space. Multi-step questions need to be decomposed into sub-queries, each retrieving from a different source, then combined with deterministic computation.
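To make the decomposition concrete, here is a sketch of what an explicit plan for the margin question could look like. The step structure and field names are hypothetical, not a real API:

```python
# Hypothetical plan structure: each step is either a focused retrieval
# (with its own embedding) or a deterministic computation.
plan = [
    {"id": "nvda_fin", "action": "retrieve", "query": "NVIDIA FY2024 revenue and cost of revenue"},
    {"id": "amd_fin", "action": "retrieve", "query": "AMD FY2024 revenue and cost of revenue"},
    {"id": "nvda_margin", "action": "compute", "inputs": ["nvda_fin"]},
    {"id": "amd_margin", "action": "compute", "inputs": ["amd_fin"]},
    {"id": "answer", "action": "compare", "inputs": ["nvda_margin", "amd_margin"]},
]

# Two separate retrievals, each landing in its own region of vector space.
retrieval_steps = [s for s in plan if s["action"] == "retrieve"]
```

The point is structural: one question becomes two narrow retrievals plus deterministic computation, and no single embedding has to cover all of it.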

This isn't a finance problem. A doctor comparing lab results across visits hits the same wall. So does an engineer pulling specs from four datasheets, or a developer tracing a bug across services. Questions requiring multiple lookups and computation break semantic retrieval.

Pattern 1: Embrace Agents

A standard RAG pipeline doesn't distinguish between "what does this policy cover?" and "compare three vendors on cost, lead time, and failure rate." Both get embedded, both retrieve chunks, both generate. The pipeline has no notion that one question is harder than the other.

That second question needs the system to figure out what's being asked, pull data from different places, and combine results. Pipelines can't do that. Agents can.

What works better is a stateful agent — one that plans before it acts:

Understand the query and plan. Before touching any data, the agent reasons about what the question requires. "Which vendor meets our specs for cost, lead time, and tensile strength?" isn't one retrieval — it's three lookups across different datasheets, followed by a cross-check. The agent builds a plan: which tools to call, with what inputs, in what order.

Gather the required context. The agent executes the plan — querying databases, searching vector stores, calling APIs — routing each sub-question to the right source. Independent steps run in parallel.

Synthesize the answer. The agent assembles results from structured intermediate outputs, not from a single context-stuffed prompt.

from your_favorite_framework import Agent, tool

# 1. Define deterministic tools for the agent to use
@tool
def sql_query(company_name: str) -> dict:
    """Fetch operating income and revenue for a given company."""
    # Production: execute actual SQL against your structured DB
    pass

@tool
def calculate_margin(operating_income: float, revenue: float) -> float:
    """Calculate operating margin. Always use this instead of LLM math."""
    return (operating_income / revenue) * 100

# 2. Initialize the stateful agent with its toolkit
financial_agent = Agent(
    name="FinanceRAG_Router",
    tools=[sql_query, calculate_margin],
    instructions="Plan your steps before acting. Gather data first, then compute."
)

# 3. The agent autonomously maps the query to the right tool sequence
response = financial_agent.run("How does NVIDIA's operating margin compare to AMD's?")

A user reports "requests to /api/checkout are failing intermittently." A pipeline embeds that sentence and retrieves similar log entries. An agent plans: identify the faulty endpoint from the error tracker, pull the trace for a specific failing request, retrieve the relevant logs for that span, cross-reference with recent deployments. Four steps, each informed by the previous.

Pattern 2: Separate Storage by Data Type

Most RAG tutorials chunk everything and embed it. That works until the data has structure. Embeddings destroy structure.

Ask a support system, “How many P0 incidents did we have last month?” The answer is a table row. Embed that table and you lose counting, filtering, and aggregation—the operations the question requires. The embedding for “12 P0 incidents in February” sits near “11 P0 incidents in January.” Close enough to retrieve. Not close enough to be correct.

Storage should match how the data is queried. Data that requires filtering, sorting, or aggregation belongs in structured stores where those operations are exact. Semantic lookup belongs in vector indexes. Tables and time series carry rows, columns, and types; flattening them into text removes the structure required for precise answers. Narrative content—documentation, commentary, explanations—is already unstructured and fits embeddings well.

Agents also need routing metadata: which entity, time period, and methodology each source covers. Without a registry of this metadata, the system cannot narrow its search before retrieval begins.
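A sketch of such a registry, with hypothetical source names and fields:

```python
# Hypothetical source registry: names and fields are illustrative.
SOURCE_REGISTRY = [
    {"source": "sec_filings_db", "entity": "NVIDIA", "period": "FY2024", "methodology": "GAAP"},
    {"source": "sec_filings_db", "entity": "AMD", "period": "FY2024", "methodology": "GAAP"},
    {"source": "earnings_calls_index", "entity": "NVIDIA", "period": "FY2024", "methodology": "non-GAAP"},
]

def candidate_sources(entity: str, period: str, methodology: str) -> list[str]:
    """Narrow the search space before any retrieval runs."""
    return [r["source"] for r in SOURCE_REGISTRY
            if r["entity"] == entity
            and r["period"] == period
            and r["methodology"] == methodology]
```

The lookup runs before any embedding or SQL, so an agent asking about NVIDIA's GAAP figures never wastes a retrieval on a non-GAAP transcript index.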

Different storage systems serve different query patterns:

| Data Type | Best Storage | Strength | Example Question |
| --- | --- | --- | --- |
| Unstructured text (docs, reports, tickets) | Vector database (ChromaDB, Pinecone, Weaviate) | Semantic similarity search | “Find documentation about authentication failures.” |
| Structured data (metrics, tables, time series) | SQL database (Postgres, SQLite) | Exact filtering, aggregation, sorting | “How many P0 incidents occurred last month?” |
| Relationships between entities | Graph database (Neo4j, Neptune) | Relationship traversal | “How is service A connected to service B?” |
| Semi-structured records (JSON, events) | Document store (MongoDB) | Flexible schema queries | “Show deployments affecting service X.” |
| Logs and event streams | Search engine (Elasticsearch) | Full-text search with filters | “Show errors from checkout service yesterday.” |

The architectural rule is simple: store data according to the operations required to answer questions about it.
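In code, that rule reduces to a routing table from required operation to store. This is a minimal sketch with illustrative names, not a production router:

```python
# Map the operation a question requires to the store that supports it.
ROUTES = {
    "semantic_lookup": "vector_db",
    "aggregation": "sql_db",
    "relationship_traversal": "graph_db",
    "flexible_schema": "document_store",
    "full_text_filter": "search_engine",
}

def route(required_operation: str) -> str:
    """Pick a store by operation; unknown operations fail loudly."""
    try:
        return ROUTES[required_operation]
    except KeyError:
        raise ValueError(f"No store registered for operation: {required_operation}")
```

The deciding input is the operation, not the topic: a question about incidents routes to SQL if it needs counting and to the vector store if it needs semantic lookup.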

Pattern 3: Route Deterministic Operations to Deterministic Tools

LLMs are probabilistic. They're great at deciding what to do — terrible at doing things that have exactly one right answer. For those operations, hand off to a deterministic tool.

Ask an LLM to compute (26,974 − 16,673) / 26,974 × 100 and it will get it wrong often enough to matter. A calculator that parses the expression into an AST and evaluates only whitelisted operators gets it right every time:

import ast
import operator

SAFE_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str) -> float:
    tree = ast.parse(expression, mode='eval')
    return _eval_node(tree.body)

def _eval_node(node: ast.expr) -> float:
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPERATORS:
        return SAFE_OPERATORS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    raise ValueError("Unsupported expression")

Math is the obvious case, but the same logic applies to date arithmetic, unit conversions, and data validation. Anywhere there's one right answer, keep the LLM out of the calculation.
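Date arithmetic and unit conversions follow the same pattern: exact operations behind a tool boundary. A small sketch (the function names are my own, not from any framework):

```python
from datetime import date

def days_between(start: str, end: str) -> int:
    """Exact day count between two ISO dates -- no LLM estimation."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

MM_PER_INCH = 25.4  # defined exactly by the international inch

def inches_to_mm(inches: float) -> float:
    """Deterministic unit conversion with a fixed factor."""
    return inches * MM_PER_INCH
```

An LLM asked "how many days between January 1 and March 1, 2024?" may or may not remember the leap day; `days_between` cannot forget it.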

Pattern 4: Show Your Work

Every intermediate result should carry provenance — where the data came from and how it was transformed. Not as a debugging afterthought. As a first-class output.

Expression: {total_cost} / {units_shipped}
Bindings:
  {total_cost} = 142,500
    → sql_query on shipments table, logistics_q1.xlsx
  {units_shipped} = 3,200
    → sql_query on shipments table, logistics_q1.xlsx
Result: 44.53 per unit

For retrievals, citations carry location — not "Source: report.pdf" but "report.pdf, page 12, table 'Shipment Summary', row 'March'."
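One way to carry provenance is to make it part of the value itself, so no number can travel without its origin. A sketch with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class ProvenancedValue:
    """A value that carries where it came from and how it was transformed."""
    value: float
    source_file: str
    locator: str                      # page, table, row -- not just a filename
    transformations: list[str] = field(default_factory=list)

units = ProvenancedValue(
    value=3200,
    source_file="logistics_q1.xlsx",
    locator="table 'Shipments', column 'Units', Q1 total",
    transformations=["sql_query: SUM(units) WHERE quarter = 'Q1'"],
)
```

Downstream tools accept and return `ProvenancedValue`s, so the final transcript is assembled from data that was never separated from its citations.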

Transcripts let you audit failures mechanically instead of re-running the pipeline and guessing. Any domain where decisions depend on accuracy needs this. Provenance turns a prototype into a system people trust.

Pattern 5: Build Systems That Say "I Don't Know"

Most AI systems optimize for always providing an answer. In domains where wrong answers have consequences, a precise refusal is more valuable than a confident wrong output.

The system should produce structured refusals: a diagnosis of what went wrong, what was found, and what the user could try instead.

Compare metrics computed with different methodologies? "Methodology mismatch — these values aren't directly comparable. Try querying each metric separately." Trend with one data point? "Insufficient data — only Q1 found. Need at least two periods."

Place a validation layer between retrieval and computation. After data is gathered but before any calculation, check: same units? same methodology? same time granularity? same scope? A mismatch triggers a refusal. No hallucinated answer reaches the user.
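A minimal sketch of that validation layer, with illustrative field names:

```python
def validate_comparable(a: dict, b: dict) -> list[str]:
    """Check two retrieved metrics for comparability before any computation."""
    problems = []
    checks = [("unit", "unit"), ("methodology", "methodology"),
              ("granularity", "time granularity"), ("scope", "scope")]
    for key, label in checks:
        if a.get(key) != b.get(key):
            problems.append(f"{label} mismatch: {a.get(key)!r} vs {b.get(key)!r}")
    return problems

gaap = {"unit": "USD", "methodology": "GAAP", "granularity": "quarterly", "scope": "consolidated"}
non_gaap = {**gaap, "methodology": "non-GAAP"}
issues = validate_comparable(gaap, non_gaap)
# issues is non-empty, so the agent refuses instead of computing
```

The returned list doubles as the body of the structured refusal: each entry names exactly what made the comparison unsafe.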

A clinical system should refuse to compare lab results from different assays. An engineering system should refuse to combine specs tested under different conditions. A legal system should flag when comparing statutes across jurisdictions with different definitions. The ability to say "I can't reliably answer this" is what makes "yes" trustworthy.

These Patterns in Practice

I built FinanceRAG to test these ideas against real SEC filings. A standard RAG pipeline, given "compare NVIDIA and AMD's gross margins," embeds the sentence, retrieves nearby passages, and lets the LLM figure it out. The result is often plausible but wrong — numbers from different fiscal periods, GAAP mixed with non-GAAP, or margins the model computed in its head.

With FinanceRAG, the agent breaks this into independent steps — some are retrievals from the right data store, some are arithmetic. Each step executes, every number gets traced back to its source page, and the system checks whether the figures are actually comparable before combining them. If they aren't (different accounting standards, different fiscal periods), it refuses and tells you why instead of returning a bad number. The system is open source; the patterns work outside finance.

Why This Matters Now

RAG is moving from prototypes to production, and production has different standards. Users expect exact numbers, not approximate ones. Regulated industries need audit trails. Teams need to debug failures without re-running the whole pipeline and guessing.

Most RAG systems aren't failing because the models are bad. They're failing because the architecture around the model isn't doing its job. These five patterns close that gap.