I'll tell you the moment I knew our RAG implementation was in trouble.

A product manager asked our internal knowledge assistant: "What's our refund policy for enterprise customers?" The system retrieved three chunks from three different documents. One was from 2022. One was from 2024. One was a draft that never got approved. It combined all three into a confident, well-formatted answer that was wrong in ways that would have cost us money if anyone had acted on it.

The retrieval worked perfectly. The generation worked perfectly. The answer was still wrong. That's the thing about RAG that nobody talks about at the demo stage.

I spent four months building, breaking, and rebuilding a retrieval-augmented generation system for our analytics and BI team. Not a customer support chatbot. Not a document search tool. A system that was supposed to help analysts and stakeholders get answers about our data: what metrics mean, where data comes from, which dashboards to trust, and what our governance policies actually say.

Here's what I learned that the tutorials didn't teach me.

Why RAG for Analytics

Our team had a problem that I suspect most data teams share. Institutional knowledge was scattered across Confluence pages that hadn't been updated in two years, Slack threads that disappeared after 90 days, dbt model descriptions that were sometimes accurate, data dictionaries that were always incomplete, and the heads of three senior analysts who had been around long enough to know where the bodies were buried.

When a new analyst joined, it took them roughly three months to become productive. Not because the tools were hard, but because understanding what the data meant required absorbing context that lived in dozens of disconnected places.

RAG seemed like the obvious solution. Take all that scattered knowledge, embed it, and let people ask questions in natural language. "What does the churn metric include?" "Which table is the source of truth for revenue?" "When did we change the attribution model?"

It seemed simple enough. It wasn't.

The Architecture

The initial setup was textbook. Ingest documents from Confluence, Google Docs, and our dbt documentation. Chunk them into 512-token segments using recursive character splitting. Embed them with a standard embedding model. Store the vectors in a vector database. When a user asks a question, embed the query, retrieve the top 5 most similar chunks, stuff them into a prompt, and let the LLM generate an answer.
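The chunking step above can be sketched in a few lines. This is a minimal illustration of recursive splitting, not our production code: it measures chunks in characters rather than tokens for simplicity, and the separator hierarchy (paragraphs, then lines, then sentences, then words) is an assumption about how such a splitter is typically configured.

```python
def recursive_split(text, max_len=2048, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters,
    preferring to break on the coarsest separator that works."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # this separator doesn't occur; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Recurse on any piece that is still too long.
        return [c for chunk in chunks
                for c in recursive_split(chunk, max_len, separators)]
    # No separator worked: hard cut at the limit.
    return [text[:max_len]] + recursive_split(text[max_len:], max_len, separators)
```

Note what this sketch already hints at: the splitter sees only text, so anything outside the chunk (the document title, the effective date, the approval status) is gone by the time the chunk is embedded. That detail matters later.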

I had a working prototype in three days. It answered questions about our data stack. It cited sources. It felt magical.

Then I connected it to our actual knowledge base.

What Broke at Scale

The 14,000 Document Problem

Our Confluence space had 14,000 pages. Our dbt project had 800 models with descriptions. We had two years of data governance meeting notes, four versions of our data dictionary, and an unknowable number of Google Docs that people had shared in Slack channels and never organized.

When the vector database held 200 documents from our curated test set, retrieval was sharp. The right chunks surfaced for the right questions. When I loaded the full corpus, retrieval quality degraded in ways I could feel but initially couldn't measure.

A query about "revenue definition" that previously returned our canonical metric documentation now returned a mix of meeting notes where someone mentioned revenue, a slide deck from 2021 with an outdated formula, and a Jira ticket where an engineer complained about revenue calculations being wrong. The relevant document was still in the database. It just wasn't in the top 5 results anymore.

This is recall drift. As the corpus grows, the embedding space gets crowded. Documents that are semantically adjacent but contextually irrelevant start competing with the documents you actually need. The cosine similarity scores all look fine. The results are all plausible. But plausible isn't correct.

The Temporal Blindness Problem

This was the one that scared me most, because it's the one most directly connected to governance.

Our data governance policies had changed three times in two years. The refund backdating rule changed in Q2 2024. The PII classification standard was updated in Q4 2024. The attribution model switched from last-touch to multi-touch in Q1 2025.

The RAG system had no concept of time. Every chunk was equally valid. When asked "how do we handle refund attribution?" it retrieved chunks from all three policy versions and blended them into a confidently incoherent answer. Part of the answer reflected the 2023 policy. Part reflected the 2025 policy. The seams were invisible.

A human reading those documents would notice the dates, understand that the newer document supersedes the older one, and apply the current policy. The RAG system couldn't do this because the chunks had been stripped of their temporal context during the chunking process. The 512-token window didn't include the document header that said: "Effective January 2025, this policy replaces..."

This is the "lost in the middle" problem applied to governance, and it's worse than the standard version. In customer support, returning an outdated answer is embarrassing. In data governance, returning an outdated policy as if it were current is a compliance risk.

The Authority Problem

Not all documents are equal. Our canonical data dictionary should outweigh a random Confluence page where someone brainstormed metric definitions. A formally approved governance policy should outweigh meeting notes where someone proposed a policy change that was never adopted.

Vector similarity doesn't encode authority. The embedding model doesn't know that one document was approved by the head of data governance and another was a draft in someone's personal folder. It knows they're both about "revenue definitions" and they're both semantically close to the query.

I found the system citing unapproved drafts as if they were policy. Twice. Both times, the draft was more recently written than the approved document (because drafts tend to be newer) and used more specific language (because drafts tend to be more detailed). By every signal the retrieval system could measure, the draft was the better match. By every signal that matters for governance, it was the worst one.

What I Fixed

Contextual Prefixing

The single highest-impact change was prepending context to every chunk before embedding. Instead of embedding a raw chunk that said "Returns are accepted within 30 days for full refund," I prepended metadata: "Source: Enterprise Customer Policy v3.2, approved January 2025, owner: Legal/Finance. This chunk is from Section 4: Returns and Refunds."

This cost more to process (each chunk required an LLM call to generate the prefix) and increased storage by roughly 30%. It also dramatically improved retrieval relevance. The system stopped confusing the 2023 policy with the 2025 policy because the embeddings now encoded the temporal and authority context, not just the semantic content.
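In our system the prefix was generated by an LLM call, but the deterministic parts come straight from document metadata. Here's a minimal sketch of the template side of that idea; the field names are illustrative, not a fixed schema.

```python
def contextualize(chunk, meta):
    """Prepend document-level context so the embedding encodes
    authority and recency, not just the chunk's own words.
    Field names are illustrative, not a fixed schema."""
    prefix = (
        f"Source: {meta['title']} {meta['version']}, "
        f"{meta['status']} {meta['effective']}, owner: {meta['owner']}. "
        f"This chunk is from {meta['section']}."
    )
    return f"{prefix}\n{chunk}"

meta = {
    "title": "Enterprise Customer Policy", "version": "v3.2",
    "status": "approved", "effective": "January 2025",
    "owner": "Legal/Finance", "section": "Section 4: Returns and Refunds",
}
text = contextualize("Returns are accepted within 30 days for full refund.", meta)
# The contextualized text, not the raw chunk, is what gets embedded:
# vector = embed(text)
```

The key point is that the prefix travels with the chunk into the embedding space, so two chunks with identical wording but different effective dates no longer land on top of each other.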

Hybrid Search

Pure vector search was failing on exact terminology. When someone asked "what's the SLA for the customer_transactions table?" the vector search returned chunks about SLAs in general, data quality in general, and customer data in general. It missed the specific document that mentioned customer_transactions by name because the embedding didn't prioritize exact string matches.

Adding BM25 keyword search alongside vector search fixed this immediately. The hybrid approach uses vector search for conceptual matching (finding documents about data freshness when someone asks about "stale data") and keyword search for precise lookups (finding the exact table name, metric name, or policy reference).
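One common way to merge the two result lists is reciprocal rank fusion, where a document scores well if it ranks high in either list. I'm not claiming this exact formula is the only option (weighted score blending works too), but it's a reasonable sketch of the combination step; the document IDs below are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine ranked result lists (e.g. one from vector search, one
    from BM25) into a single ranking. A document's score is the sum
    of 1/(k + rank) over the lists it appears in; k dampens the
    outsized influence of the very top ranks."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["sla_overview", "data_quality_guide", "customer_transactions_doc"]
keyword_hits = ["customer_transactions_doc", "jira_ticket_4711"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# customer_transactions_doc ranks first: it's the only document
# both retrievers agreed on.
```

The exact-match document that vector search buried at rank 3 wins the fused ranking because the keyword side also surfaced it.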

Document Authority Scoring

I added a metadata field to every document: authority_level. Approved policies got a score of 1.0. Official documentation got 0.8. Meeting notes got 0.4. Drafts got 0.2.

During retrieval, the similarity score is multiplied by the authority score before ranking. A draft that's slightly more semantically relevant than an approved policy will still rank lower. This isn't a perfect solution. It requires someone to maintain the authority tags. But it's better than treating every document as equally trustworthy, which is what every default RAG implementation does.
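The reranking step is almost trivially simple, which is part of the point: the hard work is maintaining the tags, not applying them. A sketch, with made-up document IDs and scores:

```python
# Authority levels as described above; the keys are illustrative.
AUTHORITY = {"approved_policy": 1.0, "official_doc": 0.8,
             "meeting_notes": 0.4, "draft": 0.2}

def rank_with_authority(hits):
    """hits: list of (doc_id, similarity, doc_type).
    Final score = similarity * authority, so a slightly-more-similar
    draft still loses to an approved policy."""
    scored = [(doc_id, sim * AUTHORITY[doc_type], doc_type)
              for doc_id, sim, doc_type in hits]
    return sorted(scored, key=lambda h: h[1], reverse=True)

hits = [("draft_revenue_v4", 0.91, "draft"),          # 0.91 * 0.2 = 0.182
        ("revenue_policy_2025", 0.87, "approved_policy")]  # 0.87 * 1.0 = 0.87
top = rank_with_authority(hits)[0][0]  # "revenue_policy_2025"
```

This is exactly the draft-vs-policy failure from earlier: the draft is the closer semantic match, and it still loses.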

Temporal Filtering

For governance queries specifically, I added a filter that only retrieves documents marked as "current" in our document management system. Superseded policies are tagged as archived and excluded from the retrieval index for governance questions. They're still available for historical queries ("what was our refund policy in 2023?"), but they don't contaminate answers about current policy.

This required integration with our document management system, which was the most engineering-intensive fix. But it solved the temporal blindness problem completely for the 40% of queries that were about current policies and procedures.
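Conceptually the filter is a pre-retrieval step: restrict the candidate pool by document status before similarity search ever runs. The status values and the historical-query escape hatch below are illustrative; in practice this logic lived in the integration with our document management system.

```python
def governance_candidates(docs, query_is_historical=False):
    """Pre-filter the retrieval pool by document status. Superseded
    policies stay indexed for historical questions but are excluded
    from current-policy queries. Status values are illustrative."""
    if query_is_historical:
        return docs  # "what was the policy in 2023?" may need archives
    return [d for d in docs if d["status"] == "current"]

docs = [{"id": "refund_policy_2025", "status": "current"},
        {"id": "refund_policy_2023", "status": "superseded"}]
current = governance_candidates(docs)
```

Filtering before retrieval, rather than downranking after, is deliberate: for governance questions an outdated policy shouldn't merely rank lower, it shouldn't be eligible at all.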

The Scorecard

| Problem | Default RAG Behavior | After Optimization |
| --- | --- | --- |
| Recall drift at scale | Retrieved plausible but irrelevant chunks | Hybrid search + authority scoring prioritized canonical sources |
| Temporal blindness | Blended outdated and current policies | Contextual prefixes + temporal filtering separated policy versions |
| Authority confusion | Treated drafts and approved docs equally | Authority scoring downranked unofficial sources |
| Exact term matching | Missed specific table/metric names | BM25 keyword search caught precise references |
| Latency on repeat queries | Every query hit the full pipeline | Semantic caching cut response time by 60% for common questions |
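The last row deserves a word, since semantic caching didn't get its own section: the idea is to answer a repeat question from cache when its embedding lands close enough to a previously answered one, skipping retrieval and generation entirely. A toy sketch with hand-rolled cosine similarity and 2-D vectors; a real system would reuse the same embedding model as retrieval, and the 0.95 threshold is an assumption you'd tune.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a new query's embedding is within
    `threshold` cosine similarity of a previously answered query."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer) pairs

    def get(self, query_vec):
        for vec, answer in self.entries:
            if cosine(vec, query_vec) >= self.threshold:
                return answer
        return None  # cache miss: run the full RAG pipeline

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
```

One governance caveat worth noting: a semantic cache needs invalidation when the underlying policies change, or it quietly reintroduces the temporal blindness problem.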

What I Learned That I Didn't Expect

RAG is a governance problem, not just an engineering problem. The technical challenge of building a RAG system is real but solvable. The governance challenge of ensuring the system retrieves authoritative, current, contextually appropriate information is harder and mostly unaddressed by the standard RAG tutorials. Every optimization I made was about trust, not performance. Authority scoring, temporal filtering, and contextual prefixes. These are governance interventions dressed up as engineering features.

The "demo to production" gap is enormous. My prototype worked in three days. The production system took four months. The difference was entirely about edge cases that only appear when you connect to real, messy, contradictory, evolving enterprise knowledge. If your RAG demo uses a curated knowledge base, you haven't tested RAG. You've tested retrieval on clean data, which is a different and much easier problem.

Chunking destroys the context that governance requires. The standard RAG pipeline breaks documents into chunks and embeds them independently. This is fine for factual retrieval. It's dangerous for governance retrieval, because governance depends on context: when was this written, who approved it, does it supersede something else, and what scope does it apply to. All of that context gets lost at the chunk boundary unless you deliberately preserve it.

Your RAG system inherits every problem your knowledge base already has. We had four versions of the data dictionary. We had meeting notes that contradicted approved policies. We had Confluence pages last updated in 2021 that were still technically "current." RAG didn't create these problems. It scaled them. It took inconsistencies that a careful human would catch and delivered them at the speed and confidence of an AI that doesn't know what it doesn't know.

What I'd Tell My Past Self

Before you build the retrieval system, fix the knowledge base. Tag every document with an owner, a date, an authority level, and a status (current, superseded, draft). If that sounds like a lot of work, it is. It's also the work that determines whether your RAG system gives trustworthy answers or confidently plausible ones.

And if anyone tells you RAG is a solved problem, ask them how many documents are in their vector database. If the answer is under a thousand, they haven't met the problem yet.

The retrieval is the easy part. The trust is the hard part. And in enterprise analytics, trust is the only part that matters.