GenAI demos are impressive, but the production systems behind them are expensive. What most teams do not realize until it is too late is that the cost is not coming from the model; it is coming from everything you send to it.

We learned this the hard way. The system was working well, users were happy, and nothing major had changed. Yet the costs kept climbing. At first, we assumed it was the model. We tried switching models, tweaking parameters, and reducing output tokens, but nothing made a meaningful difference. The real issue was something far less obvious. It was the context.

There is a common assumption that more context leads to better results. So over time, systems start accumulating information. Full chat histories get passed along, multiple retrieved documents are included, logs and metadata sneak into prompts, and instructions grow longer with every iteration. None of these decisions seem wrong on their own. In fact, they often improve quality in the short term. But together, they create a quiet and persistent problem.

When we finally measured what was being sent to the model, the results were surprising. A large portion of the tokens had little to do with the actual user question. We were not just sending context. We were sending noise, and paying for it every time.
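The measurement itself does not need to be sophisticated. A minimal sketch of the kind of breakdown that surfaced the problem for us is below: count tokens per prompt component rather than per request. It assumes the tiktoken library, and the component names and placeholder strings are purely illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many recent OpenAI chat models

def token_breakdown(components: dict[str, str]) -> dict[str, int]:
    """Count the tokens contributed by each named piece of the prompt."""
    return {name: len(enc.encode(text)) for name, text in components.items()}

# Placeholder content; in practice each value comes from your prompt assembly step.
breakdown = token_breakdown({
    "system_prompt":  "You are a helpful assistant. " * 40,
    "chat_history":   "user: ...\nassistant: ...\n" * 200,
    "retrieved_docs": "A retrieved passage about the product. " * 300,
    "metadata":       '{"session_id": "abc", "trace": [...]} ' * 50,
    "user_question":  "How do I reset my password?",
})

total = sum(breakdown.values())
for name, count in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:>15}: {count:6d} tokens ({count / total:.0%})")
```

Seeing the percentages per component, rather than a single total, is what made it obvious that the user question was a rounding error next to everything wrapped around it.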

Where Context Bloat Actually Comes From

If you zoom out, most GenAI systems follow a simple flow, but the problem is what gets added along the way.

At first glance, each step makes sense. You want memory, relevant documents, system awareness, and structured prompts. The problem is that none of these layers are designed to remove information. They only add to it.

Chat history grows with every interaction, even when older turns are no longer relevant. Retrieval systems return multiple chunks that often overlap semantically, repeating the same idea in different ways. Logs and metadata get included "just in case," even when they are never used. Prompts evolve over time, accumulating instructions that were added for edge cases but never removed.

By the time the request reaches the model, it looks less like a focused question and more like a collection of everything the system has ever seen.
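To make the pattern concrete, here is a deliberately naive sketch of the kind of prompt assembly that produces this. The names are hypothetical, but the shape will look familiar: every layer appends, and nothing removes.

```python
def build_prompt(state: dict) -> str:
    """Naive assembly: each layer appends; nothing ever filters or drops."""
    parts = [
        state["system_prompt"],                   # grows as edge-case instructions pile up
        "\n".join(state["chat_history"]),         # full transcript, every turn ever seen
        "\n\n".join(state["retrieved_chunks"]),   # top-k chunks, often overlapping
        state["metadata_blob"],                   # logs and IDs included "just in case"
        state["user_question"],
    ]
    return "\n\n".join(parts)
```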

What makes this worse is that modern GenAI systems are not single-step. A typical request may trigger multiple model calls for routing, reasoning, and tool usage. That means this bloated context is not just sent once. It is sent repeatedly, multiplying the cost across the entire workflow.
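The multiplication is easy to underestimate. A back-of-the-envelope sketch, with numbers made up purely for illustration:

```python
# Hypothetical figures for illustration only.
shared_context_tokens = 6_000   # history + retrieved docs + metadata carried into every call
question_tokens = 50
calls_per_request = 4           # e.g. routing, planning, tool use, final answer

tokens_per_request = calls_per_request * (shared_context_tokens + question_tokens)
print(tokens_per_request)       # 24,200 input tokens spent answering a 50-token question
```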

Fixing Context Bloat (What Actually Worked)

The fix was not about tuning the model; it was about changing the shape of the system. We redesigned the flow to treat context as something that must be earned, not assumed.

The biggest shift was introducing a filtering mindset early in the pipeline. Instead of passing everything forward, we asked what was truly necessary for this specific request.

Chat history was no longer treated as a full transcript but as a source of relevant signals. Only the most recent or meaningful interactions were included. Older context was either dropped or summarized into short, high-value representations.
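In code, the policy we converged on looks roughly like the sketch below. It assumes a summarize() helper backed by a cheap, short model call; the helper and the cutoff are illustrative, not prescriptive.

```python
def compact_history(turns: list[str], keep_last: int = 4) -> list[str]:
    """Keep the most recent turns verbatim; collapse everything older into one summary."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize("\n".join(older))  # hypothetical helper: a cheap summarization call
    return [f"Conversation so far (summary): {summary}"] + recent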

Retrieval became more deliberate. Instead of pulling in multiple large chunks, we focused on bringing in only what added new information. Redundant or overlapping content was eliminated. In many cases, we found that fewer, more precise inputs produced better results.
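One way to implement that, assuming you already have a unit-normalized embedding for each retrieved chunk and the chunks arrive sorted by retrieval score, is a greedy de-duplication pass. The threshold and chunk budget here are assumptions you would tune for your own corpus.

```python
import numpy as np

def deduplicate_chunks(chunks: list[str], embeddings: np.ndarray,
                       max_chunks: int = 3, similarity_threshold: float = 0.9) -> list[str]:
    """Keep a chunk only if it is not too similar to anything already kept.

    Assumes chunks[i] corresponds to the unit-normalized embedding embeddings[i],
    and that chunks are ordered best-first by retrieval score.
    """
    kept: list[int] = []
    for i in range(len(chunks)):
        if all(float(embeddings[i] @ embeddings[j]) < similarity_threshold for j in kept):
            kept.append(i)
        if len(kept) == max_chunks:
            break
    return [chunks[i] for i in kept]
```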

We also replaced raw context with compressed meaning. Long passages were summarized, logs were converted into structured signals, and prompts were simplified to remove legacy instructions. The goal was not to reduce information, but to remove unnecessary form.
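As one concrete (and simplified) example of what "logs converted into structured signals" meant in practice, the model stopped receiving raw log dumps and instead received a short, structured summary of the facts it actually uses. The fields below are illustrative.

```python
import json

def logs_to_signal(raw_log_lines: list[str]) -> str:
    """Reduce a raw log dump to a handful of structured facts."""
    errors = [line for line in raw_log_lines if "ERROR" in line]
    signal = {
        "total_lines": len(raw_log_lines),
        "error_count": len(errors),
        "last_error": errors[-1][:200] if errors else None,
    }
    return json.dumps(signal)  # a few dozen tokens instead of hundreds of raw lines
```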

Once we made these changes, the impact was immediate. Token usage dropped significantly, costs came down, and response times improved. More importantly, the model performed better because it was no longer distracted by irrelevant inputs.

What Changed in How We Think

This experience forced us to rethink a fundamental assumption. GenAI systems are not just about intelligence; they are about control.

Every token sent to a model is a decision, and most systems are making too many of them without realizing it. The most important lesson we took away is simple: the most expensive token is the one that did not need to be sent.