I've spent the last two years talking to engineering teams that ship AI agents to production. Startups, enterprises, teams across BFSI, healthcare, fintech, voice AI, analytics. Different industries, different use cases, different budgets.
The pattern is almost always the same.
Building the AI agent? Relatively straightforward. Getting it to production with confidence? That's where everything breaks.
Teams aren't struggling with models anymore. GPT-4, Claude, Gemini, Llama. The models are good enough for most production use cases. The bottleneck has shifted. It's no longer "can the model do this?" It's "can we trust what we built enough to put it in front of real users?"
And the tooling to answer that question is a mess.
The Current State of AI Agent Tooling
Let me walk through what a typical AI agent stack looks like in production today. If you ship agents, you'll recognize this immediately.
Tracing and Observability
You need to see what your agent is doing. What prompts it's sending, what responses it's getting, where it's failing. Tools like Langfuse, handle this. They're good at it. You pick one, integrate it, and you can finally see inside the black box.
Evaluation
You need to test your agent's outputs against quality benchmarks. Are the responses accurate? Relevant? Safe? Some teams build custom eval scripts. Many teams do this manually. Someone reading outputs and checking a spreadsheet.
Guardrails and Safety
You need to prevent your agent from saying something it shouldn't. Guardrails AI, NVIDIA's NeMo Guardrails, and Llama Guard offer different approaches. Some focus on input validation, others on output filtering. Most teams integrate one of these partially, often as a last-minute addition before launch.
Prompt Optimization
Your prompts need to get better over time. DSPy offers programmatic optimization. But for most teams, this process looks like one person with good instincts and a Google Doc, manually tweaking prompts based on failures they noticed.
Simulation Testing
Before your agent meets real users, you should test it with simulated scenarios. Adversarial inputs, edge cases, high-volume stress tests. This is where the landscape gets thin. Very few teams do this systematically. Most do ad-hoc testing, or skip it entirely and discover failures in production.
Production Monitoring
Once your agent is live, you need to know when things go wrong. This often overlaps with tracing but extends to cost tracking, latency monitoring, and quality degradation detection. Some teams repurpose their tracing tool. Others build dashboards from scratch.
The Problem Isn't the Tools. It's the Gaps.
AI agents don't fail in one layer. They fail across layers.
A hallucination happens because of a retrieval context issue (which your tracing tool could surface), that you'd have caught with the right eval (which lives in a different tool), that your guardrails didn't block because they're not connected to your tracing (separate system), and you discover it when a user reports it (because your monitoring isn't connected to any of the above).
No single tool sees the full failure chain. Each one sees its slice. The failures live in the seams between them.
And this is what most teams are dealing with: 4-5 disconnected tools, each doing its job well in isolation, collectively leaving massive blind spots in the gaps.
This Feels Familiar
If you've been in tech long enough, you've seen this before.
This is the state of devops circa 2010. Jenkins for CI. Nagios for monitoring. Puppet for config management. Capistrano for deployments. Each tool was great. The integration tax was killing teams. Then platforms emerged and the entire ecosystem consolidated around tools that connected the full pipeline.
AI agent infrastructure is in that pre-consolidation phase right now. The tools exist. They're good. But we're still in the "five bash scripts held together with cron jobs" era of putting it all together.
What Needs to Happen Next
I think the AI agent stack needs the same thing the devops stack needed: platforms that connect the full lifecycle.
Not one tool for tracing. One for evals. One for guardrails. One for monitoring. But a connected system where a guardrail firing maps back to which eval would have caught it. Where a production failure automatically generates a test case. Where simulation testing and production monitoring share the same context.
One loop. Build, evaluate, optimize, simulate, observe, protect. Connected.
The second thing that needs to happen: whatever platform emerges should be open.
The AI safety layer, the part that decides what your agent can and can't say to your customers, should be the most transparent, auditable, and self-hostable part of your stack. Not the most locked down. Your database is open source. Your auth is open source. Your infrastructure is open source. The layer governing AI behavior should be too.
This consolidation is inevitable. The question is whether the platform that wins will be open or closed.
I have strong opinions on which one it should be. And I'm not just writing about it.
More on that soon.