What tools do you need to build an AI agent?

It goes well beyond just the LLM. While the AI takes center stage, the tooling to create robust agents is more about infrastructure than intelligence. You need databases that can keep up with agent workloads, orchestration to handle multi-step processes, ways to verify that agents aren’t making mistakes, and systems to monitor everything in production.

This stack can contain dozens of tools, each with its own trade-offs and integration requirements. Here, we'll go through each category in terms of how it works within your agent.

The AI Agent framework

We could just list out all the tools you might need to build an AI agent, but that won't be as helpful if you want to develop agents (or agent platforms) that target a specific use case or production standard. Instead, we need to understand how these tools fit together into a coherent architecture.

AI agents operate in a continuous feedback loop with three distinct phases: gather context, take action, and verify work.

This pattern repeats until the agent completes its task or determines it needs human intervention.
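In code, this loop is little more than a while loop around three calls. A minimal sketch, where every function is a hypothetical placeholder for tooling covered below:

```python
# Minimal sketch of the feedback loop; every function here is a
# hypothetical placeholder for tooling covered in the rest of this post.
def run_agent(task):
    while True:
        context = gather_context(task)              # context layer: retrieve and select data
        result = take_action(task, context)         # action layer: reason, call tools, produce output
        verdict = verify_work(task, result)         # verification layer: tests, linting, self-review
        if verdict.passed:
            return result
        if verdict.needs_human:
            return escalate(task, result, verdict)  # stop and hand off
        task = task.with_feedback(verdict)          # otherwise iterate with what was learned
```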

Context layer

The context phase is where your agent gathers information. It needs to retrieve relevant data, load it efficiently into the model's context window, and decide what's important enough to keep. This shouldn’t be just passive data loading, but an active search process where the agent determines what information matters for the task at hand.

Action layer

The action phase is where your agent executes. It takes the context it's gathered and does something with it: calling APIs, writing code, transforming data, or triggering workflows. The key is giving your agent the right capabilities to solve problems flexibly rather than just following rigid scripts.

Verification layer

The verification phase closes the loop. Your agent checks its own work, catches errors, and decides whether to iterate or move forward. Agents that can self-correct are fundamentally more reliable because they catch mistakes before they compound.

This is all wrapped within an infrastructure that handles orchestration, monitoring, and scaling, ensuring your agent runs reliably in production.

Now, let's break down what you need at each layer.

Context layer tooling

The tools for the context layer are all about storing and retrieving information. The right tools depend on what kind of data you're working with and how your agent accesses it.

Databases for transactional data

Agents need OLTP databases that can provision instantly when spinning up new projects or user sessions, handle unpredictable bursty workloads that idle most of the time then spike suddenly, provide isolated environments for testing queries or schema changes, and support multi-agent architectures where each agent or domain gets its own database.

Full-stack agent platforms like Replit or Lovable run their backends on Neon or Supabase. For simpler setups, such as embedded, single-user agents, SQLite might work well.
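In practice, that often means provisioning a fresh database per project or session from the agent's own code. A hedged sketch against a generic provisioning API; the endpoint, auth scheme, and response field below are illustrative rather than any specific vendor's schema:

```python
# Hypothetical sketch: provision an isolated Postgres database for a new
# agent session via your provider's management API. The endpoint and
# response fields are illustrative, not a specific vendor's schema.
import os
import requests

def provision_session_db(session_id: str) -> str:
    resp = requests.post(
        os.environ["DB_PROVISION_URL"],                            # your provider's "create database" endpoint
        headers={"Authorization": f"Bearer {os.environ['DB_API_KEY']}"},
        json={"name": f"agent-session-{session_id}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["connection_string"]                        # hand this to the agent's tools
```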

Vector databases

If your agent requires specialized vector workloads at massive scale with optimized indexing, you can lean on dedicated databases like:

These make sense when you're working with billions of vectors or need advanced filtering. For most agents, keeping everything in a single database system will reduce complexity and latency. A popular option is to skip a specialized vector database altogether and handle vector search alongside transactional data through the pgvector Postgres extension. Vendors like Neon support it as part of their pre-installed catalog.
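A minimal sketch of that setup with psycopg and pgvector; the table and column names are illustrative:

```python
# Sketch: vector search next to transactional data with pgvector,
# using psycopg; table and column names are illustrative.
import psycopg

SETUP = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    body text NOT NULL,
    embedding vector(1536)
);
"""

def top_matches(conn_str: str, query_embedding: list[float], k: int = 5):
    with psycopg.connect(conn_str) as conn:
        conn.execute(SETUP)
        # pgvector accepts a '[...]' string literal cast to vector
        vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
        rows = conn.execute(
            "SELECT id, body FROM documents "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, k),
        ).fetchall()
    return rows
```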

Blob storage

If your agent needs access to large files, documents, logs, and media (e.g., reading PDFs for context, processing images, analyzing logs, or storing generated reports and visualizations), you might need to integrate with object storage.

Your agent accesses these through APIs, retrieving files on demand and storing outputs. The challenge is managing permissions and costs, especially egress charges.
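A minimal sketch of that access pattern with boto3 against S3; the bucket and key names are examples:

```python
# Sketch: pull a document from object storage for the agent to read,
# and write a generated report back. Bucket and key names are examples.
import boto3

s3 = boto3.client("s3")

def read_document(bucket: str, key: str) -> bytes:
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read()

def store_report(bucket: str, key: str, content: bytes) -> None:
    s3.put_object(Bucket=bucket, Key=key, Body=content)
```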

MCP servers

You’ll also need a standardized way to connect your agent to external data without writing custom integrations for every service. The Model Context Protocol (MCP) does this through MCP servers that expose tools and resources in a consistent format. MCP servers connect to various sources, e.g.:

The advantage is standardization. Once your agent knows MCP, it works with any MCP server without needing to learn service-specific APIs. Authentication and calls happen automatically.
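On the other side of the protocol, exposing your own data source as an MCP server takes only a few lines with the official Python SDK's FastMCP helper; the tool below is a stand-in:

```python
# Minimal MCP server sketch using the Python SDK's FastMCP helper;
# the tool body is a stand-in for whatever data source you expose.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Return the status of an order (placeholder implementation)."""
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```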

For services without MCP servers, you fall back to direct REST APIs, webhooks, or custom connectors. More work, but complete control.

Action layer tooling

The action layer is where your agent executes tasks. Once it has context, it needs to reason about what to do and actually do it. This requires models for intelligence, frameworks for orchestration, and infrastructure for safe execution.

LLM providers

You need a language model as the reasoning engine for your agent. The model interprets context, decides what actions to take, generates responses, and calls tools. Your choice of model determines your agent's capabilities, cost, and latency.

Top-tier models for production agents:

The advantage is capability. These models handle complex reasoning, understand nuanced instructions, and generate high-quality outputs. They support function calling for tool use and maintain coherence across long conversations.

The downsides are cost and latency. Every agent decision requires a model call, so high-volume agents can rack up significant token costs. Latency matters for real-time interactions, and these models typically have response times of 1-5 seconds.

For specific use cases, consider:

Most production agents use a mix: primary reasoning with top-tier models, routine tasks with faster models, and embeddings from specialized models.
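Whichever provider you pick, the integration point looks the same: a chat call that includes a tool schema the model can choose to invoke. A minimal sketch using the OpenAI SDK as the example provider; the model name and tool schema are illustrative:

```python
# Sketch: a single reasoning step with function calling, using the OpenAI
# SDK as an example provider; model name and tool schema are illustrative.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # example model; pick per cost/latency needs
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

# The model either answers directly or asks to call a tool
choice = response.choices[0].message
if choice.tool_calls:
    print(choice.tool_calls[0].function.name, choice.tool_calls[0].function.arguments)
else:
    print(choice.content)
```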

Agent frameworks

You need orchestration to manage multi-step workflows, tool calling, and memory. Agent frameworks handle the loop of observing context, deciding on actions, executing tools, and updating state.

Popular frameworks for production:

The advantage is speed. These frameworks handle the boilerplate of prompt construction, tool calling, error handling, and state management. They provide pre-built patterns for common agent workflows.

The problem is abstraction overhead. Frameworks can obscure what's actually happening, making debugging harder. They also add dependencies and expose you to breaking API changes. Some teams find heavy frameworks too rigid for custom agent logic.

For complex multi-agent systems, consider CrewAI for role-based agent teams or Semantic Kernel for .NET environments. For full control, build custom orchestration using direct model APIs with your own state management.
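If you do roll your own, the core loop is small. A minimal sketch, where call_model and TOOLS are hypothetical stand-ins for your LLM client and tool registry:

```python
# Sketch of a hand-rolled orchestration loop: call the model, execute any
# requested tool, feed the result back, and stop when the model answers.
# call_model and TOOLS are hypothetical stand-ins for your own client and registry.

def run(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=TOOLS)       # your LLM client of choice
        if reply.tool_call is None:
            return reply.content                        # model produced a final answer
        tool = TOOLS[reply.tool_call.name]
        result = tool(**reply.tool_call.arguments)      # execute the requested tool
        messages.append({"role": "assistant", "content": reply.content or "",
                         "tool_call": reply.tool_call.name})
        messages.append({"role": "tool", "content": str(result)})  # state lives in the message list
    raise RuntimeError("agent did not finish within the step budget")
```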

Workflow orchestration

You need durable execution for multi-step processes. Agents often run tasks that span minutes or hours, call multiple external services, and must handle failures gracefully without losing progress.

Vercel Workflows or Inngest provide event-driven workflow orchestration built for agents. For similar capabilities, Temporal offers workflow-as-code with strong consistency guarantees. AWS Step Functions provides managed state machines integrated with AWS services, while Prefect focuses on data pipeline orchestration with scheduling and monitoring.
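Engines like these handle durability for you, but the underlying idea is checkpointing each step so a failed run resumes instead of restarting. A minimal engine-agnostic sketch, where checkpoint_store and the step functions are hypothetical placeholders:

```python
# Sketch of durable execution without a specific engine: each step is
# checkpointed, so a crashed run resumes where it left off instead of
# redoing completed work. checkpoint_store, search_web, summarize, and
# publish_report are hypothetical placeholders.

def durable_step(run_id: str, name: str, fn, *args):
    cached = checkpoint_store.get(run_id, name)   # already completed in a previous attempt?
    if cached is not None:
        return cached
    result = fn(*args)                            # steps should be idempotent or retry-safe
    checkpoint_store.put(run_id, name, result)
    return result

def research_workflow(run_id: str, topic: str):
    sources = durable_step(run_id, "search", search_web, topic)
    summary = durable_step(run_id, "summarize", summarize, sources)
    return durable_step(run_id, "publish", publish_report, summary)
```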

Code execution and sandboxing

Agents that write and run code need safe execution environments. You can't let agent-generated code access your production systems or run indefinitely. Sandboxing isolates execution and limits damage from bugs or malicious code.

Primary approaches:

Docker is the standard. You run agent code in ephemeral containers that have no access to the host system, enforce CPU and memory limits, and tear down after execution. Set timeouts to prevent infinite loops. Use read-only filesystems where possible.
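A minimal sketch of that pattern, shelling out to the Docker CLI; the image and entry file are illustrative:

```python
# Sketch: run agent-generated code in a throwaway container with no network,
# capped resources, a read-only filesystem, and a hard timeout.
import subprocess

def run_in_sandbox(workdir: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",          # no access to your systems
        "--memory", "512m", "--cpus", "1",
        "--read-only",
        "-v", f"{workdir}:/work:ro",  # mount the generated code read-only
        "-w", "/work",
        "python:3.12-slim",
        "python", "main.py",          # illustrative entry point for the generated code
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```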

Modal and E2B provide managed sandboxing that removes infrastructure overhead. Modal excels at compute-intensive tasks with its serverless GPU access, while E2B focuses specifically on AI agent code execution with pre-configured runtimes.

For serverless environments, AWS Lambda and similar platforms provide built-in sandboxing with execution time limits. These work well for short-lived agent tasks but have constraints on runtime and available resources.

Verification layer tooling

The verification layer ensures your agent isn't making mistakes. Agents can hallucinate, generate broken code, or make poor decisions. You need tools to monitor behavior, test outputs, and catch errors before they reach users.

Observability and evaluation

You need to monitor what your agent is doing and evaluate whether it's doing it well. This means tracing each decision, logging tool calls, measuring output quality, and detecting when performance degrades.

Braintrust provides AI observability and evaluation built for agents. For similar capabilities, Langfuse offers open-source LLM tracing with prompt management. Arize Phoenix provides observability focused on detecting model drift and data quality issues. Custom logging to the ELK stack or Datadog works for teams that need to build their own evaluation logic.
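Even before adopting a dedicated platform, you can get much of the value from structured trace logs around every tool call. A minimal sketch using only the standard library; the field names are illustrative:

```python
# Sketch: structured logging around every tool call so traces can be
# searched and scored later; generic stdlib logging, not tied to a vendor.
import json, logging, time, uuid

log = logging.getLogger("agent.trace")

def traced_tool_call(run_id: str, tool_name: str, fn, **kwargs):
    span_id = uuid.uuid4().hex
    start = time.monotonic()
    try:
        result = fn(**kwargs)
        status = "ok"
        return result
    except Exception:
        status = "error"
        raise
    finally:
        log.info(json.dumps({
            "run_id": run_id, "span_id": span_id,
            "tool": tool_name, "args": kwargs,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000),
        }, default=str))
```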

Testing frameworks

You need automated testing to verify agent behavior before deployment. This includes testing that agents handle expected inputs correctly, fail gracefully on edge cases, and maintain consistent quality across prompt or model changes.

Standard testing frameworks work for agents with some adaptation:

The approach is similar to traditional software testing. Create a test suite with representative queries, run your agent against them, and assert that outputs meet quality criteria. The difference is that agent outputs aren't deterministic, so tests often check for semantic correctness rather than exact matches.

Custom eval harnesses provide more sophistication. These run large test sets, use LLM-as-judge to score outputs on fuzzy criteria like helpfulness or tone, and track performance over time. Human review loops add a layer where people evaluate agent outputs, especially for subjective quality or edge cases that automated tests miss.
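A minimal sketch of that pattern with pytest, where run_agent and llm_judge_score are hypothetical stand-ins for your agent entry point and a judge-model helper:

```python
# Sketch: a pytest check that scores an agent answer for semantic quality
# rather than exact text. run_agent and llm_judge_score are hypothetical
# stand-ins for your agent entry point and an LLM-as-judge helper.
import pytest

CASES = [
    ("How do I reset my password?", "mentions the reset link and its expiry"),
    ("Cancel my subscription", "confirms cancellation steps without inventing fees"),
]

@pytest.mark.parametrize("query,criteria", CASES)
def test_agent_answers_meet_criteria(query, criteria):
    answer = run_agent(query)
    score = llm_judge_score(answer, criteria)   # e.g. 0-1 rating from a judge model
    assert score >= 0.7, f"low-quality answer for {query!r}: {answer}"
```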

Linting and code quality

Agents that generate code need validation to catch syntax errors, security issues, and style problems. Running linters on agent-generated code provides immediate feedback that the agent can use to fix mistakes. Language-specific linters catch different issues:
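Whatever linter fits the generated language, the feedback loop is the same: run it against the output and hand any findings back to the agent to fix. A minimal sketch assuming Ruff is installed for Python output:

```python
# Sketch: lint agent-generated Python with Ruff and hand any findings back
# to the agent as feedback; the file path is illustrative.
import subprocess

def lint_feedback(path: str) -> str | None:
    result = subprocess.run(
        ["ruff", "check", path],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return None                 # clean: nothing for the agent to fix
    return result.stdout            # findings the agent can use to revise the code
```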

Deployment platforms

You need somewhere to run your agent code that scales with demand, handles failures gracefully, and doesn't require constant babysitting. The platform choice depends on your agent's runtime requirements and usage patterns.

Container orchestration platforms provide the most flexibility:

Serverless platforms remove infrastructure management:

Cloud VMs remain an option for agents that need long-running processes or specific system configurations. Services like AWS EC2, Google Compute Engine, or Azure VMs give you full machine access but require more operational work for scaling and availability.

API gateways

You need a front door for your agent that handles authentication, rate limiting, request routing, and monitoring. API gateways sit between users and your agent, managing all incoming traffic.

Common gateway solutions:

Secrets management

You need secure storage for API keys, database credentials, and other sensitive data. Hardcoding secrets in code or environment variables creates security risks. Secrets management systems provide encrypted storage, access control, and audit logging.

Standard solutions:

For development, environment variables work but aren't suitable for production. Tools like dotenv manage local secrets but lack the encryption and audit capabilities needed for production systems.

The key is never committing secrets to version control and rotating them regularly. Secrets management systems enforce these practices through technical controls rather than relying on developer discipline.
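A minimal sketch of runtime retrieval, using AWS Secrets Manager via boto3 as the example backend; the secret name is illustrative:

```python
# Sketch: fetch a credential at runtime from a secrets manager instead of
# hardcoding it; AWS Secrets Manager is the example backend, and the
# secret name is illustrative.
import boto3

def get_database_url() -> str:
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId="agent/prod/database-url")
    return secret["SecretString"]
```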

Building your agent stack

Building production agents is less about finding the perfect tool and more about understanding what your agent actually needs at each layer. Start with the basics: a database that provisions fast, an LLM that can reason through your use cases, and observability so you know what's happening. Add complexity only when you need it.

Not every agent needs workflow orchestration or dedicated vector databases. A simple agent might just need Neon for data, Claude for reasoning, and basic logging. A complex multi-agent platform needs the whole stack with queues, sandboxing, and sophisticated monitoring.

The common thread is infrastructure that matches agent behavior. Traditional tools built for steady workloads break down when agents create unpredictable spikes, need instant provisioning, or operate across multiple isolated environments. Start simple, measure what matters, and add tools as your agent's requirements become clear.