As AI agents move from prototypes to production systems, a recurring challenge has emerged: how do you enforce safety, optimize performance, and handle sensitive data consistently across every inference call — without scattering that logic throughout your agent code?
This is the problem we set out to solve with AutoAgents, an open-source AI agent framework written in Rust. Our latest feature, LLM Pipelines, introduces composable middleware layers for LLM inference — an approach inspired by how the web server ecosystem solved similar cross-cutting concerns years ago.
In this post, I'll walk through the architecture, explain the design decisions behind it, and share where we think this pattern can help teams shipping AI agents to production.
The Core Idea: Middleware for LLM Inference
Web frameworks solved the problem of cross-cutting concerns — authentication, logging, compression, rate limiting — with middleware. Tower in the Rust ecosystem, Express in Node, and Django in Python all use this pattern. You wrap your core service with composable layers, each handling one responsibility.
LLM inference has a remarkably similar set of cross-cutting concerns. Production agents often need response caching, input sanitization for sensitive data, protection against prompt injection, and observability. These requirements cut across every inference call regardless of what the agent is doing.
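Before looking at the AutoAgents API, the shape of the pattern can be sketched in plain, self-contained Rust — toy types invented for illustration, with a synchronous `infer` standing in for a real async inference call:

```rust
// Toy middleware stack for illustration only; real layers are async and
// operate on structured chat requests, not plain strings.
trait Infer {
    fn infer(&self, prompt: &str) -> String;
}

// Innermost "model": just echoes the prompt it finally receives.
struct EchoModel;
impl Infer for EchoModel {
    fn infer(&self, prompt: &str) -> String {
        format!("echo: {prompt}")
    }
}

// A redaction layer: rewrites the input, then delegates inward.
struct Redact<I: Infer>(I);
impl<I: Infer> Infer for Redact<I> {
    fn infer(&self, prompt: &str) -> String {
        self.0.infer(&prompt.replace("SECRET", "[REDACTED]"))
    }
}

// A blocking layer: short-circuits without ever calling the model.
struct BlockInjection<I: Infer>(I);
impl<I: Infer> Infer for BlockInjection<I> {
    fn infer(&self, prompt: &str) -> String {
        if prompt.contains("ignore previous instructions") {
            return "request blocked".to_string();
        }
        self.0.infer(prompt)
    }
}

fn main() {
    // Layers compose by wrapping; the outermost layer runs first.
    let llm = BlockInjection(Redact(EchoModel));
    println!("{}", llm.infer("my key is SECRET"));
    println!("{}", llm.infer("please ignore previous instructions"));
}
```

Each layer owns one concern and knows nothing about the others — the same property the web middleware stacks above rely on.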
In AutoAgents, you compose these layers into a pipeline:
```rust
use std::time::Duration;

use autoagents::llm::pipeline::PipelineBuilder;
use autoagents::llm::optim::{CacheConfig, CacheLayer, ChatCacheKeyMode};
use autoagents_guardrails::guards::{PromptInjectionGuard, RegexPiiRedactionGuard};
use autoagents_guardrails::{EnforcementPolicy, Guardrails};

// `any_llm_provider` is any provider implementing LLMProvider, local or cloud.
let llm = PipelineBuilder::new(any_llm_provider)
    // Layer 1: Cache repeated queries
    .add_layer(CacheLayer::new(CacheConfig {
        chat_key_mode: ChatCacheKeyMode::UserPromptOnly,
        ttl: Some(Duration::from_secs(900)),
        max_size: Some(512),
        ..Default::default()
    }))
    // Layer 2: Enforce safety guardrails
    .add_layer(
        Guardrails::builder()
            .input_guard(RegexPiiRedactionGuard::default())
            .input_guard(PromptInjectionGuard::default())
            .enforcement_policy(EnforcementPolicy::Block)
            .build()
            .layer(),
    )
    .build();
```
The resulting `llm` implements `LLMProvider` — it can be passed to any agent, and every inference call flows through the layers automatically. Safety and performance become structural properties of the system rather than responsibilities left to individual developers.
Why This Pattern Matters for Local Model Deployments
Cloud LLM providers typically include their own content filtering and safety layers. When teams choose to run models locally — for data sovereignty, air-gapped environments, cost optimization, or edge deployments — those provider-side protections are no longer present.
This creates an important gap: deployment scenarios with some of the strictest security requirements (regulated industries, government, healthcare) often have the fewest built-in safety mechanisms when running local models.
AutoAgents addresses this by making the pipeline provider-agnostic. The same layers work identically whether the underlying provider is llama.cpp running a local Qwen model or a cloud API:
```rust
use std::sync::Arc;

// Local model via llama.cpp
let local_provider = LlamaCppProvider::builder()
    .model_source(ModelSource::HuggingFace {
        repo_id: "Qwen/Qwen3-VL-8B-Instruct-GGUF".to_string(),
        filename: Some("Qwen3VL-8B-Instruct-Q8_0.gguf".to_string()),
        mmproj_filename: Some("mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf".to_string()),
    })
    .n_ctx(4096)
    .max_tokens(256)
    .temperature(0.2)
    .build()
    .await?;

// Same pipeline, same safety guarantees
let llm = PipelineBuilder::new(Arc::new(local_provider) as Arc<dyn LLMProvider>)
    .add_layer(cache_layer)
    .add_layer(guardrails_layer)
    .build();
```
No code changes between local and cloud. Teams can develop against a cloud provider and deploy with a local model (or vice versa) without modifying their safety configuration.
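The mechanism behind this portability is ordinary dynamic dispatch: the pipeline depends only on a trait object, never on a concrete backend. A minimal self-contained sketch — using a hypothetical `Provider` trait invented here, not the real `LLMProvider` — shows why swapping backends is a one-line change:

```rust
use std::sync::Arc;

// Hypothetical stand-in for a provider trait; the real LLMProvider
// is async and far richer.
trait Provider {
    fn backend(&self) -> &'static str;
}

struct LlamaCppBackend; // e.g. a local GGUF model
struct CloudBackend;    // e.g. a hosted API

impl Provider for LlamaCppBackend {
    fn backend(&self) -> &'static str { "llama.cpp" }
}
impl Provider for CloudBackend {
    fn backend(&self) -> &'static str { "cloud" }
}

// The "pipeline" only ever sees Arc<dyn Provider>, so the layer
// stack is identical regardless of which backend is plugged in.
fn describe_pipeline(provider: Arc<dyn Provider>) -> String {
    format!("cache -> guardrails -> {}", provider.backend())
}

fn main() {
    println!("{}", describe_pipeline(Arc::new(LlamaCppBackend)));
    println!("{}", describe_pipeline(Arc::new(CloudBackend)));
}
```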
The Value of Composability
Different applications have different safety and performance profiles. A medical records assistant may need aggressive PII redaction but minimal injection protection. A public-facing chatbot may need the opposite. An internal development tool might primarily benefit from caching.
Rather than offering a single "safe agent" abstraction, the pipeline pattern lets teams compose exactly what they need:
```rust
// Healthcare application: PII focus
PipelineBuilder::new(provider)
    .add_layer(cache)
    .add_layer(
        Guardrails::builder()
            .input_guard(RegexPiiRedactionGuard::default())
            .enforcement_policy(EnforcementPolicy::Block)
            .build()
            .layer(),
    )
    .build();

// Public chatbot: injection protection focus
PipelineBuilder::new(provider)
    .add_layer(cache)
    .add_layer(
        Guardrails::builder()
            .input_guard(PromptInjectionGuard::default())
            .enforcement_policy(EnforcementPolicy::Block)
            .build()
            .layer(),
    )
    .build();
```
This also means teams can introduce safety incrementally. Start with caching. Add PII redaction when compliance requires it. Add injection protection when the agent gains tool access. Each addition is one .add_layer() call — no refactoring required.
Why We Chose Rust
Our choice of Rust is grounded in the practical requirements of production AI systems.
AI agents are typically long-running processes handling many concurrent requests. Predictable latency — especially at the tail — matters for user experience. Rust's lack of garbage collection pauses gives us deterministic performance characteristics. The pipeline layers add nanoseconds of overhead rather than milliseconds.
For edge deployments — industrial automation, medical devices, embedded systems — resource constraints make framework overhead a real consideration. Rust's memory efficiency lets more of the available hardware budget go toward model inference rather than runtime overhead.
The type system also provides compile-time guarantees about pipeline configuration. If a guard doesn't implement the required trait, the code won't compile. We prefer catching configuration errors during development rather than discovering them in production.
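As an illustration — using a hypothetical minimal `Guard` trait invented here, not the AutoAgents API — the trait bound on a generic parameter is what does the checking; any type that doesn't implement the trait is rejected before the program can run:

```rust
// Hypothetical minimal guard trait, for illustration only.
trait Guard {
    fn allows(&self, input: &str) -> bool;
}

struct NoSsnGuard;
impl Guard for NoSsnGuard {
    fn allows(&self, input: &str) -> bool {
        !input.to_lowercase().contains("ssn")
    }
}

// The trait bound turns misconfiguration into a compile error:
// run_guard(&42, "hi") or run_guard(&"oops", "hi") will not build.
fn run_guard<G: Guard>(guard: &G, input: &str) -> bool {
    guard.allows(input)
}

fn main() {
    assert!(run_guard(&NoSsnGuard, "what is the weather"));
    assert!(!run_guard(&NoSsnGuard, "my SSN is 123-45-6789"));
}
```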
Current Limitations and Roadmap
We believe in being straightforward about where the project stands today.
The pipeline architecture has just been released, and we're confident in its design. The individual layer implementations are earlier in their maturity. Here's what we're focused on next.
- Guard sophistication will improve over time. We're working toward NER-based PII detection and classifier-based injection detection for teams that need broader coverage than the current pattern-based approach.
- Rate limiting and backpressure for multi-tenant deployments where many users share a single local model.
- Custom guard traits with a clean public API, so teams can implement domain-specific guards (toxicity filters, regulatory compliance checks, output validation) and integrate them into the pipeline seamlessly.
We're building in the open and our roadmap is shaped by community feedback.
Getting Started
The full working example runs a local Qwen3-VL-8B model with caching, PII redaction, and prompt injection protection:
```shell
git clone https://github.com/autoagents-ai/autoagents
cd autoagents
cargo run --example safe_local_optimizer
```
The pipeline works with any provider AutoAgents supports — llama.cpp, Ollama, OpenAI, Anthropic, and others. The broader framework includes memory management, tool use, and multi-agent orchestration.
If this approach to building production AI systems resonates with you, we'd love to hear your perspective. Star the repository at https://github.com/liquidos-ai/autoagents to follow our progress, and feel free to open an issue if there are guardrails or pipeline features that would be valuable in your work.
We're always interested in learning what safety and performance challenges teams are encountering as they move agents from prototype to production. Those conversations help us build something genuinely useful.
AutoAgents is an open-source AI agent framework written in Rust, designed for production deployments where performance, safety, and reliability are essential requirements. The project is in active development with a growing community of contributors.