As agentic technologies move from controlled prototypes into the real world, security guardrails become fundamental for business use cases. Prototypes built by research scientists and software engineers rarely adhere to the security and privacy standards that a scaled product needs when handling customers' personal information such as credit card details, addresses, and ages.
In traditional software, the boundaries for accessing the data layer and other service API endpoints are explicit and can be tuned through resource and action policies, so the service behaves deterministically. In an AI agent-driven workflow, that line blurs into what is sometimes loosely called "fuzzy logic": a single model can summarize sensitive text or call private and internal APIs across multiple user contexts without realizing that the information should remain confidential. If fully autonomous agents are to perform product operations, privacy design is not optional; it becomes the most important feature.
The power of AI agents raises a key question: how do you stop an intelligent system from doing something it is not supposed to do?
The Risks of Autonomy
Most developers worry about hallucinations or poor reasoning in LLMs (Large Language Models). Very few realize that the same behaviors can quietly leak sensitive data.
For example, assume a production agentic system helps you book a ticket to your favorite NBA team's game. As the software company behind that application, would you rather have the purchase fail because of a hallucination with no customer information leaked, or have the purchase succeed at the cost of a data leak?
This can happen in a few ways:
1. Context persistence: Agents often cache prior inputs for coherence, so those caches can end up holding private tokens, emails, or identifiers longer than intended and must be managed deliberately.
2. Cross-tenant drift: Running an agent for multiple customers or domains can lead to context overlap that exposes sensitive data between users.
3. Prompt injection and jailbreaks: A crafted prompt can trick the agent into revealing its instructions or can steer the input away from what was intended. Guardrails are needed to help the agent differentiate between user prompts and developer guidelines.
4. Third-party dependencies: External APIs used by the agent may log or store payloads without the user’s knowledge.
Principles of Privacy-by-Design
Just as there are best practices for distributed software systems, we need standards for privacy-first architecture in agentic systems. The most important principles in use today:
1. Isolation by Context: Each agent instance should operate in a short-lived sandbox with its own independent execution and memory space, much like an ephemeral container or EC2 instance that provides a consistent, isolated environment. Context should be passed explicitly, never implicitly, and when the task ends the memory is destroyed rather than persisted, as in the sketch below.
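A minimal sketch of this idea in Python, assuming a hypothetical AgentSession class; real deployments would back this with actual process or container isolation:

import uuid

class AgentSession:
    # Hypothetical short-lived sandbox: context is passed in explicitly,
    # and all working memory is destroyed when the task ends.
    def __init__(self, explicit_context: dict):
        self.session_id = str(uuid.uuid4())    # per-task identity
        self.context = dict(explicit_context)  # a copy, never a shared reference
        self.memory = []                       # scratch memory for this task only

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Destroy everything on exit: nothing persists beyond the task.
        self.context.clear()
        self.memory.clear()

# Usage: the context only exists inside the "with" block.
with AgentSession({"user_id": "u-123", "task": "book_ticket"}) as session:
    session.memory.append("intermediate reasoning step")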
2. Token-Level Redaction: Before any data reaches the model layer, it must pass through a sanitization filter. These filters use named entity recognition or lightweight regex scanners to identify common patterns such as credit card numbers, email addresses, or access keys, and replace them with placeholders that carry no meaning or context, removing the data's sensitivity.
import re

def redact_sensitive_data(text):
    # Common sensitive patterns: 16-digit numbers (possible card numbers) and email addresses.
    patterns = [r"\b\d{16}\b", r"[\w.+-]+@[\w-]+\.[\w.]+"]
    for p in patterns:
        text = re.sub(p, "[REDACTED]", text)
    return text
Even a simple pre-processor like this dramatically reduces downstream risk.
The above pre-processor function:
- Defines a function called redact_sensitive_data which takes an input string called “text”.
- Creates a list of regex patterns to search for:
  - 16-digit numbers (possible credit card numbers)
  - email addresses
- Loops through each regex pattern.
- For each pattern, uses re.sub() to:
  - Find all occurrences in the text
  - Replace them with "[REDACTED]"
- After processing all patterns, it returns the sanitized text as the output.
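For instance, running it on a synthetic example:

sample = "Card 4111111111111111 on file for fan@example.com"
print(redact_sensitive_data(sample))  # -> Card [REDACTED] on file for [REDACTED]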
3. Ephemeral Data Pipelines: Maintain a configured retention period after which logs, embeddings, and temporary caches expire automatically. It is a common theme in the industry that engineers retain data longer than needed for debugging and scientists retain it for training models; a strict retention configuration removes that temptation. A sketch of such a configuration follows.
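A sketch of what such a retention configuration could look like; the data classes and TTL values are illustrative assumptions, not a standard:

# Hypothetical retention policy, expressed in code so it can be enforced automatically.
RETENTION_POLICY = {
    "request_logs":  {"ttl_hours": 24},  # raw request/response logs
    "embeddings":    {"ttl_hours": 72},  # vector store entries
    "session_cache": {"ttl_hours": 1},   # agent scratch memory
}

def is_expired(data_class: str, record_age_hours: float) -> bool:
    # Fail closed: data classes without an explicit policy are treated as expired.
    ttl = RETENTION_POLICY.get(data_class, {"ttl_hours": 0})["ttl_hours"]
    return record_age_hours >= ttl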
4. Policy Enforcement as Code: Instead of burying privacy logic deep inside individual packages and modules, implement it as a dedicated policy engine. The engine evaluates every operation against rules defined by engineers and scientists, or policies defined by the product, covering who can access what, for how long, and under which role (see the sketch below).
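A toy illustration of policy-as-code; the roles, resources, and rule structure are hypothetical, and production systems often delegate this to a dedicated engine such as Open Policy Agent:

# Hypothetical policy rules: role -> set of (resource, action) pairs it may perform.
POLICIES = {
    "booking_agent": {("payment_api", "charge"), ("ticket_api", "reserve")},
    "support_agent": {("ticket_api", "read")},
}

def is_allowed(role: str, resource: str, action: str) -> bool:
    # Deny by default: anything not explicitly listed is blocked.
    return (resource, action) in POLICIES.get(role, set())

assert is_allowed("booking_agent", "ticket_api", "reserve")
assert not is_allowed("support_agent", "payment_api", "charge")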
5. Output Validation: Just like input sanitization, the output of an agentic system should be filtered before it leaves for other systems or external public APIs, so that data and privacy policies are maintained on the way out as well. A small sketch follows.
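For instance, the redaction helper from earlier can be reused on the way out, combined with a deny list; check_output below is a hypothetical sketch rather than a complete validator:

DENY_MARKERS = ["BEGIN PRIVATE KEY", "AWS_SECRET_ACCESS_KEY"]

def check_output(response: str) -> str:
    # Block outright if the response contains clearly restricted material.
    for marker in DENY_MARKERS:
        if marker in response:
            raise ValueError("Blocked: restricted content in agent output")
    # Otherwise redact anything that still looks like PII before it leaves the system.
    return redact_sensitive_data(response)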
Layered Architecture for Safe Autonomy
A practical way to think about privacy guardrails is as a reverse proxy around your agent. You can visualize them using the following layers:
- Input Filter: A filter that masks or rejects sensitive data before it reaches the agent.
- Policy Engine: A rule engine that checks user permissions and audit rules.
- Execution Sandbox: This is like a containerized environment where the agent runs with scoped, temporary credentials.
  - This layer can be broken down further into the LLM layer, the planning layer, and the action layer.
- Output Filter: Similar to input filter, it redacts or flags responses that contain restricted information.
- Observability Layer: Collects telemetry for monitoring the system.
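Stitched together, these layers can be expressed as a simple pipeline. The skeleton below reuses the hypothetical helpers sketched earlier (redact_sensitive_data, is_allowed, AgentSession, check_output); run_agent stands in for whatever LLM, planning, and action stack the system actually uses:

def run_agent(prompt: str, context: dict) -> str:
    # Placeholder for the LLM, planning, and action layers.
    return f"Booked ticket for task: {context.get('task', 'unknown')}"

def handle_request(user_input: str, role: str, context: dict) -> str:
    # An observability hook (metrics, traces) would wrap each of these steps in a real system.
    clean_input = redact_sensitive_data(user_input)        # input filter
    if not is_allowed(role, "ticket_api", "reserve"):      # policy engine
        raise PermissionError("Operation not permitted for this role")
    with AgentSession(context) as session:                 # execution sandbox
        raw_output = run_agent(clean_input, session.context)
    return check_output(raw_output)                        # output filter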
Cascading Effects of Corruption in Agentic AI Systems
Even with multi-layer filtering in place, we still see data leaks, such as the incidents involving Salesforce's Agentforce.
If one layer is compromised by an incorrect configuration, prompt injection, or unexpected API behavior, it can lead to failure that ripples outward and contaminates the entire pipeline.
Effects of a compromised software system
Layer | If Compromised | System-Wide Effect |
Input Filter | Malicious or sensitive data passes undetected | LLM memorizes or exposes confidential tokens |
Policy Engine | Elevated permissions | Unauthorized actions by the agent |
LLM | Corrupt input can shift reasoning or output bias | All downstream steps inherit corrupted logic |
Orchestrator/ Planner | Workflow routing hijacked | The agent executes incorrect or dangerous sequences |
Output Filter | No validation of generated text | Sensitive data or harmful content reaches end users |
Observability Layer | Raw data logged persistently | Compromised data re-enters training or analytics loops |
As discussed in the section above, multiple layers of security provide "defense in depth". But because these layers are sometimes tightly coupled through a shared vector store, session memory, cookies, and the like, a compromise at one layer can grant unintended access to the rest.
Guardrails as Containment Walls
Privacy guardrails provide a way to limit the blast radius of such a failure. They are similar to circuit breakers that protect the rest of the system when one component overheats from a power surge. It is like de-coupling the system at runtime, also known as pulling the "andon cord" in the software industry.
A robust design ensures that:
- Each layer is able to operate in a sandboxed context.
- Every interaction between layers has a data contract with a strict schema defining what can be passed forward, much like an API contract between services (a sketch follows this list).
- Uncertainty defaults to blocking, not allowing, an operation.
- Every layer emits telemetry as part of the observability layer that enables anomaly detection and rollback.
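One way to express such a data contract is a typed schema enforced at every layer boundary; the fields below are illustrative assumptions:

from dataclasses import dataclass

@dataclass(frozen=True)
class LayerMessage:
    # Strict contract for what one layer may pass to the next.
    session_id: str        # sandbox identity, never the raw user identity
    redacted_payload: str  # text that has already passed the input filter
    policy_decision: bool  # True only if the policy engine explicitly allowed it

def forward(msg: LayerMessage) -> LayerMessage:
    # Uncertainty defaults to blocking, not allowing.
    if not msg.policy_decision:
        raise PermissionError("Blocked at layer boundary")
    return msg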
In other words, guardrails don't just prevent bad behavior; they contain it.
Testing Guardrail Effectiveness
Any good software system should have a feedback loop that collects metrics to determine whether it is running optimally. It is the same with the yearly maintenance of your car: the technician reads the error codes and decides whether anything needs to be replaced. Similarly, a good security system needs to be tested often, with both real-world and synthetic data, to measure how the guardrails perform. Bad actors are always looking to compromise software systems, and staying ahead means continually training and re-evaluating the guardrails. Useful metrics include:
- Redaction accuracy: This is the percentage of sensitive data that was correctly masked by the system. The higher the number, the more accurate the guardrail performance is.
- Latency overhead: The time added per request due to the additional privacy guardrails.
- False positives: How often did the system incorrectly mask or censor data that was not sensitive and was needed by the next layer? False positives are always tricky and challenging to fix.
The goal of any software engineer is to develop a system that optimizes for both precision and performance.
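A small harness like the hypothetical sketch below, run against labeled synthetic data, can track redaction accuracy and false positives for the earlier redaction helper:

# Synthetic, labeled test cases: (input text, should_be_redacted)
TEST_CASES = [
    ("my card is 4111111111111111", True),
    ("contact me at fan@example.com", True),
    ("the game starts at 7pm", False),
]

def evaluate(redactor) -> dict:
    true_pos = false_pos = false_neg = 0
    for text, sensitive in TEST_CASES:
        redacted = "[REDACTED]" in redactor(text)
        if redacted and sensitive:
            true_pos += 1
        elif redacted and not sensitive:
            false_pos += 1   # harmless data was censored
        elif not redacted and sensitive:
            false_neg += 1   # sensitive data slipped through
    total_sensitive = sum(1 for _, s in TEST_CASES if s)
    return {
        "redaction_accuracy": true_pos / total_sensitive,
        "false_positives": false_pos,
        "missed_redactions": false_neg,
    }

print(evaluate(redact_sensitive_data))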
A Real-World Example
To scale agentic solutions in the real world, engineers need to strike a good balance between deterministic guardrails, such as a rule engine, and non-deterministic guardrails, such as a model-based classifier. For example, if terms and conditions must be accepted while logging in to create a profile, that high-stakes action should be protected by both kinds of guardrail. This two-layered approach can prevent an incorrect action from being performed because of agent hallucination, as in the sketch below.
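A hedged sketch of that two-layered check; classify_with_llm is a stand-in for whatever model-based judgement the system uses, not a real API:

def classify_with_llm(action: dict) -> bool:
    # Placeholder for a non-deterministic, model-based judgement
    # (e.g. "does this action match the user's stated intent?").
    return action.get("intent_confirmed", False)

def approve_profile_creation(action: dict) -> bool:
    # Deterministic rule: terms and conditions must be explicitly accepted.
    if not action.get("terms_accepted", False):
        return False
    # Non-deterministic check: the model must also agree the action is intended.
    return classify_with_llm(action)

assert approve_profile_creation({"terms_accepted": True, "intent_confirmed": True})
assert not approve_profile_creation({"terms_accepted": False, "intent_confirmed": True})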
Some of the Lessons I Learnt
I learned the following after deploying and scaling an agentic system:
- Guardrails help the engineering team deploy features at scale without constantly worrying about security implications and bad actors.
- Privacy needs to be audited frequently, and the benchmark for the guardrails needs to be updated almost every week. This means both testing the service with new data and actively trying to fool the existing guardrails.
- Treat sensitive data carefully. Use synthesized sensitive data for testing in non-production environments. In production, mask the logs as much as possible, and when sensitive fields must be logged, use a logging system that redacts them or tightly restricts access to them.
Building trust is the real competitive advantage. Users, regulators, and product teams can all move faster when they know the system is designed to fail safely.
Closing Thoughts
With the shift towards agentic technology, the decision-making layer has been abstracted away: software no longer just follows instructions based on a static set of if/else conditions. This increases the autonomy of each piece of code we write and can lead to outcomes that diverge non-deterministically from what was intended. We could see more software reliability concerns and more privacy leaks, both of data and of intellectual property. Privacy guardrails, including clear ethical and technical boundaries, become crucial during the design phase of a software architecture. The future of AI will be determined by whether agentic systems can be trusted at scale, backed by a robust set of security features.