Yet another “control plane”! Why?

Anyone who has built or managed platforms at scale is likely familiar with control planes, though perhaps not in the context of AI. Typical control planes you might already know include the Kubernetes control plane, service-mesh control planes, and API gateway management layers.


LLM/Agentic/AI applications need the same idea because the ‘request’ isn't only an HTTP call. It’s a prompt, often with retrieved context, a chain of tool calls, and non-deterministic outputs, plus security risks like prompt injection and sensitive-data exfiltration.


So, treat prompts, model calls, retrieval, and tools as first-class production components that are instrumented and governed like microservices and data.

What is the AI Control Plane?

The AI control plane is a shared platform layer that sits on the execution path of your LLM workloads to provide:


  1. Observability: traces, metrics, logs for model and tool calls
  2. Quality controls: evals, canaries, regression tests, drift detection
  3. Policy enforcement: data/PII rules, tool permissions, safety filters, schema checks
  4. Cost management: budgets, attribution, rate limits, token quotas, showback/chargeback
  5. Operational safety: circuit breakers for agent loops, caching, fallbacks


What it is not: your app logic. If your “control plane” contains app decision logic, it becomes a bottleneck. Keep it thin, consistent, and ubiquitous.

Start with a request “envelope”/metadata (the minimum contract)

Similar to a standard header set for your microservices, define what every LLM interaction must carry.


Example header fields:


Example header:

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "acme-co",
  "user_id_hash": "u_8f3b…",
  "conversation_id": "conv_2026_01_18_001",
  "prompt_id": "support_reply",
  "prompt_version": "v17",
  "policy_profile": "support-prod",
  "model_route": "primary:gpt-4.x fallback:gpt-4o-mini",
  "data_classification": "confidential",
  "budget_key": "team:support feature:assistant",
  "env": "prod",
  "release_version": "2026.01.18"
}


This “envelope”/“header” becomes the common correlation key across traces, evals, incidents, and costs.
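
Conceptual example: validating the envelope at the edge of every service (Python pseudocode; the required-field list mirrors the example header above and should be adapted to your own schema):

# Pseudocode only: reject requests that arrive without the minimum envelope.
REQUIRED_FIELDS = {
    "trace_id", "tenant_id", "prompt_id", "prompt_version",
    "policy_profile", "model_route", "data_classification",
    "budget_key", "env", "release_version",
}

def validate_envelope(envelope: dict) -> dict:
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        # Fail fast in prod; lower environments might log and continue instead.
        raise ValueError(f"envelope missing required fields: {sorted(missing)}")
    return envelope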


Observability: instrument prompts like microservices

Distributed tracing: every model call is a span, and every tool call is a child span

An LLM app is a distributed system:


So instrument it with the same discipline:


Use standard semantic conventions (such as the OpenTelemetry GenAI conventions) where possible, and propagate trace context headers for cross-service correlation.
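
Conceptual example: propagating trace context to a downstream tool service with OpenTelemetry’s propagation API (Python pseudocode; the tool-service URL is a placeholder):

# Pseudocode only: carry W3C traceparent/tracestate across service boundaries.
import requests
from opentelemetry.propagate import inject

def call_tool_service(payload: dict) -> dict:
    headers = {}
    inject(headers)  # injects trace context headers for the active span
    resp = requests.post("https://tools.internal/search", json=payload, headers=headers)
    return resp.json()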


1) Practical span attributes you’ll actually use


Note: content capture (storing full prompts/responses) is high risk.


Code Example: wrapping an LLM call with OpenTelemetry (python pseudocode)

import time
from opentelemetry import trace, metrics

tracer = trace.get_tracer("ai.app")
meter = metrics.get_meter("ai.app")

# Metric instruments aligned with the OpenTelemetry GenAI semantic conventions.
token_usage = meter.create_histogram("gen_ai.client.token.usage", unit="{token}")
op_duration = meter.create_histogram("gen_ai.client.operation.duration", unit="s")

def call_llm(envelope, model, messages, temperature=0.2, max_tokens=600):
    start = time.perf_counter()

    span_name = f"chat {model}"
    with tracer.start_as_current_span(span_name) as span:
        # Span attributes: what was called, for whom, under which prompt/budget.
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "openai")  # or derive from your routing config
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.conversation.id", envelope["conversation_id"])
        span.set_attribute("prompt.id", envelope["prompt_id"])
        span.set_attribute("prompt.version", envelope["prompt_version"])
        span.set_attribute("budget.key", envelope["budget_key"])

        # DO NOT store raw messages/content by default.
        # llm_client is your provider SDK client (pseudocode; not shown here).
        resp = llm_client.chat(messages=messages, temperature=temperature, max_tokens=max_tokens)
        dur = time.perf_counter() - start

        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)

        # Metrics aligned with GenAI conventions
        attrs = {
            "gen_ai.operation.name": "chat",
            "gen_ai.provider.name": "openai",
            "gen_ai.request.model": model,
        }
        op_duration.record(dur, attributes=attrs)
        token_usage.record(resp.usage.input_tokens, attributes={**attrs, "gen_ai.token.type": "input"})
        token_usage.record(resp.usage.output_tokens, attributes={**attrs, "gen_ai.token.type": "output"})
        return resp.output_text


2) Logging: avoid the raw text but keep the ‘why’


Traditional logs are usually safe because requests are mostly structured and expected.


LLM logs are a bit different. Prompts can contain:


So the control plane should support tiered logging:

Tier A (always on, safe metadata)


Tier B (redacted/samples, controlled by incident workflow)


Tier C (never in central logs)
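
Conceptual example of the three tiers in code (Python pseudocode; the redact function, sampling rate, and field names are assumptions):

# Pseudocode only: Tier A metadata always, Tier B redacted samples, Tier C never.
import logging
import random

log = logging.getLogger("ai.app")

def log_llm_interaction(envelope, resp, redact, sample_rate=0.01, incident_mode=False):
    # Tier A: safe metadata, always on.
    log.info("llm_call", extra={
        "trace_id": envelope["trace_id"],
        "prompt_id": envelope["prompt_id"],
        "prompt_version": envelope["prompt_version"],
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    })
    # Tier B: redacted content, sampled or switched on by an incident workflow.
    if incident_mode or random.random() < sample_rate:
        log.info("llm_content_sample", extra={"output_redacted": redact(resp.output_text)})
    # Tier C: raw prompts/responses never go to central logs.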

PII redaction: build a “PII firewall” (and don’t trust it blindly)

PII redaction should happen in two places:

  1. Before the model call (don’t send what you shouldn’t)
  2. Before persistence (don’t store what you shouldn’t)


Some tools (e.g., Microsoft Presidio) can detect and anonymize PII, but there is no guarantee they will catch everything, so you still need layered protections.


A good PII redaction pipeline would look like:


Conceptual example:

# Pseudocode only
entities = pii_detector.detect(text)
if policy.requires_redaction(entities):
    text = pii_detector.redact(text)
    span.set_attribute("policy.pii.redacted", True)
    span.set_attribute("policy.pii.types", [e.type for e in entities])


Policy enforcement: guardrails must be code, not “a better prompt”

Prompts can help, but prompts are not enforcement. LLMs are inherently probabilistic, and prompts are suggestions to a probabilistic system.
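
Conceptual example of enforcement in code in front of tool execution (Python pseudocode; the policy-profile structure and tool names are illustrative):

# Pseudocode only: deny by default, allow only what the policy profile grants.
POLICY_PROFILES = {
    "support-prod": {"allowed_tools": {"search_kb", "create_ticket"}, "max_output_tokens": 600},
}

def authorize_tool_call(envelope: dict, tool_name: str) -> None:
    """Raise before the tool runs if the policy profile does not allow it."""
    profile = POLICY_PROFILES[envelope["policy_profile"]]
    if tool_name not in profile["allowed_tools"]:
        raise PermissionError(
            f"tool '{tool_name}' not permitted by profile '{envelope['policy_profile']}'"
        )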


A control plane enforces policy at Policy Decision Points (PDPs) such as:

An enterprise-grade set of policies worth enforcing

Data policies


Tool policies


Output policies


Rate limiting: tokens are the new performance currency

In microservices, we rate-limit requests. LLM apps must rate-limit the following:

If you only rate-limit requests, an attacker (or a buggy agent) can still burn budget by generating huge outputs or looping tool calls.
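
Conceptual example of a token-aware limiter keyed on budget_key (Python pseudocode; the quota numbers are illustrative, and a real deployment would back this with a shared store such as Redis rather than in-process state):

# Pseudocode only: budget-based limiting on tokens, not just requests.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
TOKENS_PER_WINDOW = 50_000  # illustrative per-budget_key quota

_usage = defaultdict(lambda: {"window_start": 0.0, "tokens": 0})

def check_token_budget(budget_key: str, estimated_tokens: int) -> None:
    bucket = _usage[budget_key]
    now = time.monotonic()
    if now - bucket["window_start"] > WINDOW_SECONDS:
        bucket["window_start"], bucket["tokens"] = now, 0
    if bucket["tokens"] + estimated_tokens > TOKENS_PER_WINDOW:
        raise RuntimeError(f"token quota exceeded for {budget_key}")
    bucket["tokens"] += estimated_tokens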

You should enforce:


FinOps for LLM apps: LLM calls are expensive, so cost management isn’t optional

FinOps is an operating model that creates financial accountability through collaboration between engineering, finance, and business teams. It is used extensively in the cloud context, and the same definition applies to LLM applications because costs scale with:

The control plane makes cost a first-class signal

If telemetry captures budget_key, prompt_version, model, and token usage, you can do what mature cloud FinOps teams do:
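
For example, a minimal showback roll-up from the telemetry you already capture (Python pseudocode; the per-token prices are placeholders, not real rates):

# Pseudocode only: roll token usage up into cost per budget_key.
from collections import defaultdict

PRICE_PER_1K = {  # placeholder prices per 1K tokens, keyed by (model, token type)
    ("gpt-4.x", "input"): 0.01,
    ("gpt-4.x", "output"): 0.03,
}

def showback(records):
    """records: iterable of dicts with budget_key, model, input_tokens, output_tokens."""
    costs = defaultdict(float)
    for r in records:
        costs[r["budget_key"]] += (
            r["input_tokens"] / 1000 * PRICE_PER_1K[(r["model"], "input")]
            + r["output_tokens"] / 1000 * PRICE_PER_1K[(r["model"], "output")]
        )
    return dict(costs)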

Low-effort cost wins that don’t compromise on quality


Evals: the must-have quality gate for prompt & tool changes

Microservices use unit tests, contract tests, canaries, and SLOs.

LLM apps need the same, along with content evaluation.


Evaluations (evals) are explicit tests of the LLM output against your expectations; these are essential for reliability, especially when changing models or prompt versions.

A practical eval stack that works

Level 1: Prompt unit tests (fast, deterministic-ish)


Level 2: Golden-set regression suite (CI gate)


Level 3: Shadow evaluation in production


Level 4: Human-in-the-loop for high-stakes domains


Pro tip: treat eval datasets like code. Version them. Review changes. Track coverage by scenario.
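
Conceptual example of a Level 1/Level 2 style check that can run in CI (Python pseudocode; run_prompt and the golden-set case format are assumptions about your own harness):

# Pseudocode only: deterministic checks on structure and must-have content.
import json

def eval_case(case, run_prompt):
    """case: {"input": ..., "must_contain": [...], "expect_json": bool}"""
    output = run_prompt(case["input"])  # the prompt under test, pinned to a version
    if case.get("expect_json"):
        json.loads(output)  # hard fail on malformed structure
    for phrase in case.get("must_contain", []):
        assert phrase.lower() in output.lower(), f"missing required phrase: {phrase}"

def run_golden_set(cases, run_prompt, min_pass_rate=0.95):
    passed = 0
    for case in cases:
        try:
            eval_case(case, run_prompt)
            passed += 1
        except Exception:
            pass
    assert passed / len(cases) >= min_pass_rate, "golden-set regression: block the release"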

Drift detection: what changes even when you don’t deploy

LLM systems drift for reasons that don’t show up in Git history:


Use standards such as NIST’s AI RMF (and its generative AI profile) as a reference point for thinking about operational risk over time and not just “does it work today?”

Drift signals worth monitoring


A good control plane turns drift into dashboards and alerts, not vague “users say it feels worse.”
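
Conceptual example of one such alert: flag when a rolling eval-score or token-usage average moves away from its baseline (Python pseudocode; the threshold is illustrative):

# Pseudocode only: turn drift into an explicit alert instead of anecdotes.
from statistics import mean, stdev

def drift_alert(baseline: list[float], recent: list[float], z_threshold: float = 3.0) -> bool:
    """baseline: e.g. last month's daily eval pass rates; recent: this week's."""
    if len(baseline) < 2 or not recent:
        return False
    z = abs(mean(recent) - mean(baseline)) / (stdev(baseline) or 1e-9)
    return z > z_threshold  # page someone / annotate the dashboard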

Reference architecture: how the AI Control Plane fits

Here’s a vendor-neutral conceptual architecture that can be adopted:

If you’re already managing microservices with OpenTelemetry, this is a natural extension, just with additional guardrails. LangFuse is a good open-source framework to get started on the AI control plane.

Conclusion: you can’t scale what you can’t explain

Most “LLM incidents” aren’t model bugs. They are LLM app bugs: missing trace context, unknown prompt versions, unmetered tool loops, or absent policy-enforcement boundaries. An AI control plane doesn’t improve the intelligence of a model. It makes the overall system manageable and reliable, which is what allows teams to run and scale LLM applications safely.