Traditional CI/CD pipelines are built on a comforting lie:

If output == expected, ship it.

AI agents ruin that illusion.

LLM-powered agents don’t produce outputs; they produce distributions. Ask the same question twice and you might get two different phrasings, two different structures, or two different but equally defensible answers.

So how do you build a green CI/CD pipeline for a system that is probabilistic by design?

This post is about how I’ve learned to test non-deterministic agents without neutering them, lying to myself, or turning CI into a flaky nightmare.


The Core Problem

Traditional tests assume determinism: the same input produces the same output, and pass/fail is binary.

AI agents offer none of that: the wording, the structure, and sometimes the substance of a response vary from run to run.

If you test them the old way, you’ll end up with brittle exact-match assertions, flaky builds, and a CI signal nobody trusts.

So we need new testing primitives.
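To make that concrete, here’s the kind of test that falls apart, sketched with a hypothetical run_agent wrapper around whatever agent is under test:

# A traditional exact-match test pointed at an agent.
# Run it twice and it can pass once and fail once,
# even when both answers are perfectly acceptable.
def test_password_reset_help():
    answer = run_agent("How do I reset my password?")  # hypothetical agent wrapper
    assert answer == "You can reset your password via the email link."  # brittle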


Principle #1: Stop Testing Outputs. Start Testing Behavior.

Instead of asking:

“Did the agent say exactly this?”

Ask:

“Did the agent accomplish the task, respect its constraints, and avoid making things up?”

This mindset shift is everything.
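A behavioral check asserts properties of the answer instead of its exact text. A rough sketch, again using the hypothetical run_agent wrapper:

answer = run_agent("How do I reset my password?")

# Properties we care about, not the exact phrasing:
assert "password" in answer.lower()      # stays on topic
assert "email" in answer.lower()         # points to the right channel
assert "refund" not in answer.lower()    # respects scope constraints
assert len(answer) < 1200                # doesn't ramble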


LLM-as-a-Judge: Let AI Test AI

One of the most practical patterns I’ve used is LLM-as-a-Judge.

The idea: a second model grades the agent’s output against an explicit rubric, instead of your test comparing strings.

This sounds sketchy until you realize that evaluating an answer against a rubric is a far more constrained task than generating that answer in the first place, and it’s how human reviewers already work: against criteria, not exact text.

What the judge checks: correctness, completeness, constraint adherence, and hallucination risk.


Example: LLM-as-a-Judge in CI

import json

def judge_response(task, agent_output):
    # Build a grading rubric for the judge model.
    rubric = f"""
    You are grading an AI agent.

    Task:
    {task}

    Agent Output:
    {agent_output}

    Score the response from 1 to 5 on:
    - Correctness
    - Completeness
    - Constraint adherence
    - Hallucination risk

    Return JSON only:
    {{
      "score": <int>,
      "reason": "<brief explanation>"
    }}
    """

    # call_llm is a thin wrapper around your LLM client of choice.
    raw_verdict = call_llm(
        model="gpt-4o",
        prompt=rubric
    )

    # The judge was told to return JSON only, so parse it directly.
    return json.loads(raw_verdict)

Then your CI assertion becomes:

result = judge_response(task, agent_output)
assert result["score"] >= 4, result["reason"]  # surface the judge's reasoning on failure

You’re no longer testing exact words.
You’re testing quality thresholds.


Semantic Similarity > String Equality

String equality checks are useless for LLMs.

Instead, test meaning.

What this looks like: embed the expected answer and the agent’s actual answer, compute their cosine similarity, and pass the test if it clears a threshold.

Example: Semantic Similarity Assertion

from sentence_transformers import SentenceTransformer, util

# Small, fast embedding model; sufficient for assertion-level similarity checks.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_match(expected, actual, threshold=0.85):
    # Embed both texts and compare them by cosine similarity.
    emb_expected = model.encode(expected, convert_to_tensor=True)
    emb_actual = model.encode(actual, convert_to_tensor=True)
    similarity = util.cos_sim(emb_expected, emb_actual)
    return similarity.item() >= threshold

CI test:

assert semantic_match(
    expected="User must reset password via email",
    actual=agent_output
)

This lets the agent rephrase, reorder, or restructure its answer without breaking the build.
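One practical tip: before hard-coding a threshold, print raw similarities for a known-good paraphrase and a clearly wrong answer, then pick a value that separates them in your domain. A minimal sketch, reusing the model and util imports from the snippet above (the example sentences are hypothetical):

expected = "User must reset password via email"
paraphrase = "The user has to reset their password through an email link"
unrelated = "Refunds are processed within 5 business days"

# A close paraphrase should score well above an unrelated answer;
# the exact numbers depend on the embedding model you use.
for candidate in (paraphrase, unrelated):
    score = util.cos_sim(
        model.encode(expected, convert_to_tensor=True),
        model.encode(candidate, convert_to_tensor=True),
    ).item()
    print(f"{score:.2f}  {candidate}")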


The Flaky Test Nightmare (and How to Survive It)

Non-determinism creates flaky tests by default.

If you pretend otherwise, your CI will slowly become ignored.

The real trade-off: the stricter your assertions, the more often the build fails for no real reason; the looser they are, the more real regressions slip through silently.

You can’t eliminate this tension.
You can only manage it.


Patterns That Actually Work

1. Multiple Runs, Majority Vote

Run the same test 3–5 times.

scores = [judge_response(task, run())["score"] for _ in range(5)]
assert sum(s >= 4 for s in scores) >= 4  # at least 4 of 5 runs must clear the bar

You’re testing stability of behavior, not single outputs.


2. Test Invariants, Not Answers

Examples of invariants: the output parses as valid JSON, required fields are present, no secrets or system-prompt text leaks through, and only whitelisted tools get called.

These are binary, even if outputs aren’t.
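A minimal sketch of those checks, assuming the agent hands back a raw JSON string plus a list of tool names it called (the field names and whitelist here are hypothetical):

import json

ALLOWED_TOOLS = {"search_kb", "send_reset_email"}  # hypothetical tool whitelist

def check_invariants(raw_output, tool_calls):
    # Invariant 1: the output is valid JSON with the fields we rely on.
    payload = json.loads(raw_output)  # raises, and fails the test, if not valid JSON
    assert "answer" in payload and "sources" in payload

    # Invariant 2: no leakage of secrets or internal instructions.
    lowered = payload["answer"].lower()
    assert "api_key" not in lowered
    assert "system prompt" not in lowered

    # Invariant 3: the agent only used tools it is allowed to use.
    assert set(tool_calls) <= ALLOWED_TOOLS

Each of these is a hard yes/no, so they never flake for purely stylistic reasons.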


3. Tiered CI Gates

Not every test needs to block deploys.

Example: hard gates (invariants, safety checks) block the merge; soft gates (judge scores, similarity thresholds) report and warn but don’t block; nightly eval suites track quality trends without gating anything. A sketch of one way to wire this up follows below.

CI doesn’t have to be cruel to be useful.
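One way to express the tiers, assuming pytest with custom markers (the marker names, run_agent, and run_agent_with_trace are hypothetical stand-ins):

import pytest

# Register hard_gate / soft_gate in pytest.ini to avoid unknown-marker warnings.

@pytest.mark.hard_gate
def test_tool_whitelist_invariant():
    # Deterministic invariant: a failure here blocks the deploy.
    raw_output, tool_calls = run_agent_with_trace("How do I reset my password?")
    check_invariants(raw_output, tool_calls)

@pytest.mark.soft_gate
def test_answer_quality():
    # Probabilistic quality check: reported in CI, but allowed to fail.
    task = "How do I reset my password?"
    result = judge_response(task, run_agent(task))
    assert result["score"] >= 4, result["reason"]

CI then runs pytest -m hard_gate as a required step and pytest -m soft_gate as a non-blocking one, so quality signals stay visible without holding every deploy hostage.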


What This Changed for Me

Before: exact-match assertions, flaky builds, and agent tests I quietly stopped trusting.

After: behavioral checks, quality thresholds, tiered gates, and failures that actually mean something.

The goal isn’t perfection.
It’s confidence under uncertainty.


Final Thought

Testing AI agents isn’t about forcing determinism.

It’s about answering one question honestly:

“Did this agent behave acceptably under uncertainty?”

If your CI/CD pipeline can answer that,
you’re already ahead of most teams.