Synthetic queries, failure taxonomies, and the iteration loop to catch 'em all

Your agent passed the vibe check. Coherent outputs. Grounded citations. No hallucinations. Then it failed in production, and you couldn't figure out why.


The output looked correct. The metrics showed nothing unusual. But something went wrong that output-only evaluation would never catch.


Consider a system that helps doctors write prior authorization appeal letters. It gathers the diagnosis, matches it against coverage rules, and drafts a persuasive argument for the insurer. What takes a physician an hour, this system does in a minute.


The letters have correct diagnosis codes, proper formatting, coverage criteria quoted verbatim. But claims keep getting denied. The system retrieves stale policy documents. The insurer updated their rules in January.


How do you catch this before your users do?


This article introduces a pattern for bootstrapping evals with synthetic query generation. Define the axes along which your agent's behavior varies, generate structured test queries across those axes, and iterate on the failures you find. The pattern works before you have real traffic, and keeps working after.

Why You Need Synthetic Queries

Evals require test data. Early on, you don't have any.


Synthetic queries solve two problems: cold start and diversity gaps. Ask an LLM to "generate 50 test queries," and the results cluster around the same phrasing, the same scenarios, the same complexity level. To cover a wide range of failure modes, you need structure.

The Dimensions → Tuples → Queries Pattern

Your agent's behavior varies along specific axes. For a prior authorization writer, the payer matters. So does the patient's age: pediatric cases have different criteria.


These axes are your dimensions. Each dimension has a set of possible values. The cross product of all dimensions defines your test space.


The pattern works in three steps:


  1. Define dimensions as a schema. Each field represents an axis along which behavior should vary.
  2. Generate or specify options for each dimension. Five payers, three age categories, four complexity levels, ten states.
  3. Combine options into tuples. Convert each tuple to a natural language query.


The cross product guarantees combinatorial coverage. Every combination is reachable, including edge cases like "pediatric + off-label + California + Cigna" that an LLM would never generate on its own.
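
The cross product itself is a one-liner with the standard library; a minimal sketch with illustrative dimension values:

```python
from itertools import product

# Illustrative dimension values; real options come from your domain.
payers = ["Cigna", "Aetna"]
ages = ["pediatric", "adult"]
usages = ["on-label", "off-label"]

tuples = list(product(payers, ages, usages))
print(len(tuples))     # 8 combinations: 2 * 2 * 2
print(tuples[0])       # ('Cigna', 'pediatric', 'on-label')
```

Every edge combination, including "Cigna + pediatric + off-label", is guaranteed to appear somewhere in the list.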


An alternative: ask the LLM to propose tuples directly. You prompt the model to generate realistic combinations given your dimensions and domain context. This produces more natural data because the LLM avoids implausible pairings. The trade-off: it may miss edge cases the cross product would catch.


Cross-product is exhaustive but blind. LLM generation is selective but biased toward common scenarios.


Either way, inspect a sample before scaling up. Some combinations are impossible. "Pediatric + Medicare" makes no sense (Medicare covers seniors, not children). "Off-label + cosmetic procedure" might be contradictory. Catching these early saves you from generating hundreds of useless queries.
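
A cheap guard is a rule-based filter over the raw tuples before you spend tokens on query generation; the rules below are illustrative:

```python
def is_plausible(payer: str, age: int) -> bool:
    """Reject combinations that cannot occur in practice."""
    # Medicare covers seniors, not children (illustrative rule).
    if payer == "Medicare" and age < 18:
        return False
    return True

raw = [("Medicare", 7), ("Cigna", 7), ("Medicare", 70)]
plausible = [t for t in raw if is_plausible(*t)]
print(plausible)  # [('Cigna', 7), ('Medicare', 70)]
```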

Bias Toward Failure Modes: The CTO Framework

Agentic systems fail in three distinct ways. Each requires different evaluation criteria.


Components are the building blocks: the retriever, the tool executor, the reasoning module.



Failures at this level (a retriever returning stale documents, a tool call that errors out) are component failures, relatively easy to catch with unit-test-style checks.
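
A component check can be a unit-test-style assertion on one step in isolation. A sketch, assuming retrieved documents carry a `policy_year` field (a hypothetical schema):

```python
def check_policy_freshness(retrieved: list[dict], current_year: int) -> bool:
    """Unit-test-style check: every retrieved policy is current."""
    return all(doc["policy_year"] >= current_year for doc in retrieved)

docs = [{"name": "cigna_dupixent.pdf", "policy_year": 2025}]
print(check_policy_freshness(docs, current_year=2026))  # False: stale policy
```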


Trajectories are the sequences of decisions: which tools to call, in what order, with what inputs, and how to combine results.



Trajectory failures require inspecting the full trace, not just individual steps.
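
A trajectory check inspects the sequence of tool calls rather than any single output. A sketch, assuming each step is recorded as a dict with a `tool` key (the format is an assumption):

```python
def used_formulary_api(steps: list[dict]) -> bool:
    """Trace-level check: the formulary API was consulted at all."""
    tools = [s["tool"] for s in steps]
    return "formulary_api" in tools

steps = [
    {"tool": "search", "input": "Cigna Dupixent coverage"},
    {"tool": "search", "input": "step therapy requirements"},
]
print(used_formulary_api(steps))  # False: the agent only used web search
```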


Outcomes are what ships to users: task completion, format, reliability over time.



Outcome failures require evaluation against real user needs.
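
An outcome check looks only at what the user receives. A sketch of a format check, assuming reviewers expect a sign-off checklist (the 50% threshold is arbitrary):

```python
def looks_like_checklist(output: str) -> bool:
    """Outcome check: output is a sign-off checklist, not a wall of prose."""
    lines = [l for l in output.splitlines() if l.strip()]
    checklist = [l for l in lines if l.lstrip().startswith(("-", "*", "[ ]"))]
    # Arbitrary heuristic: at least half the lines are checklist items.
    return len(checklist) >= len(lines) / 2

print(looks_like_checklist("- [ ] Diagnosis code verified\n- [ ] Policy cited"))  # True
```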

A Trajectory Failure in Action

Here's what a trajectory failure looks like in practice, using the prior authorization writer as an example. Every component succeeds, but the output is wrong:


Step 1: Search for coverage policy. The agent retrieves a PDF of Cigna's Dupixent policy. Current, relevant, authoritative.


Step 2: Search for step-therapy requirements. The system needs to know which medications the patient must try first. It searches and retrieves several results.


Step 3: Extract the drug list. The agent finds three different lists of medications. They don't agree. It picks the shortest one.


Step 4: Check patient history. The patient has tried both medications on the short list. Success.


Step 5: Draft letter. Well-structured, cites the policy, argues the patient qualifies.


Every component worked. But the letter will be denied.


The failure was in Step 2. A structured formulary API existed, but the system chose web search. What came back? A WebMD article. A competitor's marketing page. A Reddit thread from two years ago. The system couldn't distinguish authoritative from unreliable. It picked the shortest list because that's all it had. That list was wrong.


Where do these failure modes come from? You find them by looking at traces.

From Traces to Goals

Before you can bias query generation toward failure modes, you need to know what they are. That requires error analysis.


1. Collect traces. Gather 50-100 traces from production or synthetic usage. Each trace should capture the full trajectory: plan, tool calls, retrieved documents, reasoning, and output.


If you haven't instrumented your agent yet, start there. Tools like Langfuse and Logfire capture traces with minimal code changes.


A minimal trace:

trace = {
    "query": "Does Cigna cover Dupixent for pediatric eczema?",
    "steps": [
        {"tool": "search", "input": "Cigna Dupixent coverage", "output": "policy_2025.pdf"},
        {"tool": "search", "input": "step therapy requirements", "output": ["webmd.com/...", "reddit.com/..."]},
        {"tool": "extract", "input": "prior therapy list", "output": ["methotrexate"]},
    ],
    "output": "Letter drafted successfully.",
    "outcome": "denied",
}


2. Write freeform notes. For each trace, describe what went wrong in plain language. No rubric. No scoring:


notes = [
    "Stale policy retrieved: 2025 instead of 2026.",
    "Chose web search over formulary API for step therapy.",
    "Letter correct but wrong tone for Cigna.",
    "Hallucinated a citation that doesn't exist.",
    "Ignored conflicting evidence from FDA label.",
]


Observe first, categorize later. After 50 traces, your notes will start repeating ("stale policy" and "outdated reference" are the same issue). Group similar notes. Count instances. When new traces fit existing categories, you've seen enough to act.
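
Once notes start repeating, the grouping step is mechanical. A minimal sketch with `collections.Counter`, where the note-to-category mapping is something you build by hand while reading:

```python
from collections import Counter

# Hand-built mapping from recurring note phrasings to a shared category.
category_of = {
    "stale policy": "policy_freshness",
    "outdated reference": "policy_freshness",
    "web search over formulary": "source_selection",
}

notes = ["stale policy", "outdated reference", "web search over formulary"]
counts = Counter(category_of[n] for n in notes)
print(counts)  # Counter({'policy_freshness': 2, 'source_selection': 1})
```

The categories with the highest counts become your first goals.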


Yes, this is manual work. But it's the only reliable way to learn how your system actually fails, not how you imagine it might.


3. Turn categories into goals. Each repeating cluster becomes a goal: a description specific enough that an LLM can generate queries targeting that exact failure mode.


Four principles for writing effective goals:


Be explicit, not aspirational. "Use reliable sources" gives the LLM nothing to target. Name the tool, the failure, and the consequence:


Bad:

Use reliable sources for drug information.


Good:

The system has access to a formulary API. It should prefer the formulary API over web search for step-therapy requirements. When it falls back to web search, it retrieves consumer health sites and forum posts that list incorrect drug requirements.


Explain why it matters. Context helps the LLM vary the failure dimension. Explaining why stale policies are dangerous prompts the model to generate queries that probe different staleness levels (last week, last month, last year) instead of one generic scenario:


Bad:

Always use current policies.


Good:

It must always cite the most current version of each payer's policy. Stale guidelines are a compliance risk. Payers update coverage criteria quarterly, and a letter citing superseded rules will be denied on procedural grounds before clinical merit is even considered.


Preserve specifics from your notes. Generalizing a note away discards exactly the detail that makes the failure reproducible:


Bad:

Fabricated references are unacceptable.


Good:

Every clinical claim needs a traceable source from the retrieved documents; fabricated references cause the entire letter to be flagged for manual review.


One failure mode per goal. Packing freshness, source selection, conflict resolution, and formatting into a single goal prevents the LLM from isolating each concern. Split them into separate goals:


goals = """
1. POLICY FRESHNESS: The system must cite the most current version of
   each payer's policy. Payers update coverage criteria quarterly.
   A letter citing superseded rules will be denied on procedural
   grounds.

2. SOURCE SELECTION: The system has access to a formulary API and an
   internal policy database. It should prefer those over generic web
   search for drug lists and step-therapy requirements.

3. CONFLICT RESOLUTION: When two retrieved policies contradict each
   other, the assistant must surface the conflict explicitly instead
   of silently picking one.

4. OUTPUT FORMAT: Reviewers expect a structured checklist they can
   sign off on, not a wall of prose.
"""


Each goal maps to a cluster of freeform notes. Each is specific enough to generate queries that trigger that failure mode.


Implementation

For the code examples below, I'll use evaluateur, an open-source library that implements this pattern. The principles work with any tooling.


Start with a Pydantic model that defines your dimensions:


from pydantic import BaseModel, Field

class Query(BaseModel):
    payer: str = Field(
        ...,
        description="Insurance payer",
    )
    age: int = Field(
        ...,
        description="Patient's age",
    )
    indication: str = Field(
        ...,
        description="Clinical indication",
    )
    comorbidities: str = Field(
        ...,
        description="Comorbidities affecting the patient, different from the indication",
    )
    state: list[str] = Field(
        description="Patient's state",
        default=["California", "New York", "Texas"],
    )


Each field is a dimension. The description guides the LLM when generating options. To lock a dimension to a fixed set, use a list type with a default value (like state above).


Generate options for each dimension:


from evaluateur import Evaluator

# use the query model from the previous step
evaluator = Evaluator(Query, llm="anthropic/claude-haiku-4-5")

options = await evaluator.options(
    instructions=(
        "Focus on common US payers. "
        "Cover a wide range of ages, from newborn to elderly."
    ),
    count_per_field=5,
)


The results:


QueryOptions(
    payer=['UnitedHealth Group (UHC)', 'Cigna', 'Aetna', 'Humana', 'Blue Cross Blue Shield'],
    age=[2, 18, 45, 65, 82],
    indication=[
        'Type 2 Diabetes Mellitus',
        'Hypertension',
        'Metastatic Lung Cancer',
        'Heart Failure',
        'Atrial Fibrillation'
    ],
    comorbidities=['Chronic Kidney Disease', 'Obesity', 'Depression', 'COPD', 'Hyperlipidemia'],
    state=['California', 'New York', 'Texas']
)


Sample tuples from the cross product. By default, Farthest Point Sampling maximizes diversity across dimensions:


from evaluateur import TupleStrategy

# evaluator and options from the previous step
tuples = []
async for t in evaluator.tuples(
    options,
    strategy=TupleStrategy.CROSS_PRODUCT,
    count=3,
    seed=42,
):
    tuples.append(t)


The output:


[
    GeneratedTuple(
        payer='Blue Cross Blue Shield',
        age=65,
        indication='Hypertension',
        comorbidities='Hyperlipidemia',
        state='New York'
    ),
    GeneratedTuple(
        payer='Humana',
        age=45,
        indication='Heart Failure',
        comorbidities='Obesity',
        state='Texas'
    ),
    GeneratedTuple(
        payer='Cigna',
        age=2,
        indication='Metastatic Lung Cancer',
        comorbidities='Depression',
        state='California'
    )
]


Sometimes cross-product sampling gives you implausible combinations, like a 2-year-old with metastatic lung cancer and depression. You can use the LLM to generate tuples directly instead:


# evaluator and options from the previous step
tuples = []
async for t in evaluator.tuples(
    options,
    strategy=TupleStrategy.AI,
    count=3,
    instructions=(
        "Generate realistic, clinically coherent patient scenarios. "
        "Avoid impossible combinations (e.g. pediatric age with adult-onset diseases). "
    ),
):
    tuples.append(t)


The LLM-generated tuples:


[
    GeneratedTuple(
        indication='Metastatic Lung Cancer',
        payer='Blue Cross Blue Shield',
        comorbidities='Depression',
        age=18,
        state='California'
    ),
    GeneratedTuple(
        indication='Heart Failure',
        payer='Humana',
        comorbidities='Chronic Kidney Disease',
        age=82,
        state='Texas'
    ),
    GeneratedTuple(
        indication='Atrial Fibrillation',
        payer='Aetna',
        comorbidities='COPD',
        age=65,
        state='New York'
    )
]


Convert tuples to natural language queries:


# evaluator and tuples from the previous step
async for q in evaluator.queries(
    tuples=tuples,
    instructions=(
        "Be creative about the additional clinical context for each query. "
        "Write the queries from the perspective of a doctor writing a prior authorization request. "
        "Mention specific products in the request."
    ),
):
    print(q.query)


Query 1:


I need to submit a prior authorization request to Blue Cross Blue Shield for my 18-year-old patient in California with metastatic lung cancer who also has depression. The patient is a candidate for Opdivo (nivolumab) combined with Yervoy (ipilimumab) immunotherapy. Given the patient's documented depression requiring concurrent antidepressant therapy, what documentation should I include in the prior auth to support approval and address any concerns about treatment tolerability?


Query 2:


I'm requesting prior authorization from Humana for my 82-year-old patient in Texas with Stage 3 chronic kidney disease and New York Heart Association Class III heart failure. The patient has been declining on current therapy with metoprolol and lisinopril, and I need approval to initiate sacubitril/valsartan (Entresto) to improve cardiac function while managing the reduced renal clearance. Can you help me draft the clinical justification for this request?


Query 3:


I need to submit a prior authorization request to Aetna for my 65-year-old patient in New York with atrial fibrillation and COPD. The patient has had two episodes of paroxysmal AFib over the past three months despite being on metoprolol, and given his underlying COPD, I want to prescribe apixaban rather than warfarin for stroke prevention. Can you help me structure this request?


Now feed these queries to your agent, collect traces, and analyze the results. Then use your freeform notes to seed goal-guided generation:


from evaluateur import Evaluator, TupleStrategy

# evaluator and options from the previous step

goals = """
We're building an AI assistant for writing prior authorization letters.
It must always cite the most current version of each payer's policy. Stale
guidelines are a compliance risk. Every clinical claim in the output needs
a traceable source from the retrieved documents; fabricated references are
unacceptable.

The system has access to a formulary API and an internal policy database.
It should prefer those over generic web search. When two policies
contradict each other, the assistant must surface the conflict instead of
silently picking one.

Reviewers expect a structured checklist they can sign off on, not a wall
of prose.
"""

tuples = []
async for t in evaluator.tuples(
    options,
    strategy=TupleStrategy.AI,
    count=3,
    instructions=(
        "Generate realistic, clinically coherent patient scenarios. "
        "Avoid impossible combinations (e.g. pediatric age with adult-onset diseases). "
    ),
    seed=42,
):
    tuples.append(t)

async for q in evaluator.queries(
    tuples=tuples,
    goals=goals,
    goal_mode="cycle",
    instructions=(
        "Write the queries from the perspective of a doctor "
        "composing a prior authorization request on behalf of a patient. "
        "Mention specific products in the request. "
        "Be creative about the additional clinical context for each query."
    ),
):
    print(f"- {q.metadata.goal_focus} ({q.metadata.goal_category})\n  {q.query!r}")


The new queries target the goals inferred from your notes. Each query focuses on a single goal, cycling through them:


Goal 1: Current Policy Retrieval (components)


I need to request prior authorization from Humana for my 45-year-old patient with metastatic lung cancer and obesity who is a New York resident. The patient requires Opdivo (nivolumab) and Yervoy (ipilimumab) combination immunotherapy. Can you pull the current Humana formulary policy for these agents and confirm we have the latest coverage guidelines, especially since I recall older policies had different approval criteria for patients with BMI complications?


Goal 2: Tool Preference Ordering (trajectories)


I need to submit a prior authorization for my 82-year-old patient in California with COPD and heart failure who requires SGLT2 inhibitor therapy - specifically dapagliflozin for cardiorenal protection. Can you verify UnitedHealth Group's coverage requirements and any step therapy restrictions for this indication?


Goal 3: Structured Checklist Output (outcomes)


I need to submit a prior authorization request to Cigna for my 65-year-old patient in Texas with Type 2 Diabetes Mellitus and Chronic Kidney Disease. The patient requires treatment with Jardiance (empagliflozin) for glycemic control and renal protection. Can you generate a structured checklist-format prior authorization letter that lists all required medical justifications, clinical criteria, and reviewer sign-off items so our compliance team can easily validate each component before submission?

The Iteration Loop

The real power of this approach is in repetition. Each cycle follows four steps:


  1. Generate queries from your dimensions, guided by goals if you have them.
  2. Run them through your agent and collect traces.
  3. Analyze the traces. Write freeform notes about what broke.
  4. Convert those notes into goals for the next round.


Here is one cycle as a reusable function:


from collections.abc import Awaitable, Callable

from pydantic import BaseModel, Field
from evaluateur import Evaluator, TupleStrategy


class Query(BaseModel):
    payer: str = Field(..., description="Insurance payer")
    age: int = Field(..., description="Patient's age")
    indication: str = Field(..., description="Clinical indication")
    comorbidities: str = Field(
        ..., description="Comorbidities affecting the patient"
    )
    state: list[str] = Field(
        description="Patient's state",
        default=["California", "New York", "Texas"],
    )


async def eval_cycle(
    run_agent: Callable[[str], Awaitable[str]],
    goals: str | None = None,
    count: int = 10,
    seed: int = 0,
) -> list[dict]:
    """One generate-run-collect cycle. Returns results for manual review."""
    evaluator = Evaluator(Query, llm="anthropic/claude-haiku-4-5")

    results = []
    async for q in evaluator.run(
        tuple_strategy=TupleStrategy.AI,
        tuple_count=count,
        seed=seed,
        goals=goals,
        instructions=(
            "Write from a doctor's perspective. "
            "Mention specific products in the request."
        ),
    ):
        agent_output = await run_agent(q.query)
        results.append({
            "query": q.query,
            "output": agent_output,
            "goal": q.metadata.goal_focus,
        })

    return results


Round 1 runs without goals. It gives you a baseline:


# Round 1: no goals, broad coverage
round_1 = await eval_cycle(run_agent=your_agent, count=20)


Review the results. Write notes: "stale policy in 4 of 20 runs," "web search chosen over formulary API in 6 cases," "output was prose instead of a checklist." Convert those patterns into goals and run again:


# Round 2: targeting failures found in round 1
goals = """
1. POLICY FRESHNESS: Always cite the most current payer policy.
   Payers update criteria quarterly; superseded rules cause denials.

2. SOURCE SELECTION: Prefer the formulary API and internal policy
   database over web search for drug lists and step-therapy data.

3. STRUCTURED OUTPUT: Produce a checklist reviewers can sign off on.
"""

round_2 = await eval_cycle(run_agent=your_agent, goals=goals, seed=1)


Round 2 stress-tests the specific failures you found. Review again. Refine your goals. Add new ones. Run round 3 with a different seed.


The analysis step is manual and deliberate. No automated classifier replaces the intuition you build from reading traces.


Each cycle tightens coverage. The first round catches obvious failures. By the third, you're stress-testing edge cases that real traffic won't hit for months. When production traffic arrives, feed those traces back into the loop. Production failures become new goals.

When This Approach Falls Short

Three caveats worth naming.


Synthetic queries complement production traffic; they don't replace it. Real users combine intent, context, and phrasing in ways no generator can fully replicate. Use synthetic queries to find structural failures early. Use production traces to find the failures you couldn't imagine.


LLM-generated goals inherit the LLM's blind spots. If the model that writes your goals can't conceive of a failure mode, the generated queries won't probe for it. The manual analysis step exists precisely because a human reading traces notices patterns that an LLM would not flag.


Manual analysis doesn't scale. Reading 50 traces is feasible. At high volume, you'll need automated checklists or anomaly detection to triage traces before human review.
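
Triage can start as cheap heuristics that decide which traces a human reads first; the checks below are illustrative and reuse the trace fields from the earlier example:

```python
def triage(trace: dict) -> list[str]:
    """Flag suspicious traces so humans review those first."""
    flags = []
    if trace.get("outcome") == "denied":
        flags.append("denied")
    tools = [s["tool"] for s in trace.get("steps", [])]
    if "formulary_api" not in tools:
        flags.append("no_formulary_api")
    return flags

print(triage({"outcome": "denied", "steps": [{"tool": "search"}]}))
# ['denied', 'no_formulary_api']
```

Unflagged traces can wait; flagged ones go to the front of the manual review queue.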


Conclusion

The eval loop is simple: generate queries across dimensions, run them, find failures, turn observations into goals, generate again. Each iteration maps more of the failure space. The system improves not because you wrote more code, but because you asked better questions.


Start small. Pick three or four dimensions, generate 50 queries, and write down what breaks. Categorize failures as component, trajectory, or outcome issues. Those categories become goals for the next round.


The loop doesn't end. But the failures get rarer, and weirder, and eventually you're catching things that would have taken months to surface in the wild.