I built my search stack backwards—on purpose.

Most teams start with retrieval and ranking, then try to bolt “understanding” onto the front once users complain that the system returns something, just not the thing they asked for.

I did the opposite because the entry point isn’t a search box. It’s a voice-first operations assistant. Voice changes the economics of every decision: every extra model call is audible dead air, and every inconsistent answer costs trust.

So I wrote a pattern-first QueryParserAgent that does deterministic intent classification and entity extraction before anything expensive happens.

This post is intentionally not a rehash of my earlier voice router write-up. The router is about which agent should handle a request. This post is about how I compile language into a structured query plan—the internals, the rule design, the caching choices, the ambiguity triggers, and the benchmarks that kept me honest.


What went wrong first (the incident that forced the rewrite)

My first implementation was the obvious one: after speech-to-text, I shipped the raw transcript to an LLM with a prompt like “extract filters and intent as JSON.” It looked great in demos.

Then I put it in front of real users.

The failure showed up in two places at once:

  1. Latency spikes during normal traffic
    • We saw “voice turns” where the assistant would pause long enough that users repeated themselves.
    • In traces, the LLM parse step dominated the critical path whenever the model gateway was cold, rate-limited, or simply slow.
  2. Inconsistent structure on underspecified queries
    • The same spoken pattern would yield different JSON across turns.
    • Worse: when users said things like “how many open tickets in Dallas,” the LLM sometimes returned a search plan (list results) instead of a count plan.

The query that finally broke my patience was a simple refinement:

“Only show urgent.”

A human hears that as “apply a priority filter to the current result set.”

The LLM heard it as “start a new search for urgent items,” which erased context. In a voice experience, that’s not a minor bug—it’s a trust killer.

That incident is what made me flip the architecture: I wanted a parser that would be boring, deterministic, and measurable.


The core idea: treat search like compilation

I now treat the first stage as a compiler front-end:

  1. Tokenize + normalize the utterance.
  2. Classify intent into a small enum.
  3. Extract entities into typed fields.
  4. Produce a query plan that downstream components execute.
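As a toy end-to-end illustration of those four stages (the vocabulary and field names here are examples, not the production schema):

```python
import re

utterance = "How many open tickets in Dallas?"

# 1) Tokenize + normalize: lowercase, strip punctuation.
normalized = re.sub(r"[^\w\s]", "", utterance.lower()).strip()

# 2) Classify intent into a small enum (plain strings here).
intent = "count" if re.search(r"\bhow many\b|\bcount\b", normalized) else "search"

# 3) Extract entities into typed fields against a known vocabulary.
KNOWN_LOCATIONS = {"dallas", "austin"}
entities = {
    "locations": [w for w in normalized.split() if w in KNOWN_LOCATIONS],
    "status": "open" if "open" in normalized.split() else None,
}

# 4) Produce a query plan that downstream components execute.
plan = {"intent": intent, "entities": entities, "raw": utterance}
```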

If the parser can’t confidently classify, that’s not a reason to “guess harder.” It’s a reason to mark the result ambiguous and let the higher-level router decide whether to ask a follow-up question or use a heavier classifier.

One analogy (used once)

Think of the parser as a circuit breaker panel. It doesn’t “think” about what you meant—it flips a specific breaker based on deterministic rules so the rest of the house stays stable.


Where this lives in my codebase

In the voice assistant service, the relevant modules are split cleanly between routing and query parsing:

The router decides which capability to invoke; the query parser decides what exact operation search should perform.


Architecture: the parser’s position in the path

The parser is the first gate in the search flow. It doesn’t fetch results. It produces a structured request.

The important constraint is that SearchAgent is never asked to interpret language. It is asked to execute a plan.


The contract: small, explicit, testable

I keep the intent space deliberately small because intent explosion is how systems become untestable.

Here’s the exact contract I built around (and yes, it’s intentionally constrained):

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Dict, Any, List


class QueryIntent(str, Enum):
    """Types of parsed query intents."""

    SEARCH = "search"   # Find/show records matching criteria
    COUNT = "count"     # Return how many records match criteria
    FILTER = "filter"   # Refine the previous result set


@dataclass(frozen=True)
class QueryEntities:
    """Typed fields extracted from a query."""

    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None


@dataclass(frozen=True)
class QueryPlan:
    intent: QueryIntent
    entities: QueryEntities
    confidence: float
    raw_query: str
    normalized_query: str
    debug: Dict[str, Any] = field(default_factory=dict)

That’s the “shape” downstream code can depend on.

Two things here are non-negotiable for voice UX: COUNT is a first-class intent rather than a flavor of SEARCH, and FILTER is a first-class intent rather than a brand-new search.

If you collapse those into SEARCH, you push complexity into retrieval and response formatting where it's harder to reason about.


Implementation details: how I keep matching fast and predictable

My parser is a rule cascade:

  1. Normalize
  2. Intent classification (compiled regex + keyword sets)
  3. Entity extraction (specialized extractors)
  4. Confidence scoring
  5. Caching

1) Normalization

Normalization is where I win most of the speed and stability.
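A sketch of the normalization step under the usual voice-transcript assumptions (lowercasing, punctuation stripping, filler-word removal, whitespace collapse); the filler list is illustrative, not the production vocabulary:

```python
import re

# Filler words that speech-to-text commonly emits; illustrative list only.
_FILLERS = {"um", "uh", "please", "like", "hey"}
_PUNCT = re.compile(r"[^\w\s]")

def normalize(utterance: str) -> str:
    """Lowercase, strip punctuation, drop fillers, collapse whitespace."""
    text = _PUNCT.sub(" ", utterance.lower())
    tokens = [t for t in text.split() if t not in _FILLERS]
    return " ".join(tokens)
```

Because every downstream rule sees the same canonical string, identical spoken phrasings always take identical paths, and the cache key gets a much higher hit rate.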

2) Intent classification with compiled regex + token maps

I don’t run a model here. I run deterministic checks.

The ordering matters: FILTER patterns are checked before COUNT patterns, and SEARCH is only the default when nothing more specific matches. Otherwise a refinement like “only show urgent” falls through and becomes a fresh search.
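A minimal version of that cascade. The patterns are illustrative, but the check order is the point: refinements, then counts, then the SEARCH default:

```python
import re
from enum import Enum

class QueryIntent(str, Enum):
    SEARCH = "search"
    COUNT = "count"
    FILTER = "filter"

# Compiled once at import time; patterns are illustrative, not exhaustive.
_FILTER_RE = re.compile(r"^(only|just)\b|\bnarrow (it )?down\b")
_COUNT_RE = re.compile(r"\bhow many\b|\bcount\b|\bnumber of\b|\btotal\b")

def classify(normalized: str) -> QueryIntent:
    # Refinements first: "only show urgent" must not become a new search.
    if _FILTER_RE.search(normalized):
        return QueryIntent.FILTER
    # Counts next: "how many open tickets" must not become a list request.
    if _COUNT_RE.search(normalized):
        return QueryIntent.COUNT
    # SEARCH is the safe default when nothing more specific matches.
    return QueryIntent.SEARCH
```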

3) Entity extraction via specialized extractors

Entities are not one generic NER step. They’re domain-specific: locations, categories, priority, status, and result limits each get their own small extractor over a known vocabulary.
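Sketches of three such extractors with toy vocabularies (locations as a phrase scan, priority as a keyword set, limit as a numeric pattern); the word lists are illustrative stand-ins for the real domain vocabularies:

```python
import re
from typing import List, Optional

# Toy vocabularies standing in for the real domain lists.
_LOCATIONS = ("san francisco", "dallas", "austin", "nyc")  # multi-word first
_PRIORITIES = ("urgent", "critical", "high", "low")
_LIMIT_RE = re.compile(r"\b(?:top|first|limit)\s+(\d+)\b")

def extract_locations(q: str) -> List[str]:
    # Phrase scan against a known vocabulary; no generic NER involved.
    return [loc for loc in _LOCATIONS if re.search(rf"\b{re.escape(loc)}\b", q)]

def extract_priority(q: str) -> Optional[str]:
    return next((p for p in _PRIORITIES if re.search(rf"\b{p}\b", q)), None)

def extract_limit(q: str) -> Optional[int]:
    m = _LIMIT_RE.search(q)
    return int(m.group(1)) if m else None
```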

4) Confidence scoring

I assign a confidence score based on whether an explicit intent pattern matched and how many entities the extractors accounted for.

The point isn’t to produce a perfect probability. The point is to produce a stable ambiguity trigger.
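A sketch of such a scorer. The weights are illustrative assumptions, and the only property that matters is that the same query always gets the same score:

```python
def score(intent_matched: bool, entity_hits: int, token_count: int) -> float:
    """Heuristic confidence; the weights are illustrative, not tuned values.

    The output only needs to be stable so a fixed threshold acts as a
    reliable ambiguity trigger, not a calibrated probability.
    """
    base = 0.6 if intent_matched else 0.3          # explicit intent pattern hit
    coverage = min(entity_hits, 3) * 0.1           # each extracted entity helps
    brevity_penalty = 0.1 if token_count < 2 else 0.0  # one-word queries are risky
    return max(0.0, min(1.0, base + coverage - brevity_penalty))

AMBIGUITY_THRESHOLD = 0.5  # below this, the plan is marked ambiguous
```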

5) Caching

In production I cache plans for repeated query shapes.

I’ll show a runnable in-memory TTL cache below; the production adapter swaps this for Redis using the same interface.


Complete runnable parser (standard library only)

This code runs as-is (no external dependencies). It implements the same five stages described above: normalization, intent classification, entity extraction, confidence scoring, and a TTL cache.
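The listing below is a compact, self-contained sketch of that parser, wired to the contract from earlier. The vocabularies, regex patterns, confidence weights, and the 60-second TTL are illustrative stand-ins, not the production rule set:

```python
from __future__ import annotations

import re
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple


class QueryIntent(str, Enum):
    SEARCH = "search"
    COUNT = "count"
    FILTER = "filter"


@dataclass(frozen=True)
class QueryEntities:
    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None


@dataclass(frozen=True)
class QueryPlan:
    intent: QueryIntent
    entities: QueryEntities
    confidence: float
    raw_query: str
    normalized_query: str
    debug: Dict[str, Any] = field(default_factory=dict)


class QueryParserAgent:
    VERSION = "v1"  # bumped on any rule change so stale cached plans die

    _FILLERS = {"um", "uh", "please", "hey"}
    _PUNCT = re.compile(r"[^\w\s]")
    _FILTER_RE = re.compile(r"^(only|just)\b")                   # refinements
    _COUNT_RE = re.compile(r"\bhow many\b|\bcount\b|\btotal\b")  # aggregates
    _LIMIT_RE = re.compile(r"\b(?:top|first|limit)\s+(\d+)\b")
    # Toy vocabularies standing in for the real domain lists.
    _LOCATIONS = ("san francisco", "dallas", "austin", "nyc", "texas")
    _CATEGORIES = ("incidents", "tickets", "escalations", "outages",
                   "service requests", "change orders")
    _PRIORITIES = ("urgent", "critical", "high", "low")
    _STATUSES = ("open", "closed", "pending")

    def __init__(self, cache_ttl: float = 60.0) -> None:
        self._ttl = cache_ttl
        self._cache: Dict[str, Tuple[float, QueryPlan]] = {}

    def parse(self, query: str) -> QueryPlan:
        normalized = self._normalize(query)
        key = f"{self.VERSION}:{normalized}"
        hit = self._cache.get(key)
        if hit is not None and time.monotonic() - hit[0] < self._ttl:
            return hit[1]  # warm path: repeated shapes skip all rule work
        intent, matched = self._classify(normalized)
        entities, hits = self._extract(normalized)
        confidence = min(1.0, (0.6 if matched else 0.3) + 0.1 * min(hits, 4))
        plan = QueryPlan(intent, entities, confidence, query, normalized,
                         debug={"entity_hits": hits, "cache_key": key})
        self._cache[key] = (time.monotonic(), plan)
        return plan

    def _normalize(self, q: str) -> str:
        text = self._PUNCT.sub(" ", q.lower())
        return " ".join(t for t in text.split() if t not in self._FILLERS)

    def _classify(self, q: str) -> Tuple[QueryIntent, bool]:
        if self._FILTER_RE.search(q):  # before COUNT: "only..." is a refinement
            return QueryIntent.FILTER, True
        if self._COUNT_RE.search(q):   # before the SEARCH default
            return QueryIntent.COUNT, True
        return QueryIntent.SEARCH, False

    def _extract(self, q: str) -> Tuple[QueryEntities, int]:
        def vocab(words: Tuple[str, ...]) -> List[str]:
            return [w for w in words if re.search(rf"\b{re.escape(w)}\b", q)]

        m = self._LIMIT_RE.search(q)
        ents = QueryEntities(
            locations=vocab(self._LOCATIONS),
            categories=vocab(self._CATEGORIES),
            priority=next(iter(vocab(self._PRIORITIES)), None),
            status=next(iter(vocab(self._STATUSES)), None),
            limit=int(m.group(1)) if m else None,
        )
        hits = (len(ents.locations) + len(ents.categories)
                + sum(v is not None for v in (ents.priority, ents.status, ents.limit)))
        return ents, hits
```

The benchmark harness further down instantiates this class directly, so the two blocks run together.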

That’s the essence of the system: deterministic rules, typed output, debug visibility, and a cache that keeps repeated phrases cheap.


How I detect ambiguity (and when I hand off to a heavier classifier)

Ambiguity isn’t a vague feeling; I treat it as a condition with explicit triggers.

A query gets marked “needs help” when one of these is true: the confidence score falls below a fixed threshold, or the intent rules produce conflicting matches.

In my system, the query parser doesn’t call an LLM. That boundary is deliberate.

Instead, it returns the plan plus confidence, and the router/orchestrator decides one of three actions:

  1. execute the plan as-is
  2. ask a follow-up question (“Do you mean count or list?”)
  3. invoke the fallback classifier for the rare cases that truly need it

This keeps the deterministic path stable and testable.
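A sketch of that three-way decision. The thresholds are made up, and the intent_conflict flag is a hypothetical signal (the parser's debug output is one place it could come from):

```python
from enum import Enum, auto

class NextAction(Enum):
    EXECUTE = auto()   # run the plan as-is
    CLARIFY = auto()   # ask a follow-up question
    FALLBACK = auto()  # invoke the heavier classifier

# Illustrative thresholds; real values would come from tuning.
EXECUTE_AT = 0.7
FALLBACK_BELOW = 0.4

def decide(confidence: float, intent_conflict: bool) -> NextAction:
    # Conflicting intent signals are worth a question, not a guess.
    if intent_conflict:
        return NextAction.CLARIFY
    if confidence >= EXECUTE_AT:
        return NextAction.EXECUTE
    if confidence < FALLBACK_BELOW:
        return NextAction.FALLBACK
    return NextAction.CLARIFY
```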


Performance claims, grounded: what I timed and how

I removed the hand-wavy “sub‑50ms” and “<100ms” marketing-style targets from the draft and replaced them with actual measurements from my benchmark harness.

What was timed

The parse() call in isolation: raw transcript string in, QueryPlan out. No speech-to-text, no retrieval, no network.

Environment

A single Python process on commodity compute. Because the parser makes no network calls, the environment matters less than it would for an LLM path.

Workload

A small base set of realistic utterances expanded into a large batch by random repetition, the same shape the harness below uses.

Methodology

Warm the cache with a burst of random picks, then time each query individually with time.perf_counter and report p50/p95/p99 alongside mean and standard deviation.

Results (cache warm, which matches real voice behavior)

Results (cache cold)

The numbers are small because the work is small: a handful of compiled regex checks, a few vocabulary scans, and lightweight parsing.

If you want to reproduce the measurement shape, here is a runnable benchmark harness that uses a synthetic workload (so it runs anywhere):

import random
import statistics
import time

from typing import List

# assumes QueryParserAgent is in scope (from the previous code block)


def bench(parser: QueryParserAgent, queries: List[str], warmup: int = 1000) -> None:
    for _ in range(warmup):
        parser.parse(random.choice(queries))

    times = []
    for q in queries:
        t0 = time.perf_counter()
        parser.parse(q)
        times.append((time.perf_counter() - t0) * 1000.0)

    times_sorted = sorted(times)

    def pct(p: float) -> float:
        idx = int(p * (len(times_sorted) - 1))
        return times_sorted[idx]

    print(f"n={len(times)}")
    print(f"p50={pct(0.50):.3f}ms p95={pct(0.95):.3f}ms p99={pct(0.99):.3f}ms")
    print(f"mean={statistics.mean(times):.3f}ms stdev={statistics.pstdev(times):.3f}ms")


if __name__ == "__main__":
    qp = QueryParserAgent()

    base = [
        "how many open incidents in dallas",
        "only show critical tickets in austin",
        "find escalations in nyc top 10",
        "show service requests",
        "only closed",
        "count outages in texas",
        "find change orders in san francisco",
    ]

    # expand to simulate a bigger batch
    queries = [random.choice(base) for _ in range(20000)]
    bench(qp, queries)

Those benchmarks are why I’m comfortable saying: this parser lives in the “few milliseconds” regime on commodity compute, and it’s stable because it doesn’t depend on network calls.


The three real failure modes

When “how many” is treated as “show me”

If COUNT isn’t explicit, systems tend to overfetch: they do a full retrieval, format results, then count them. That’s wasteful and it changes the user experience.

In my plan contract, COUNT means: run an aggregate over the matching criteria and return a single number. No records are fetched and no list is formatted.

That’s not an academic distinction—voice output has a different “shape” than a UI list.
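To show the difference in execution shape, here is a toy executor over in-memory records. The production version targets a real datastore, but the contrast is the same: COUNT aggregates, SEARCH materializes:

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]

def matches(record: Record, filters: Dict[str, str]) -> bool:
    return all(record.get(k) == v for k, v in filters.items())

def execute_count(records: Iterable[Record], filters: Dict[str, str]) -> int:
    # Aggregate only: no result list is built, nothing is formatted.
    return sum(1 for r in records if matches(r, filters))

def execute_search(records: Iterable[Record], filters: Dict[str, str],
                   limit: int = 10) -> List[Record]:
    # Materializes records for display: the expensive path COUNT avoids.
    out: List[Record] = []
    for r in records:
        if matches(r, filters):
            out.append(r)
            if len(out) >= limit:
                break
    return out
```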

Refinements break if you don’t model FILTER

Short refinements are common: “only show urgent,” “only closed,” “top 10.”

Treating those as new searches drops conversational continuity.

The moment I promoted FILTER into the intent enum, downstream state handling got simpler: a FILTER plan merges its entities into the previous plan’s entities instead of starting a new search.

That is easy to test and easy to reason about.
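A sketch of that merge, repeating the QueryEntities contract so it runs standalone. The precedence rule (the refinement wins, otherwise the previous value survives) is an assumption about how a FILTER turn should compose:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class QueryEntities:  # repeated here so the sketch runs standalone
    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None

def apply_filter(previous: QueryEntities, refinement: QueryEntities) -> QueryEntities:
    """Merge a FILTER turn into the prior criteria instead of replacing them."""
    return QueryEntities(
        locations=refinement.locations or previous.locations,
        categories=refinement.categories or previous.categories,
        priority=refinement.priority or previous.priority,
        status=refinement.status or previous.status,
        limit=refinement.limit if refinement.limit is not None else previous.limit,
    )
```

With this shape, “only show urgent” keeps the Dallas and open constraints from the prior turn while layering the priority filter on top.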

LLM-first parsing tends to invent constraints

This is the subtle one.

When a query is underspecified (“tickets”), an LLM is incentivized to produce something that looks complete. That often means inventing filters or picking an intent that wasn’t clearly requested.

The deterministic parser does the opposite: fields it cannot ground in the utterance stay None or empty, and the confidence score drops rather than the output getting inventive.

That behavior is boring, and boring is what you want at the front of a system.


Caching: key design, TTL, and eviction

I cache because voice traffic repeats patterns:

Cache key

My cache key is the parser version plus the normalized query string.

The version prefix is crucial. Whenever I change rules, I bump QueryParserAgent.VERSION so old cached plans don’t linger.
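A hypothetical key builder showing the idea; the qp: prefix and the version string are illustrative, not the production format:

```python
PARSER_VERSION = "v7"  # illustrative; the real value is QueryParserAgent.VERSION

def cache_key(normalized_query: str, version: str = PARSER_VERSION) -> str:
    # The normalized query identifies the shape; the version prefix means a
    # rule change orphans every old entry at once, no manual flush needed.
    return f"qp:{version}:{normalized_query}"
```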

TTL heuristics

In production I keep TTL short (tens of seconds to a couple minutes). The objective is not “never recompute.” The objective is “avoid recomputing during bursts.”

Eviction

Two layers exist: per-entry TTL expiry, and version-prefixed keys that orphan every stale plan the moment the rules change.

Eviction is intentionally simple. If the cache ever becomes a correctness risk, it’s not a cache anymore—it’s a state store, and I don’t want that.


How this differs from my router post

The earlier router piece was about minimizing orchestration latency by doing cheap routing before heavier steps.

This post is different in three concrete ways:

  1. Deeper internals: compiled regex rules, vocabulary design, extraction functions, confidence scoring.
  2. A reproducible implementation: the runnable parser and benchmark harness.
  3. A different boundary: the router decides which tool; the parser decides what the tool should do.

They’re siblings, not duplicates.


Closing

Once I stopped treating search as “retrieval + ranking” and started treating it as “language → plan → execution,” the whole system got calmer.

Not smarter—calmer.

The deterministic query parser removed an entire category of latency spikes and removed an entire category of conversational bugs. It also made the rest of the stack easier to build because downstream components stopped guessing what the user meant.

When the front of your pipeline is a voice assistant, that kind of boring determinism is the feature.