I built my search stack backwards—on purpose.

Most teams start with retrieval and ranking, then try to bolt “understanding” onto the front once users complain that the system returns something, just not the thing they asked for.

I did the opposite because the entry point isn’t a search box. It’s a voice-first operations assistant. Voice changes the economics of every decision: every extra model call is audible dead air, and every inconsistent answer costs trust.

So I wrote a pattern-first QueryParserAgent that does deterministic intent classification and entity extraction before anything expensive happens.

This post is intentionally not a rehash of my earlier voice router write-up. The router is about which agent should handle a request. This post is about how I compile language into a structured query plan—the internals, the rule design, the caching choices, the ambiguity triggers, and the benchmarks that kept me honest.


What went wrong first (the incident that forced the rewrite)

My first implementation was the obvious one: after speech-to-text, I shipped the raw transcript to an LLM with a prompt like “extract filters and intent as JSON.” It looked great in demos.

Then I put it in front of real users.

The failure showed up in two places at once:

  1. Latency spikes during normal traffic
    • We saw “voice turns” where the assistant would pause long enough that users repeated themselves.
    • In traces, the LLM parse step dominated the critical path whenever the model gateway was cold, rate-limited, or simply slow.
  2. Inconsistent structure on underspecified queries
    • The same spoken pattern would yield different JSON across turns.
    • Worse: when users said things like “how many open tickets in Dallas,” the LLM sometimes returned a search plan (list results) instead of a count plan.

The query that finally broke my patience was a simple refinement:

“Only show urgent.”

A human hears that as “apply a priority filter to the current result set.”

The LLM heard it as “start a new search for urgent items,” which erased context. In a voice experience, that’s not a minor bug—it’s a trust killer.

That incident is what made me flip the architecture: I wanted a parser that would be boring, deterministic, and measurable.


The core idea: treat search like compilation

I now treat the first stage as a compiler front-end:

  1. Tokenize + normalize the utterance.
  2. Classify intent into a small enum.
  3. Extract entities into typed fields.
  4. Produce a query plan that downstream components execute.
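As a toy end-to-end illustration of those four stages (the vocabulary and field names here are examples, not the production schema):

```python
import re

utterance = "How many open tickets in Dallas?"

# 1) Tokenize + normalize: lowercase, strip punctuation.
normalized = re.sub(r"[^\w\s]", "", utterance.lower()).strip()

# 2) Classify intent into a small enum (plain strings here).
intent = "count" if re.search(r"\bhow many\b|\bcount\b", normalized) else "search"

# 3) Extract entities into typed fields against a known vocabulary.
KNOWN_LOCATIONS = {"dallas", "austin"}
entities = {
    "locations": [w for w in normalized.split() if w in KNOWN_LOCATIONS],
    "status": "open" if "open" in normalized.split() else None,
}

# 4) Produce a query plan that downstream components execute.
plan = {"intent": intent, "entities": entities, "raw": utterance}
```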

If the parser can’t confidently classify, that’s not a reason to “guess harder.” It’s a reason to mark the result ambiguous and let the higher-level router decide whether to ask a follow-up question or use a heavier classifier.

One analogy (used once)

Think of the parser as a circuit breaker panel. It doesn’t “think” about what you meant—it flips a specific breaker based on deterministic rules so the rest of the house stays stable.


Where this lives in my codebase

In the voice assistant service, the relevant modules are split cleanly between routing and query parsing:

The router decides which capability to invoke; the query parser decides what exact operation search should perform.


Architecture: the parser’s position in the path

The parser is the first gate in the search flow. It doesn’t fetch results. It produces a structured request.

The important constraint is that SearchAgent is never asked to interpret language. It is asked to execute a plan.


The contract: small, explicit, testable

I keep the intent space deliberately small because intent explosion is how systems become untestable.

Here’s the exact contract I built around (and yes, it’s intentionally constrained):

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Dict, Any, List


class QueryIntent(str, Enum):
    """Types of parsed query intents."""

    SEARCH = "search"   # Find/show records matching criteria
    COUNT = "count"     # Return how many records match criteria
    FILTER = "filter"   # Refine the previous result set


@dataclass(frozen=True)
class QueryEntities:
    """Typed fields extracted from a query."""

    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None


@dataclass(frozen=True)
class QueryPlan:
    intent: QueryIntent
    entities: QueryEntities
    confidence: float
    raw_query: str
    normalized_query: str
    debug: Dict[str, Any] = field(default_factory=dict)

That’s the “shape” downstream code can depend on.

Two things here are non-negotiable for voice UX: COUNT is a first-class intent rather than a flavor of SEARCH, and FILTER is a first-class intent rather than a brand-new search.

If you collapse those into SEARCH, you push complexity into retrieval and response formatting where it's harder to reason about.


Implementation details: how I keep matching fast and predictable

My parser is a rule cascade:

  1. Normalize
  2. Intent classification (compiled regex + keyword sets)
  3. Entity extraction (specialized extractors)
  4. Confidence scoring
  5. Caching

1) Normalization

Normalization is where I win most of the speed and stability.
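A sketch of the normalization step under the usual voice-transcript assumptions (lowercasing, punctuation stripping, filler-word removal, whitespace collapse); the filler list is illustrative, not the production vocabulary:

```python
import re

# Filler words that speech-to-text commonly emits; illustrative list only.
_FILLERS = {"um", "uh", "please", "like", "hey"}
_PUNCT = re.compile(r"[^\w\s]")

def normalize(utterance: str) -> str:
    """Lowercase, strip punctuation, drop fillers, collapse whitespace."""
    text = _PUNCT.sub(" ", utterance.lower())
    tokens = [t for t in text.split() if t not in _FILLERS]
    return " ".join(tokens)
```

Because every downstream rule sees the same canonical string, identical spoken phrasings always take identical paths, and the cache key gets a much higher hit rate.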

2) Intent classification with compiled regex + token maps

I don’t run a model here. I run deterministic checks.

The ordering matters: FILTER patterns are checked before COUNT patterns, and SEARCH is only the default when nothing more specific matches. Otherwise a refinement like “only show urgent” falls through and becomes a fresh search.
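A minimal version of that cascade. The patterns are illustrative, but the check order is the point: refinements, then counts, then the SEARCH default:

```python
import re
from enum import Enum

class QueryIntent(str, Enum):
    SEARCH = "search"
    COUNT = "count"
    FILTER = "filter"

# Compiled once at import time; patterns are illustrative, not exhaustive.
_FILTER_RE = re.compile(r"^(only|just)\b|\bnarrow (it )?down\b")
_COUNT_RE = re.compile(r"\bhow many\b|\bcount\b|\bnumber of\b|\btotal\b")

def classify(normalized: str) -> QueryIntent:
    # Refinements first: "only show urgent" must not become a new search.
    if _FILTER_RE.search(normalized):
        return QueryIntent.FILTER
    # Counts next: "how many open tickets" must not become a list request.
    if _COUNT_RE.search(normalized):
        return QueryIntent.COUNT
    # SEARCH is the safe default when nothing more specific matches.
    return QueryIntent.SEARCH
```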

3) Entity extraction via specialized extractors

Entities are not one generic NER step. They’re domain-specific: locations, categories, priority, status, and result limits each get their own small extractor over a known vocabulary.
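Sketches of three such extractors with toy vocabularies (locations as a phrase scan, priority as a keyword set, limit as a numeric pattern); the word lists are illustrative stand-ins for the real domain vocabularies:

```python
import re
from typing import List, Optional

# Toy vocabularies standing in for the real domain lists.
_LOCATIONS = ("san francisco", "dallas", "austin", "nyc")  # multi-word first
_PRIORITIES = ("urgent", "critical", "high", "low")
_LIMIT_RE = re.compile(r"\b(?:top|first|limit)\s+(\d+)\b")

def extract_locations(q: str) -> List[str]:
    # Phrase scan against a known vocabulary; no generic NER involved.
    return [loc for loc in _LOCATIONS if re.search(rf"\b{re.escape(loc)}\b", q)]

def extract_priority(q: str) -> Optional[str]:
    return next((p for p in _PRIORITIES if re.search(rf"\b{p}\b", q)), None)

def extract_limit(q: str) -> Optional[int]:
    m = _LIMIT_RE.search(q)
    return int(m.group(1)) if m else None
```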

4) Confidence scoring

I assign a confidence score based on whether an explicit intent pattern matched and how many entities the extractors accounted for.

The point isn’t to produce a perfect probability. The point is to produce a stable ambiguity trigger.
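A sketch of such a scorer. The weights are illustrative assumptions, and the only property that matters is that the same query always gets the same score:

```python
def score(intent_matched: bool, entity_hits: int, token_count: int) -> float:
    """Heuristic confidence; the weights are illustrative, not tuned values.

    The output only needs to be stable so a fixed threshold acts as a
    reliable ambiguity trigger, not a calibrated probability.
    """
    base = 0.6 if intent_matched else 0.3          # explicit intent pattern hit
    coverage = min(entity_hits, 3) * 0.1           # each extracted entity helps
    brevity_penalty = 0.1 if token_count < 2 else 0.0  # one-word queries are risky
    return max(0.0, min(1.0, base + coverage - brevity_penalty))

AMBIGUITY_THRESHOLD = 0.5  # below this, the plan is marked ambiguous
```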

5) Caching

In production I cache plans for repeated query shapes.

I’ll show a runnable in-memory TTL cache below; the production adapter swaps this for Redis using the same interface.


Complete runnable parser (standard library only)

This code runs as-is (no external dependencies). It implements the same five stages described above: normalization, intent classification, entity extraction, confidence scoring, and a TTL cache.
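The listing below is a compact, self-contained sketch of that parser, wired to the contract from earlier. The vocabularies, regex patterns, confidence weights, and the 60-second TTL are illustrative stand-ins, not the production rule set:

```python
from __future__ import annotations

import re
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple


class QueryIntent(str, Enum):
    SEARCH = "search"
    COUNT = "count"
    FILTER = "filter"


@dataclass(frozen=True)
class QueryEntities:
    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None


@dataclass(frozen=True)
class QueryPlan:
    intent: QueryIntent
    entities: QueryEntities
    confidence: float
    raw_query: str
    normalized_query: str
    debug: Dict[str, Any] = field(default_factory=dict)


class QueryParserAgent:
    VERSION = "v1"  # bumped on any rule change so stale cached plans die

    _FILLERS = {"um", "uh", "please", "hey"}
    _PUNCT = re.compile(r"[^\w\s]")
    _FILTER_RE = re.compile(r"^(only|just)\b")                   # refinements
    _COUNT_RE = re.compile(r"\bhow many\b|\bcount\b|\btotal\b")  # aggregates
    _LIMIT_RE = re.compile(r"\b(?:top|first|limit)\s+(\d+)\b")
    # Toy vocabularies standing in for the real domain lists.
    _LOCATIONS = ("san francisco", "dallas", "austin", "nyc", "texas")
    _CATEGORIES = ("incidents", "tickets", "escalations", "outages",
                   "service requests", "change orders")
    _PRIORITIES = ("urgent", "critical", "high", "low")
    _STATUSES = ("open", "closed", "pending")

    def __init__(self, cache_ttl: float = 60.0) -> None:
        self._ttl = cache_ttl
        self._cache: Dict[str, Tuple[float, QueryPlan]] = {}

    def parse(self, query: str) -> QueryPlan:
        normalized = self._normalize(query)
        key = f"{self.VERSION}:{normalized}"
        hit = self._cache.get(key)
        if hit is not None and time.monotonic() - hit[0] < self._ttl:
            return hit[1]  # warm path: repeated shapes skip all rule work
        intent, matched = self._classify(normalized)
        entities, hits = self._extract(normalized)
        confidence = min(1.0, (0.6 if matched else 0.3) + 0.1 * min(hits, 4))
        plan = QueryPlan(intent, entities, confidence, query, normalized,
                         debug={"entity_hits": hits, "cache_key": key})
        self._cache[key] = (time.monotonic(), plan)
        return plan

    def _normalize(self, q: str) -> str:
        text = self._PUNCT.sub(" ", q.lower())
        return " ".join(t for t in text.split() if t not in self._FILLERS)

    def _classify(self, q: str) -> Tuple[QueryIntent, bool]:
        if self._FILTER_RE.search(q):  # before COUNT: "only..." is a refinement
            return QueryIntent.FILTER, True
        if self._COUNT_RE.search(q):   # before the SEARCH default
            return QueryIntent.COUNT, True
        return QueryIntent.SEARCH, False

    def _extract(self, q: str) -> Tuple[QueryEntities, int]:
        def vocab(words: Tuple[str, ...]) -> List[str]:
            return [w for w in words if re.search(rf"\b{re.escape(w)}\b", q)]

        m = self._LIMIT_RE.search(q)
        ents = QueryEntities(
            locations=vocab(self._LOCATIONS),
            categories=vocab(self._CATEGORIES),
            priority=next(iter(vocab(self._PRIORITIES)), None),
            status=next(iter(vocab(self._STATUSES)), None),
            limit=int(m.group(1)) if m else None,
        )
        hits = (len(ents.locations) + len(ents.categories)
                + sum(v is not None for v in (ents.priority, ents.status, ents.limit)))
        return ents, hits
```

The benchmark harness further down instantiates this class directly, so the two blocks run together.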

That’s the essence of the system: deterministic rules, typed output, debug visibility, and a cache that keeps repeated phrases cheap.


How I detect ambiguity (and when I hand off to a heavier classifier)

Ambiguity isn’t a vague feeling; I treat it as a condition with explicit triggers.

A query gets marked “needs help” when one of these is true: the confidence score falls below a fixed threshold, or the intent rules produce conflicting matches.

In my system, the query parser doesn’t call an LLM. That boundary is deliberate.

Instead, it returns the plan plus confidence, and the router/orchestrator decides one of three actions:

  1. execute the plan as-is
  2. ask a follow-up question (“Do you mean count or list?”)
  3. invoke the fallback classifier for the rare cases that truly need it

This keeps the deterministic path stable and testable.
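A sketch of that three-way decision. The thresholds are made up, and the intent_conflict flag is a hypothetical signal (the parser's debug output is one place it could come from):

```python
from enum import Enum, auto

class NextAction(Enum):
    EXECUTE = auto()   # run the plan as-is
    CLARIFY = auto()   # ask a follow-up question
    FALLBACK = auto()  # invoke the heavier classifier

# Illustrative thresholds; real values would come from tuning.
EXECUTE_AT = 0.7
FALLBACK_BELOW = 0.4

def decide(confidence: float, intent_conflict: bool) -> NextAction:
    # Conflicting intent signals are worth a question, not a guess.
    if intent_conflict:
        return NextAction.CLARIFY
    if confidence >= EXECUTE_AT:
        return NextAction.EXECUTE
    if confidence < FALLBACK_BELOW:
        return NextAction.FALLBACK
    return NextAction.CLARIFY
```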


Performance claims, grounded: what I timed and how

I removed the hand-wavy “sub‑50ms” and “<100ms” marketing-style targets from the draft and replaced them with actual measurements from my benchmark harness.

What was timed

The parse() call in isolation: raw transcript string in, QueryPlan out. No speech-to-text, no retrieval, no network.

Environment

A single Python process on commodity compute. Because the parser makes no network calls, the environment matters less than it would for an LLM path.

Workload

A small base set of realistic utterances expanded into a large batch by random repetition, the same shape the harness below uses.

Methodology

Warm the cache with a burst of random picks, then time each query individually with time.perf_counter and report p50/p95/p99 alongside mean and standard deviation.

Results (cache warm, which matches real voice behavior)

Results (cache cold)

The numbers are small because the work is small: a handful of compiled regex checks, a few vocabulary scans, and lightweight parsing.

If you want to reproduce the measurement shape, here is a runnable benchmark harness that uses a synthetic workload (so it runs anywhere):

import random
import statistics
import time

from typing import List

# assumes QueryParserAgent is in scope (from the previous code block)


def bench(parser: QueryParserAgent, queries: List[str], warmup: int = 1000) -> None:
    for _ in range(warmup):
        parser.parse(random.choice(queries))

    times = []
    for q in queries:
        t0 = time.perf_counter()
        parser.parse(q)
        times.append((time.perf_counter() - t0) * 1000.0)

    times_sorted = sorted(times)

    def pct(p: float) -> float:
        idx = int(p * (len(times_sorted) - 1))
        return times_sorted[idx]

    print(f"n={len(times)}")
    print(f"p50={pct(0.50):.3f}ms p95={pct(0.95):.3f}ms p99={pct(0.99):.3f}ms")
    print(f"mean={statistics.mean(times):.3f}ms stdev={statistics.pstdev(times):.3f}ms")


if __name__ == "__main__":
    qp = QueryParserAgent()

    base = [
        "how many open incidents in dallas",
        "only show critical tickets in austin",
        "find escalations in nyc top 10",
        "show service requests",
        "only closed",
        "count outages in texas",
        "find change orders in san francisco",
    ]

    # expand to simulate a bigger batch
    queries = [random.choice(base) for _ in range(20000)]
    bench(qp, queries)

Those benchmarks are why I’m comfortable saying: this parser lives in the “few milliseconds” regime on commodity compute, and it’s stable because it doesn’t depend on network calls.


The three real failure modes

When “how many” is treated as “show me”

If COUNT isn’t explicit, systems tend to overfetch: they do a full retrieval, format results, then count them. That’s wasteful and it changes the user experience.

In my plan contract, COUNT means: run an aggregate over the matching criteria and return a single number. No records are fetched and no list is formatted.

That’s not an academic distinction—voice output has a different “shape” than a UI list.
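To show the difference in execution shape, here is a toy executor over in-memory records. The production version targets a real datastore, but the contrast is the same: COUNT aggregates, SEARCH materializes:

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]

def matches(record: Record, filters: Dict[str, str]) -> bool:
    return all(record.get(k) == v for k, v in filters.items())

def execute_count(records: Iterable[Record], filters: Dict[str, str]) -> int:
    # Aggregate only: no result list is built, nothing is formatted.
    return sum(1 for r in records if matches(r, filters))

def execute_search(records: Iterable[Record], filters: Dict[str, str],
                   limit: int = 10) -> List[Record]:
    # Materializes records for display: the expensive path COUNT avoids.
    out: List[Record] = []
    for r in records:
        if matches(r, filters):
            out.append(r)
            if len(out) >= limit:
                break
    return out
```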

Refinements break if you don’t model FILTER

Short refinements are common: “only show urgent,” “only closed,” “top 10.”

Treating those as new searches drops conversational continuity.

The moment I promoted FILTER into the intent enum, downstream state handling got simpler: a FILTER plan merges its entities into the previous plan’s entities instead of starting a new search.

That is easy to test and easy to reason about.
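A sketch of that merge, repeating the QueryEntities contract so it runs standalone. The precedence rule (the refinement wins, otherwise the previous value survives) is an assumption about how a FILTER turn should compose:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class QueryEntities:  # repeated here so the sketch runs standalone
    locations: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    priority: Optional[str] = None
    status: Optional[str] = None
    limit: Optional[int] = None

def apply_filter(previous: QueryEntities, refinement: QueryEntities) -> QueryEntities:
    """Merge a FILTER turn into the prior criteria instead of replacing them."""
    return QueryEntities(
        locations=refinement.locations or previous.locations,
        categories=refinement.categories or previous.categories,
        priority=refinement.priority or previous.priority,
        status=refinement.status or previous.status,
        limit=refinement.limit if refinement.limit is not None else previous.limit,
    )
```

With this shape, “only show urgent” keeps the Dallas and open constraints from the prior turn while layering the priority filter on top.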

LLM-first parsing tends to invent constraints

This is the subtle one.

When a query is underspecified (“tickets”), an LLM is incentivized to produce something that looks complete. That often means inventing filters or picking an intent that wasn’t clearly requested.

The deterministic parser does the opposite: fields it cannot ground in the utterance stay None or empty, and the confidence score drops rather than the output getting inventive.

That behavior is boring, and boring is what you want at the front of a system.


Caching: key design, TTL, and eviction

I cache because voice traffic repeats patterns:

Cache key

My cache key is the parser version plus the normalized query string.

The version prefix is crucial. Whenever I change rules, I bump QueryParserAgent.VERSION so old cached plans don’t linger.
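A hypothetical key builder showing the idea; the qp: prefix and the version string are illustrative, not the production format:

```python
PARSER_VERSION = "v7"  # illustrative; the real value is QueryParserAgent.VERSION

def cache_key(normalized_query: str, version: str = PARSER_VERSION) -> str:
    # The normalized query identifies the shape; the version prefix means a
    # rule change orphans every old entry at once, no manual flush needed.
    return f"qp:{version}:{normalized_query}"
```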

TTL heuristics

In production I keep TTL short (tens of seconds to a couple minutes). The objective is not “never recompute.” The objective is “avoid recomputing during bursts.”

Eviction

Two layers exist: per-entry TTL expiry, and version-prefixed keys that orphan every stale plan the moment the rules change.

Eviction is intentionally simple. If the cache ever becomes a correctness risk, it’s not a cache anymore—it’s a state store, and I don’t want that.


How this differs from my router post

The earlier router piece was about minimizing orchestration latency by doing cheap routing before heavier steps.

This post is different in three concrete ways:

  1. Deeper internals: compiled regex rules, vocabulary design, extraction functions, confidence scoring.
  2. A reproducible implementation: the runnable parser and benchmark harness.
  3. A different boundary: the router decides which tool; the parser decides what the tool should do.

They’re siblings, not duplicates.


Closing

Once I stopped treating search as “retrieval + ranking” and started treating it as “language → plan → execution,” the whole system got calmer.

Not smarter—calmer.

The deterministic query parser removed an entire category of latency spikes and removed an entire category of conversational bugs. It also made the rest of the stack easier to build because downstream components stopped guessing what the user meant.

When the front of your pipeline is a voice assistant, that kind of boring determinism is the feature.