How an OpenAPI-to-MCP bridge achieved 96% token reduction without losing discoverability.

Powerful models are not enough. The way we expose tools, APIs, and data to those models determines whether an AI assistant feels precise and responsive, or slow, confused, and expensive.

In this post, I’ll walk through the architecture of an OpenAPI-to-MCP bridge I built that converts large REST APIs into MCP tools. The concrete problem was simple to state but hard to solve:

How do you expose hundreds of API operations without blowing up the context window or confusing the AI?

The naive approach treated every API endpoint as its own MCP tool. With a platform like Elastic Path (disclosure: my place of employment) - which exposes hundreds of granular microservice endpoints across dozens of services - this exploded instantly. More than 300 tools, 157 resources, and 36,000 tokens were required just to enumerate what the server could do. All of that overhead was loaded before the user typed a single word.

By rethinking how context is exposed, I cut the initial token footprint down to roughly 1,200 tokens (600–700 is achievable with aggressive tuning), plus about 1,250 tokens of instructions, while maintaining full discoverability of every operation and keeping all 2,907 documentation resources reachable in a controlled way. The AI still has access to everything, but it learns about those capabilities progressively instead of all at once.

The pattern that emerged from this work is what I call the Progressive Context Disclosure Pattern, or PCD.

From OpenAPI Specs to Context Explosion

The bridge starts with OpenAPI specs. For a large e-commerce platform like Elastic Path, that means dozens of services, hundreds of operations, and thousands of documentation resources.

If you convert each operation directly into an MCP tool with full descriptions, schemas, and examples, you end up with tens of thousands of tokens in tool definitions alone. In a 200K-token context, the naive version of this server consumed around 18% of the entire window at initialization.

The cognitive load was just as bad. From the model’s perspective, it saw a wall of similar-looking tools like ep_pxm_products_list, ep_pxm_products_get, ep_pxm_products_create, repeated across dozens of domains. Technically everything was available; practically, almost nothing was obvious.

On top of that sat a huge “shadow surface” of documentation: schemas, request/response examples, domain guides, workflow docs. Expose all of that at once and you easily push token usage past 40K–50K just to describe what exists.

This is exactly the situation PCD is designed to fix.

The Progressive Context Disclosure Pattern (PCD)

PCD is the architecture that emerged as I tried to make this OpenAPI-to-MCP bridge usable at scale.

The core idea:

Expose just enough structure for the AI to find what it needs, and reveal everything else only when requested.

The implementation is layered, not monolithic. The server:

  1. Starts with a discovery-first surface instead of exposing everything
  2. Uses tag-based grouping to organize operations into business-domain tools
  3. Filters resources by path depth so only high-value endpoints become first-class resources
  4. Relies on cursor-based pagination to keep tool and resource lists small
  5. Treats documentation as on-demand context, not default payload
  6. Implements a dedicated search layer so the AI does not have to “know MCP” to find operations
  7. Optimizes API responses so the model sees what matters, not every bit of metadata
  8. And applies context-aware authentication so each call uses the right OAuth2 flow automatically, with security boundaries enforced by design

All of this sits on a multi-tenant, configuration-driven, MCP-spec compliant server with circuit breakers protecting production.

Layer 1: Discovery-First, Not “Everything-First”

The first shift was philosophical. Instead of turning every endpoint into a separate tool and exposing them all, I introduced the concept of discovery tools.

On initial load, the server exposes a small, fixed set of tools whose entire purpose is to help the model discover everything else: search_operations, available_tags, mcp_resource_read, and a few companions for listing resources, enumerating prompts, and batching requests.

This is the “front door” of the server. Everything else - hundreds of operations and thousands of documentation resources - is reachable through those tools, not pre-loaded as separate tool definitions.

A typical flow looks like:

  1. The AI calls search_operations or available_tags with a query such as “products”, “accounts”, or “create product”.
  2. The server searches across all OpenAPI-derived operations (and optionally documentation) and returns a small, filtered set of matches.
  3. If the AI needs more detail, it calls mcp_resource_read on a docs:// URI to fetch request/response examples, schemas, or conceptual docs.
  4. When it is ready to act, it calls the relevant domain tool (for example, elasticpath_products) with a specific operation value.

From the model’s perspective, the surface area is small and stable: a way to search operations, explore tags, list resources, read docs, enumerate prompts, and batch requests. Everything else is discovered through those entry points.
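As a rough sketch, here is the same flow expressed with the MCP TypeScript SDK client; the create_product operation and its body are illustrative, while the tool names and docs:// URI follow the ones used in this server:

// Assuming an already-connected MCP TypeScript SDK client.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
declare const client: Client;

// 1. Discover: search across every OpenAPI-derived operation.
const matches = await client.callTool({
  name: "search_operations",
  arguments: { query: "create product", operation_type: "write", limit: 10 },
});

// 2. Pull docs on demand (URI format comes from the tool description hints).
const docs = await client.readResource({ uri: "docs://ep_products/request" });

// 3. Act: call the domain tool with a specific operation value.
// "create_product" and the body shape here are illustrative.
const result = await client.callTool({
  name: "elasticpath_products",
  arguments: { operation: "create_product", body: { name: "Blue T-Shirt" } },
});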

In raw numbers, this discovery-first layer alone shrinks the initial token exposure from about 36,000 tokens to roughly 500–600 when aggressively tuned (around 1,200 in the default configuration). That’s a 96–97% reduction before we even optimize how tools are organized.

Layer 2: Tag-Based Grouping — Intelligent Operation Organization

The next problem was tool organization. A flat list of hundreds of tools is not just expensive; it’s cognitively hostile.

The traditional mapping of one tool per operation doesn’t scale. Two hundred operations become two hundred tools. You end up with 17,000+ tokens just for tool definitions, and the AI has to scan a haystack of similarly named tools to find the needle it needs.

The bridge uses OpenAPI tags as the primary dimension for grouping.

In the spec, operations might look like this:

paths:
  /products:
    get:
      tags: ["Products"]
      operationId: "listProducts"
      
  /products/{id}:
    get:
      tags: ["Products"]
      operationId: "getProduct"
    put:
      tags: ["Products"]
      operationId: "updateProduct"

Everything tagged "Products" becomes part of a single MCP tool, such as elasticpath_products. That tool exposes an operation enum parameter that controls which underlying endpoint is executed.

Real-world specs often fragment tags: “Accounts”, “Account Members”, “Account Membership”, “Account Addresses”, “Account Tags”, “Account Cart Associations”. Semantically, those belong to the same domain. So the bridge supports tag consolidation in configuration:
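A sketch of what that consolidation mapping can look like; the configuration shape here is illustrative, not the bridge's actual schema:

// Hypothetical consolidation config: several related OpenAPI tags fold into a
// single MCP domain tool instead of surfacing as six separate tool groups.
const tagConsolidation: Record<string, string[]> = {
  elasticpath_accounts: [
    "Accounts",
    "Account Members",
    "Account Membership",
    "Account Addresses",
    "Account Tags",
    "Account Cart Associations",
  ],
};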

For Elastic Path, this consolidation produces big wins: half a dozen account-related tags collapse into a single elasticpath_accounts tool, and the same pattern repeats across every domain, cutting the tool count dramatically.

Under the hood, each operation has a stable operationId. If the spec defines one, the server uses it. If not, it generates one from the method and path (GET /pxm/products/{id} → get_products_by_id, etc.). When the AI calls:

{
  "tool": "elasticpath_products",
  "arguments": {
    "operation": "get_products_by_id",
    "product_id": "prod_123"
  }
}

The server looks up that operation ID, resolves it to a concrete endpoint, applies authentication, executes the HTTP call, and returns the result.
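A sketch of the fallback ID generation (stripping a known API prefix such as /pxm is my inference from the example above):

// Build a stable fallback operation ID from method + path when the spec
// defines no operationId, e.g. "GET /pxm/products/{id}" -> "get_products_by_id".
function fallbackOperationId(method: string, path: string, prefixes = ["/pxm"]): string {
  let trimmed = path;
  for (const prefix of prefixes) {
    if (trimmed.startsWith(prefix)) trimmed = trimmed.slice(prefix.length);
  }
  const segments = trimmed
    .split("/")
    .filter(Boolean)
    .map((seg) => (seg.startsWith("{") ? `by_${seg.slice(1, -1)}` : seg));
  return [method.toLowerCase(), ...segments].join("_");
}

// fallbackOperationId("GET", "/pxm/products/{id}") === "get_products_by_id"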

This preserves full coverage—no operation is lost—while cutting the number of tools and their token footprint dramatically. It also gives the AI a structured mental model: accounts live under elasticpath_accounts, products under elasticpath_products, and so on.

Layer 3: Depth-Based Resource Filtering

Once tools were tamed, the next issue was resources.

Exposing a resource for every GET endpoint quickly becomes unwieldy. In this system, that could have meant 157 separate resources, plus hundreds of additional schema and example resources.

To keep the default surface area manageable, the server uses path depth as a simple but effective filter: shallow paths become first-class resources, while deeply nested endpoints stay reachable only through search and the domain tools.

This rule is configurable per API, but in practice it often looks like: “expose resources at depth 3 or less for this domain.” Using that rule reduces resource exposure by about 60–70% while still keeping deeper operations available on demand.
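As a sketch, the depth rule amounts to counting path segments against a configurable threshold; the helpers below are illustrative:

// "/products" has depth 1, "/products/{id}" depth 2,
// "/catalogs/{id}/nodes/{nodeId}/relationships" depth 5.
function pathDepth(path: string): number {
  return path.split("/").filter(Boolean).length;
}

// Shallow GET endpoints become first-class MCP resources; deeper ones remain
// reachable through search_operations and the domain tools instead.
function isFirstClassResource(path: string, maxDepth = 3): boolean {
  return pathDepth(path) <= maxDepth;
}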

The end result is that a first page of resources behaves more like an API index than a dump: top-level entities and key entry points, not every deeply nested variant.

Layer 4: Documentation Without Flooding the Model

Underneath tools and resources sits a large documentation layer.

The server generates thousands of documentation resources: endpoint reference docs, request and response examples, schemas, and domain and workflow guides.

In total, there are 2,907 documentation resources. If you simply expose all of them, any search like “create product” returns a wall of docs and examples, often hiding the actual executable operation.

So documentation itself is governed by progressive disclosure.

By default, documentation resources take a back seat to executable operations: they are excluded from ordinary operation searches and only surface when the query explicitly asks for docs, schemas, or examples.

If the AI asks “create product”, the result is intentionally not a pile of docs, because the write operation is a tool. If it asks “create product request” or “product schema”, the server can return the corresponding example and schema resources.

Tool descriptions support this with small hints instead of embedded docs:

See docs://ep_products/request for request examples. See docs://ep_products/endpoint for detailed endpoint documentation.

Those lines are cheap in token terms and act as anchors. A single resources/read call pulls in a schema or example only when needed, instead of baking those examples into every tool description.
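As a sketch, the anchors live directly in the tool metadata rather than as embedded content; the shape below is simplified, and the description mirrors the hint above:

// Doc pointers are cheap one-line anchors inside the tool definition; the
// actual examples are fetched later with a single resources/read call.
const productsTool = {
  name: "ep_products",
  description:
    "Manage Products operations. " +
    "See docs://ep_products/request for request examples. " +
    "See docs://ep_products/endpoint for detailed endpoint documentation.",
  inputSchema: { type: "object", properties: { operation: { type: "string" } } },
};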

The result: the default resource view stays around 1.2K tokens instead of the 40K+ it would cost to surface all 2,907 resources at once. Documentation remains fully accessible, but you only pay for it when you actually use it.

Layer 5: Search and Discoverability — From Simple to Intelligent

The last missing piece is search.

Early on, AI clients needed to understand MCP’s split between tools and resources to find anything: executable operations lived behind tools/list and tools/call, while docs, schemas, and examples lived behind resources/list and resources/read.

That is a terrible developer experience for an LLM. So the bridge introduces a dedicated search_operations tool that provides a unified search interface across both tools and resources.

A typical call looks like this:

{
  "tool": "search_operations",
  "arguments": {
    "query": "create product",
    "operation_type": "write",  // all, read, or write
    "domain": "Products",
    "limit": 20
  }
}

Under the hood, the search tool matches the query against the OpenAPI-derived operations and, when the query calls for it, the documentation resources; applies the operation_type and domain filters; and returns a small ranked list that mixes executable operations with pointers to relevant docs.

From the AI’s perspective, there is one search box: “search for operations”, period. It does not need to “know” which MCP endpoint to call for reads vs writes.

To make this fast and robust, the search system itself evolved in phases.

First, I added a tag index and simple query analytics.

A small in-memory cache keeps recent queries hot, with an LRU eviction policy and a short TTL.

That alone makes tag-based filtering and repeated searches significantly faster, with nanosecond-level overheads on the indexing and sub-millisecond lookups for most queries.
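A minimal sketch of such a cache, using a Map's insertion order for LRU eviction; the size and TTL values are placeholders rather than the server's real settings:

// Tiny LRU cache with TTL for search results. A Map preserves insertion order,
// so the oldest entry is simply the first key when we need to evict.
class SearchCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize = 100, private ttlMs = 30_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (Date.now() > hit.expiresAt) {           // expired: drop it
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                   // re-insert to refresh recency
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}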

Second, I built a tag hierarchy and recommendation layer on top of that index.

This hierarchy is built automatically from the OpenAPI specs and runtime usage patterns; there’s no manual taxonomy work required.

Third, the search engine got smarter about language: queries are normalized and expanded so that natural phrasing maps onto the vocabulary the specs actually use.

The net effect is that search is forgiving and intent-friendly. “Customer orders” expands into a set of tokens the system actually understands and maps to specific domains and operations.

On top of that, the server annotates resources with metadata that can help clients prioritize what to show first. Annotations like:

{
  "uri": "products://{id}",
  "annotations": {
    "audience": ["assistant"],
    "priority": 0.8
  }
}

allow downstream UIs or agents to sort or filter resources based on audience and importance. Admin-heavy tools can be given higher priority for internal assistants; shopper-oriented resources can be highlighted for customer-facing bots.

Search becomes more than a string match; it becomes a routing layer that connects user intent to the right operation, tool, or doc with minimal ceremony.

Layer 6: Cursor-Based Pagination and Stateless Cursors

All of the above relies on the ability to return some results without returning all results.

Both tools and resources are paginated using stateless, cursor-based pagination defined by the MCP spec. Cursors are base64-encoded JSON blobs containing the offset and a hash of the active filters. The AI treats them as opaque. When it needs more results, it passes the cursor back; if the filters have changed or the cursor is invalid, the server responds with a clear MCP error.
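A sketch of how such a cursor can be encoded and checked, assuming Node's Buffer and crypto; the exact payload shape is illustrative:

import { createHash } from "node:crypto";

// Cursor = base64(JSON) of the offset plus a hash of the active filters.
// If the filters change between pages, the hash no longer matches and the
// server can reject the cursor with a clear error instead of returning
// inconsistent pages.
function filterHash(filters: Record<string, unknown>): string {
  return createHash("sha256").update(JSON.stringify(filters)).digest("hex").slice(0, 12);
}

function encodeCursor(offset: number, filters: Record<string, unknown>): string {
  return Buffer.from(JSON.stringify({ offset, f: filterHash(filters) })).toString("base64");
}

function decodeCursor(cursor: string, filters: Record<string, unknown>): number {
  const parsed = JSON.parse(Buffer.from(cursor, "base64").toString("utf8")) as {
    offset: number;
    f: string;
  };
  if (parsed.f !== filterHash(filters)) {
    throw new Error("Invalid cursor: filters changed since this cursor was issued");
  }
  return parsed.offset;
}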

Default page sizes are intentionally small. A first call might return a dozen tools or ten resources. If the AI really needs more, it can ask for them, but the system never assumes that a full dump is the best starting point.

From a performance perspective, cursor encoding and slice selection are cheap: microseconds to low milliseconds. They are not the bottleneck.

Layer 7: Response Optimization Without Breaking Semantics

The final step in the data plane is what comes back from the underlying APIs.

REST APIs often return more than the model needs: hypermedia links, timestamps, internal IDs, deeply nested relationship wrappers. None of that is inherently wrong, but every extra field consumes tokens and dilutes attention.

The server includes a response optimization layer that can be configured per API: hypermedia links, redundant metadata, and deeply nested wrappers can be trimmed or flattened before the payload ever reaches the model.

Benchmarks show this running in about 25 microseconds per response, far below any user-visible threshold. In standard mode it often trims 10–20% off large JSON payloads; in compact mode the reduction can exceed 70% for some endpoints.

A small circuit breaker protects production: if optimization starts failing repeatedly—due to malformed responses, schema mismatches, or upstream changes—the breaker opens and the server returns raw responses until the issue is investigated. That keeps safety and debuggability high even when upstream services misbehave.
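A rough sketch of the two ideas together, field stripping plus a breaker that falls back to raw responses; the field names and thresholds are illustrative, not the server's actual configuration:

// Strip hypermedia and metadata fields the model rarely needs, and fall back
// to the raw payload if optimization keeps failing.
const STRIP_FIELDS = ["links", "meta", "relationships"];

function optimize(payload: unknown): unknown {
  if (Array.isArray(payload)) return payload.map(optimize);
  if (payload && typeof payload === "object") {
    return Object.fromEntries(
      Object.entries(payload as Record<string, unknown>)
        .filter(([key]) => !STRIP_FIELDS.includes(key))
        .map(([key, value]) => [key, optimize(value)])
    );
  }
  return payload;
}

let consecutiveFailures = 0;
const FAILURE_THRESHOLD = 5;

function optimizeWithBreaker(payload: unknown): unknown {
  if (consecutiveFailures >= FAILURE_THRESHOLD) return payload; // breaker open: raw passthrough
  try {
    const optimized = optimize(payload);
    consecutiveFailures = 0;
    return optimized;
  } catch {
    consecutiveFailures += 1;
    return payload;                                             // fail safe: return raw response
  }
}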

Layer 8: API-Level Filtering, Learned Automatically from OpenAPI

One subtle but important problem in an AI-driven API bridge is filtering. Every API family supports different query parameters, operators, and syntaxes, and those details usually live in scattered docs or tribal knowledge.

Elastic Path’s Products API, for example, expects JSON:API-style filters like filter[name], filter[status], and filter[price:gte], as well as sort and include. Some APIs support enums; some do not. Some fields accept gte / lte; others are exact match only. Historically, an AI assistant has had to guess and learn by error: try filter[price]=100, get a 400, try price=100, still wrong, eventually stumble into filter[price:gte]=100.

In this server, filters are not guessed. They are generated automatically from the OpenAPI specification and exposed back to the AI as structured capabilities on each tool.

During tool generation, the bridge inspects each operation’s query parameters and builds a filter capability model. It looks for common patterns like filter[name], filter[price:gte], sort, include, page[limit], limit, and per_page. For each parameter it extracts type information (string, number, boolean), enum values when present, validation patterns, and whether the field participates in sorting, inclusion, or pagination. That information is attached to the MCP tool as metadata:

{
  "name": "ep_products",
  "description": "Manage Products operations",
  "filterCapabilities": {
    "supportedFilters": {
      "name": {
        "type": "string",
        "operators": ["eq", "ne", "like"],
        "description": "Filter by product name"
      },
      "status": {
        "type": "string",
        "operators": ["eq", "in"],
        "enum": ["active", "inactive", "draft"]
      },
      "price": {
        "type": "number",
        "operators": ["eq", "gt", "lt", "gte", "lte"]
      }
    },
    "sortableFields": ["name", "created_at", "price"],
    "includableRelations": ["images", "variations"],
    "paginationSupport": true
  }
}

Operators are chosen based on the underlying schema type. Strings get equality and “like” semantics, numbers get comparison operators (gt, lt, gte, lte) and in, booleans keep it simple with equality only, and enums inherit eq and in with their allowed value set pulled directly from the spec. That means the assistant knows, before making a call, that status only accepts active, inactive, or draft, and that price can be filtered with range semantics.
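As a sketch, that type-to-operator mapping is a small lookup (the names below are mine, not the bridge's):

// Operator sets are inferred from the schema type; enum fields additionally
// carry their allowed values straight from the spec.
type SchemaType = "string" | "number" | "boolean";

const OPERATORS_BY_TYPE: Record<SchemaType, string[]> = {
  string: ["eq", "ne", "like"],
  number: ["eq", "gt", "lt", "gte", "lte", "in"],
  boolean: ["eq"],
};

function operatorsFor(schemaType: SchemaType, isEnum: boolean): string[] {
  return isEnum ? ["eq", "in"] : OPERATORS_BY_TYPE[schemaType];
}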

On top of this capability model sits an API-aware validation layer. Before a request goes out, the bridge checks each filter against what the operation actually supports. Unknown fields are rejected with a clear error listing the fields that are supported. Unsupported operators are called out by type (“gte is not valid for a string field”), and enum mismatches include the full list of valid values. Invalid filters never reach the upstream API, which saves both tokens and head-scratching.
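A sketch of that validation step against a capability model shaped like the one above (types simplified, error messages illustrative):

interface FilterCapability {
  type: "string" | "number" | "boolean";
  operators: string[];
  enum?: string[];
}

// Reject unknown fields, unsupported operators, and out-of-enum values before
// the request ever reaches the upstream API.
function validateFilter(
  supported: Record<string, FilterCapability>,
  field: string,
  operator: string,
  value: string
): void {
  const cap = supported[field];
  if (!cap) {
    throw new Error(
      `Unknown filter field "${field}". Supported fields: ${Object.keys(supported).join(", ")}`
    );
  }
  if (!cap.operators.includes(operator)) {
    throw new Error(`Operator "${operator}" is not valid for ${cap.type} field "${field}"`);
  }
  if (cap.enum && !cap.enum.includes(value)) {
    throw new Error(`"${value}" is not allowed for "${field}". Valid values: ${cap.enum.join(", ")}`);
  }
}

// e.g. validateFilter(caps, "price", "gte", "100") passes, while
//      validateFilter(caps, "status", "eq", "archived") fails with the valid enum list.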

The bridge also normalizes filter syntax across very different APIs. The assistant can express filters in a consistent, high-level way (implicit equality, JSON:API bracket syntax, or an explicit field:operator:value form), and the server handles the translation. For Elastic Path, that might mean producing filter[name:like]=*shirt*&filter[price:gte]=100. The mapping is inferred from the OpenAPI parameters rather than hard-coded per provider.

In practice, this changes the shape of interactions. Instead of “try a filter, see a 400, adjust, repeat,” the AI can read filter capabilities from the tool description, construct a valid filter set on the first attempt, and rely on the server to enforce correctness. For a typical “filtered product search” flow that previously burned three to five failed calls, this eliminates 67–80% of the wasted tokens just on filter experimentation and pushes success rates for filtered queries into the mid-90% range.

Conceptually, this is the same PCD philosophy applied one level down: filters are part of the API’s context. Rather than dumping ad-hoc docs into descriptions or relying on external knowledge, the bridge derives that context from the spec, exposes it in a compact machine-readable form, and validates usage before any tokens hit the upstream API.

Layer 9: Context-Aware Authentication and Automatic Token Management

All of the above assumes that calls can actually be made. In reality, the bridge fronts multiple APIs, each with its own OAuth2 rules, scopes, and grant types. Admin operations must not be exposed to shopper tokens. Different providers use client_credentials, authorization_code, implicit, or password flows. Token endpoints are rate-limited and add latency.

Authentication needed to be as progressive and context-aware as the tools themselves.

The server classifies every operation into a business context, such as admin-style catalog management versus shopper-facing storefront calls.

Each context maps to an appropriate OAuth2 strategy drawn from the grant types the provider supports: client_credentials, authorization_code, implicit, or password.

At call time, the server doesn’t ask the AI to care about any of that. It reads the tool’s context and selects the right auth strategy automatically. Credentials are pulled from provider-prefixed environment variables (for example, ELASTICPATH_CLIENT_ID), and an internal OAuth2 manager handles token acquisition, caching, and refresh.

Tokens are cached per combination of client, grant type, and scope. Each cache entry tracks expiry and a “refresh ahead” time. When a request comes in, a valid cached token is reused immediately, a token nearing its refresh-ahead window is renewed proactively, and only a true miss triggers a call to the token endpoint.

That design produces 99%+ cache hit rates in practice, cutting token endpoint calls by about 99% and reducing steady-state latency from “token fetch + API call” to just “API call”.
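A sketch of that caching and refresh-ahead logic, with the actual OAuth2 exchange abstracted behind a caller-supplied fetchToken function; the key format and the 80% refresh point are assumptions:

interface CachedToken { accessToken: string; expiresAt: number; refreshAt: number; }

const tokenCache = new Map<string, CachedToken>();

// Tokens are cached per client + grant type + scope; refreshAt sits well before
// expiresAt so refreshes happen ahead of time instead of on failure.
async function getToken(
  clientId: string,
  grantType: string,
  scope: string,
  fetchToken: () => Promise<{ accessToken: string; expiresInSec: number }>
): Promise<string> {
  const key = `${clientId}:${grantType}:${scope}`;
  const cached = tokenCache.get(key);
  const now = Date.now();

  if (cached && now < cached.refreshAt) return cached.accessToken;   // fast path: cache hit

  const fresh = await fetchToken();                                  // miss or near expiry
  tokenCache.set(key, {
    accessToken: fresh.accessToken,
    expiresAt: now + fresh.expiresInSec * 1000,
    refreshAt: now + fresh.expiresInSec * 1000 * 0.8,                // refresh at 80% of lifetime
  });
  return fresh.accessToken;
}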

Security boundaries are enforced using the same context classification. Clients using shopper-oriented grant types never even see admin tools. Admin-only tools and resources are filtered out based on grant type and provider configuration. The result is a kind of zero-trust posture at the MCP layer: even if a client connects, what it can see and do is constrained by its authentication context.

From the model’s perspective, it just calls elasticpath_products with an operation. The server quietly does the rest.

Real-World Impact

It is easy to talk about patterns. It is more useful to look at numbers.

On token efficiency: initialization dropped from roughly 36,000 tokens to about 1,200 (plus ~1,250 tokens of instructions), a reduction of around 96% before the user types a word.

On cost: every conversation now starts tens of thousands of tokens lighter, and far fewer calls are wasted on failed filter attempts, both of which translate directly into lower spend per session.

On AI performance: filtered queries succeed in the mid-90% range on the first attempt instead of burning three to five failed calls, and the model navigates a handful of domain tools rather than scanning hundreds of near-identical names.

On latency and scalability: response optimization runs in about 25 microseconds, cursor handling in microseconds to low milliseconds, and the OAuth2 token cache absorbs 99%+ of token requests, so the bridge adds negligible overhead on top of the upstream APIs.

Viewed together, Progressive Context Disclosure turns “we have a giant OpenAPI spec” from a liability into an asset. The API surface remains large, but the way it is exposed is tuned for how LLMs actually work: starting from intent, discovering a domain, selecting a specific operation, pulling examples or docs on demand, and executing an authenticated call.

Closing Thoughts

The paradox in MCP design is that giving an AI more tools does not automatically make it more capable. At scale, “more” usually means more tokens spent before the first user message, more near-identical names to scan, and more chances to pick the wrong tool.

The Progressive Context Disclosure Pattern is one way to reconcile scale with clarity.

It keeps large API surfaces intact while presenting them in a layered, discoverable structure. It treats context as a finite resource and spends it carefully. It bends the server toward how LLMs actually reason: intent → domain → operation → example → execution.

In this particular implementation, that meant turning a 36,000-token, 300-plus-tool wall of definitions into a ~1,200-token, discovery-first surface that still reaches every operation, every resource, and all 2,907 documentation resources.