How an OpenAPI-to-MCP bridge achieved 96% token reduction without losing discoverability.

Powerful models are not enough. The way we expose tools, APIs, and data to those models determines whether an AI assistant feels precise and responsive, or slow, confused, and expensive.

In this post, I’ll walk through the architecture of an OpenAPI-to-MCP bridge I built that converts large REST APIs into MCP tools. The concrete problem was simple to state but hard to solve:

How do you expose hundreds of API operations without blowing up the context window or confusing the AI?

The naive approach treated every API endpoint as its own MCP tool. With a platform like Elastic Path (disclosure: my place of employment) - which exposes hundreds of granular microservice endpoints across dozens of services - this exploded instantly. More than 300 tools, 157 resources, and 36,000 tokens were required just to enumerate what the server could do. All of that overhead was loaded before the user typed a single word.

By rethinking how context is exposed, I cut the initial token footprint down to roughly 1,200 tokens (600–700 is achievable with aggressive tuning), plus about 1,250 tokens of instructions, while maintaining full discoverability of every operation and keeping all 2,907 documentation resources reachable in a controlled way. The AI still has access to everything, but it learns about those capabilities progressively instead of all at once.

The pattern that emerged from this work is what I call the Progressive Context Disclosure Pattern, or PCD.

From OpenAPI Specs to Context Explosion

The bridge starts with OpenAPI specs. For a large e-commerce platform like Elastic Path, that means dozens of services, hundreds of operations, and thousands of documentation resources.

If you convert each operation directly into an MCP tool with full descriptions, schemas, and examples, you end up with tens of thousands of tokens in tool definitions alone. In a 200K-token context, the naive version of this server consumed around 18% of the entire window at initialization.

The cognitive load was just as bad. From the model’s perspective, it saw a wall of similar-looking tools like ep_pxm_products_list, ep_pxm_products_get, ep_pxm_products_create, repeated across dozens of domains. Technically everything was available; practically, almost nothing was obvious.

On top of that sat a huge “shadow surface” of documentation: schemas, request/response examples, domain guides, workflow docs. Expose all of that at once and you easily push token usage past 40K–50K just to describe what exists.

This is exactly the situation PCD is designed to fix.

The Progressive Context Disclosure Pattern (PCD)

PCD is the architecture that emerged as I tried to make this OpenAPI-to-MCP bridge usable at scale.

The core idea:

Expose just enough structure for the AI to find what it needs, and reveal everything else only when requested.

The implementation is layered, not monolithic. The server:

  1. Starts with a discovery-first surface instead of exposing everything
  2. Uses tag-based grouping to organize operations into business-domain tools
  3. Filters resources by path depth so only high-value endpoints become first-class resources
  4. Relies on cursor-based pagination to keep tool and resource lists small
  5. Treats documentation as on-demand context, not default payload
  6. Implements a dedicated search layer so the AI does not have to “know MCP” to find operations
  7. Optimizes API responses so the model sees what matters, not every bit of metadata
  8. And applies context-aware authentication so each call uses the right OAuth2 flow automatically, with security boundaries enforced by design

All of this sits on a multi-tenant, configuration-driven, MCP-spec compliant server with circuit breakers protecting production.

Layer 1: Discovery-First, Not “Everything-First”

The first shift was philosophical. Instead of turning every endpoint into a separate tool and exposing them all, I introduced the concept of discovery tools.

On initial load, the server exposes a small, fixed set of tools whose entire purpose is to help the model discover everything else: search_operations, available_tags, mcp_resource_read, and a few companions for listing resources, enumerating prompts, and batching requests.

This is the “front door” of the server. Everything else - hundreds of operations and thousands of documentation resources - is reachable through those tools, not pre-loaded as separate tool definitions.

A typical flow looks like:

  1. The AI calls search_operations or available_tags with a query such as “products”, “accounts”, or “create product”.
  2. The server searches across all OpenAPI-derived operations (and optionally documentation) and returns a small, filtered set of matches.
  3. If the AI needs more detail, it calls mcp_resource_read on a docs:// URI to fetch request/response examples, schemas, or conceptual docs.
  4. When it is ready to act, it calls the relevant domain tool (for example, elasticpath_products) with a specific operation value.

From the model’s perspective, the surface area is small and stable: a way to search operations, explore tags, list resources, read docs, enumerate prompts, and batch requests. Everything else is discovered through those entry points.
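As a rough sketch, here is the same flow expressed with the MCP TypeScript SDK client; the create_product operation and its body are illustrative, while the tool names and docs:// URI follow the ones used in this server:

// Assuming an already-connected MCP TypeScript SDK client.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
declare const client: Client;

// 1. Discover: search across every OpenAPI-derived operation.
const matches = await client.callTool({
  name: "search_operations",
  arguments: { query: "create product", operation_type: "write", limit: 10 },
});

// 2. Pull docs on demand (URI format comes from the tool description hints).
const docs = await client.readResource({ uri: "docs://ep_products/request" });

// 3. Act: call the domain tool with a specific operation value.
// "create_product" and the body shape here are illustrative.
const result = await client.callTool({
  name: "elasticpath_products",
  arguments: { operation: "create_product", body: { name: "Blue T-Shirt" } },
});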

In raw numbers, this discovery-first layer alone shrinks the initial token exposure from about 36,000 tokens to roughly 500–600 when aggressively tuned (around 1,200 in the default configuration). That’s a 96–97% reduction before we even optimize how tools are organized.

Layer 2: Tag-Based Grouping — Intelligent Operation Organization

The next problem was tool organization. A flat list of hundreds of tools is not just expensive; it’s cognitively hostile.

The traditional mapping of one tool per operation doesn’t scale. Two hundred operations become two hundred tools. You end up with 17,000+ tokens just for tool definitions, and the AI has to scan a haystack of similarly named tools to find the needle it needs.

The bridge uses OpenAPI tags as the primary dimension for grouping.

In the spec, operations might look like this:

paths:
  /products:
    get:
      tags: ["Products"]
      operationId: "listProducts"
      
  /products/{id}:
    get:
      tags: ["Products"]
      operationId: "getProduct"
    put:
      tags: ["Products"]
      operationId: "updateProduct"

Everything tagged "Products" becomes part of a single MCP tool, such as elasticpath_products. That tool exposes an operation enum parameter that controls which underlying endpoint is executed.

Real-world specs often fragment tags: “Accounts”, “Account Members”, “Account Membership”, “Account Addresses”, “Account Tags”, “Account Cart Associations”. Semantically, those belong to the same domain. So the bridge supports tag consolidation in configuration:
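A sketch of what that consolidation mapping can look like; the configuration shape here is illustrative, not the bridge's actual schema:

// Hypothetical consolidation config: several related OpenAPI tags fold into a
// single MCP domain tool instead of surfacing as six separate tool groups.
const tagConsolidation: Record<string, string[]> = {
  elasticpath_accounts: [
    "Accounts",
    "Account Members",
    "Account Membership",
    "Account Addresses",
    "Account Tags",
    "Account Cart Associations",
  ],
};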

For Elastic Path, this consolidation produces big wins: half a dozen account-related tags collapse into a single elasticpath_accounts tool, and the same pattern repeats across every domain, cutting the tool count dramatically.

Under the hood, each operation has a stable operationId. If the spec defines one, the server uses it. If not, it generates one from the method and path (GET /pxm/products/{id} → get_products_by_id, etc.). When the AI calls:

{
  "tool": "elasticpath_products",
  "arguments": {
    "operation": "get_products_by_id",
    "product_id": "prod_123"
  }
}

The server looks up that operation ID, resolves it to a concrete endpoint, applies authentication, executes the HTTP call, and returns the result.
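A sketch of the fallback ID generation (stripping a known API prefix such as /pxm is my inference from the example above):

// Build a stable fallback operation ID from method + path when the spec
// defines no operationId, e.g. "GET /pxm/products/{id}" -> "get_products_by_id".
function fallbackOperationId(method: string, path: string, prefixes = ["/pxm"]): string {
  let trimmed = path;
  for (const prefix of prefixes) {
    if (trimmed.startsWith(prefix)) trimmed = trimmed.slice(prefix.length);
  }
  const segments = trimmed
    .split("/")
    .filter(Boolean)
    .map((seg) => (seg.startsWith("{") ? `by_${seg.slice(1, -1)}` : seg));
  return [method.toLowerCase(), ...segments].join("_");
}

// fallbackOperationId("GET", "/pxm/products/{id}") === "get_products_by_id"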

This preserves full coverage—no operation is lost—while cutting the number of tools and their token footprint dramatically. It also gives the AI a structured mental model: accounts live under elasticpath_accounts, products under elasticpath_products, and so on.

Layer 3: Depth-Based Resource Filtering

Once tools were tamed, the next issue was resources.

Exposing a resource for every GET endpoint quickly becomes unwieldy. In this system, that could have meant 157 separate resources, plus hundreds of additional schema and example resources.

To keep the default surface area manageable, the server uses path depth as a simple but effective filter: shallow paths become first-class resources, while deeply nested endpoints stay reachable only through search and the domain tools.

This rule is configurable per API, but in practice it often looks like: “expose resources at depth 3 or less for this domain.” Using that rule reduces resource exposure by about 60–70% while still keeping deeper operations available on demand.
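As a sketch, the depth rule amounts to counting path segments against a configurable threshold; the helpers below are illustrative:

// "/products" has depth 1, "/products/{id}" depth 2,
// "/catalogs/{id}/nodes/{nodeId}/relationships" depth 5.
function pathDepth(path: string): number {
  return path.split("/").filter(Boolean).length;
}

// Shallow GET endpoints become first-class MCP resources; deeper ones remain
// reachable through search_operations and the domain tools instead.
function isFirstClassResource(path: string, maxDepth = 3): boolean {
  return pathDepth(path) <= maxDepth;
}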

The end result is that a first page of resources behaves more like an API index than a dump: top-level entities and key entry points, not every deeply nested variant.

Layer 4: Documentation Without Flooding the Model

Underneath tools and resources sits a large documentation layer.

The server generates thousands of documentation resources: endpoint reference docs, request and response examples, schemas, and domain and workflow guides.

In total, there are 2,907 documentation resources. If you simply expose all of them, any search like “create product” returns a wall of docs and examples, often hiding the actual executable operation.

So documentation itself is governed by progressive disclosure.

By default, documentation resources take a back seat to executable operations: they are excluded from ordinary operation searches and only surface when the query explicitly asks for docs, schemas, or examples.

If the AI asks “create product”, the result is intentionally not a pile of docs, because the write operation is a tool. If it asks “create product request” or “product schema”, the server can return the corresponding example and schema resources.

Tool descriptions support this with small hints instead of embedded docs:

See docs://ep_products/request for request examples. See docs://ep_products/endpoint for detailed endpoint documentation.

Those lines are cheap in token terms and act as anchors. A single resources/read call pulls in a schema or example only when needed, instead of baking those examples into every tool description.
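As a sketch, the anchors live directly in the tool metadata rather than as embedded content; the shape below is simplified, and the description mirrors the hint above:

// Doc pointers are cheap one-line anchors inside the tool definition; the
// actual examples are fetched later with a single resources/read call.
const productsTool = {
  name: "ep_products",
  description:
    "Manage Products operations. " +
    "See docs://ep_products/request for request examples. " +
    "See docs://ep_products/endpoint for detailed endpoint documentation.",
  inputSchema: { type: "object", properties: { operation: { type: "string" } } },
};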

The result: the default resource view stays around 1.2K tokens instead of the 40K+ it would cost to surface all 2,907 resources at once. Documentation remains fully accessible, but you only pay for it when you actually use it.

Layer 5: Search and Discoverability — From Simple to Intelligent

The last missing piece is search.

Early on, AI clients needed to understand MCP’s split between tools and resources to find anything: executable operations lived behind tools/list and tools/call, while docs, schemas, and examples lived behind resources/list and resources/read.

That is a terrible developer experience for an LLM. So the bridge introduces a dedicated search_operations tool that provides a unified search interface across both tools and resources.

A typical call looks like this:

{
  "tool": "search_operations",
  "arguments": {
    "query": "create product",
    "operation_type": "write",  // all, read, or write
    "domain": "Products",
    "limit": 20
  }
}

Under the hood, the search tool matches the query against the OpenAPI-derived operations and, when the query calls for it, the documentation resources; applies the operation_type and domain filters; and returns a small ranked list that mixes executable operations with pointers to relevant docs.

From the AI’s perspective, there is one search box: “search for operations”, period. It does not need to “know” which MCP endpoint to call for reads vs writes.

To make this fast and robust, the search system itself evolved in phases.

First, I added a tag index and simple query analytics.

A small in-memory cache keeps recent queries hot, with an LRU eviction policy and a short TTL.

That alone makes tag-based filtering and repeated searches significantly faster, with nanosecond-level overheads on the indexing and sub-millisecond lookups for most queries.
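A minimal sketch of such a cache, using a Map's insertion order for LRU eviction; the size and TTL values are placeholders rather than the server's real settings:

// Tiny LRU cache with TTL for search results. A Map preserves insertion order,
// so the oldest entry is simply the first key when we need to evict.
class SearchCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize = 100, private ttlMs = 30_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (Date.now() > hit.expiresAt) {           // expired: drop it
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                   // re-insert to refresh recency
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}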

Second, I built a tag hierarchy and recommendation layer on top of that index.

This hierarchy is built automatically from the OpenAPI specs and runtime usage patterns; there’s no manual taxonomy work required.

Third, the search engine got smarter about language: queries are normalized and expanded so that natural phrasing maps onto the vocabulary the specs actually use.

The net effect is that search is forgiving and intent-friendly. “Customer orders” expands into a set of tokens the system actually understands and maps to specific domains and operations.

On top of that, the server annotates resources with metadata that can help clients prioritize what to show first. Annotations like:

{
  "uri": "products://{id}",
  "annotations": {
    "audience": ["assistant"],
    "priority": 0.8
  }
}

allow downstream UIs or agents to sort or filter resources based on audience and importance. Admin-heavy tools can be given higher priority for internal assistants; shopper-oriented resources can be highlighted for customer-facing bots.

Search becomes more than a string match; it becomes a routing layer that connects user intent to the right operation, tool, or doc with minimal ceremony.

Layer 6: Cursor-Based Pagination and Stateless Cursors

All of the above relies on the ability to return some results without returning all results.

Both tools and resources are paginated using stateless, cursor-based pagination defined by the MCP spec. Cursors are base64-encoded JSON blobs containing the offset and a hash of the active filters. The AI treats them as opaque. When it needs more results, it passes the cursor back; if the filters have changed or the cursor is invalid, the server responds with a clear MCP error.
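A sketch of how such a cursor can be encoded and checked, assuming Node's Buffer and crypto; the exact payload shape is illustrative:

import { createHash } from "node:crypto";

// Cursor = base64(JSON) of the offset plus a hash of the active filters.
// If the filters change between pages, the hash no longer matches and the
// server can reject the cursor with a clear error instead of returning
// inconsistent pages.
function filterHash(filters: Record<string, unknown>): string {
  return createHash("sha256").update(JSON.stringify(filters)).digest("hex").slice(0, 12);
}

function encodeCursor(offset: number, filters: Record<string, unknown>): string {
  return Buffer.from(JSON.stringify({ offset, f: filterHash(filters) })).toString("base64");
}

function decodeCursor(cursor: string, filters: Record<string, unknown>): number {
  const parsed = JSON.parse(Buffer.from(cursor, "base64").toString("utf8")) as {
    offset: number;
    f: string;
  };
  if (parsed.f !== filterHash(filters)) {
    throw new Error("Invalid cursor: filters changed since this cursor was issued");
  }
  return parsed.offset;
}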

Default page sizes are intentionally small. A first call might return a dozen tools or ten resources. If the AI really needs more, it can ask for them, but the system never assumes that a full dump is the best starting point.

From a performance perspective, cursor encoding and slice selection are cheap: microseconds to low milliseconds. They are not the bottleneck.

Layer 7: Response Optimization Without Breaking Semantics

The final step in the data plane is what comes back from the underlying APIs.

REST APIs often return more than the model needs: hypermedia links, timestamps, internal IDs, deeply nested relationship wrappers. None of that is inherently wrong, but every extra field consumes tokens and dilutes attention.

The server includes a response optimization layer that can be configured per API: hypermedia links, redundant metadata, and deeply nested wrappers can be trimmed or flattened before the payload ever reaches the model.

Benchmarks show this running in about 25 microseconds per response, far below any user-visible threshold. In standard mode it often trims 10–20% off large JSON payloads; in compact mode the reduction can exceed 70% for some endpoints.

A small circuit breaker protects production: if optimization starts failing repeatedly—due to malformed responses, schema mismatches, or upstream changes—the breaker opens and the server returns raw responses until the issue is investigated. That keeps safety and debuggability high even when upstream services misbehave.
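A rough sketch of the two ideas together, field stripping plus a breaker that falls back to raw responses; the field names and thresholds are illustrative, not the server's actual configuration:

// Strip hypermedia and metadata fields the model rarely needs, and fall back
// to the raw payload if optimization keeps failing.
const STRIP_FIELDS = ["links", "meta", "relationships"];

function optimize(payload: unknown): unknown {
  if (Array.isArray(payload)) return payload.map(optimize);
  if (payload && typeof payload === "object") {
    return Object.fromEntries(
      Object.entries(payload as Record<string, unknown>)
        .filter(([key]) => !STRIP_FIELDS.includes(key))
        .map(([key, value]) => [key, optimize(value)])
    );
  }
  return payload;
}

let consecutiveFailures = 0;
const FAILURE_THRESHOLD = 5;

function optimizeWithBreaker(payload: unknown): unknown {
  if (consecutiveFailures >= FAILURE_THRESHOLD) return payload; // breaker open: raw passthrough
  try {
    const optimized = optimize(payload);
    consecutiveFailures = 0;
    return optimized;
  } catch {
    consecutiveFailures += 1;
    return payload;                                             // fail safe: return raw response
  }
}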

Layer 8: API-Level Filtering, Learned Automatically from OpenAPI

One subtle but important problem in an AI-driven API bridge is filtering. Every API family supports different query parameters, operators, and syntaxes, and those details usually live in scattered docs or tribal knowledge.

Elastic Path’s Products API, for example, expects JSON:API-style filters like filter[name], filter[status], and filter[price:gte], as well as sort and include. Some APIs support enums; some do not. Some fields accept gte / lte; others are exact match only. Historically, an AI assistant has had to guess and learn by error: try filter[price]=100, get a 400, try price=100, still wrong, eventually stumble into filter[price:gte]=100.

In this server, filters are not guessed. They are generated automatically from the OpenAPI specification and exposed back to the AI as structured capabilities on each tool.

During tool generation, the bridge inspects each operation’s query parameters and builds a filter capability model. It looks for common patterns like filter[name], filter[price:gte], sort, include, page[limit], limit, and per_page. For each parameter it extracts type information (string, number, boolean), enum values when present, validation patterns, and whether the field participates in sorting, inclusion, or pagination. That information is attached to the MCP tool as metadata:

{
  "name": "ep_products",
  "description": "Manage Products operations",
  "filterCapabilities": {
    "supportedFilters": {
      "name": {
        "type": "string",
        "operators": ["eq", "ne", "like"],
        "description": "Filter by product name"
      },
      "status": {
        "type": "string",
        "operators": ["eq", "in"],
        "enum": ["active", "inactive", "draft"]
      },
      "price": {
        "type": "number",
        "operators": ["eq", "gt", "lt", "gte", "lte"]
      }
    },
    "sortableFields": ["name", "created_at", "price"],
    "includableRelations": ["images", "variations"],
    "paginationSupport": true
  }
}

Operators are chosen based on the underlying schema type. Strings get equality and “like” semantics, numbers get comparison operators (gt, lt, gte, lte) and in, booleans keep it simple with equality only, and enums inherit eq and in with their allowed value set pulled directly from the spec. That means the assistant knows, before making a call, that status only accepts active, inactive, or draft, and that price can be filtered with range semantics.
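As a sketch, that type-to-operator mapping is a small lookup (the names below are mine, not the bridge's):

// Operator sets are inferred from the schema type; enum fields additionally
// carry their allowed values straight from the spec.
type SchemaType = "string" | "number" | "boolean";

const OPERATORS_BY_TYPE: Record<SchemaType, string[]> = {
  string: ["eq", "ne", "like"],
  number: ["eq", "gt", "lt", "gte", "lte", "in"],
  boolean: ["eq"],
};

function operatorsFor(schemaType: SchemaType, isEnum: boolean): string[] {
  return isEnum ? ["eq", "in"] : OPERATORS_BY_TYPE[schemaType];
}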

On top of this capability model sits an API-aware validation layer. Before a request goes out, the bridge checks each filter against what the operation actually supports. Unknown fields are rejected with a clear error listing the fields that are supported. Unsupported operators are called out by type (“gte is not valid for a string field”), and enum mismatches include the full list of valid values. Invalid filters never reach the upstream API, which saves both tokens and head-scratching.
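A sketch of that validation step against a capability model shaped like the one above (types simplified, error messages illustrative):

interface FilterCapability {
  type: "string" | "number" | "boolean";
  operators: string[];
  enum?: string[];
}

// Reject unknown fields, unsupported operators, and out-of-enum values before
// the request ever reaches the upstream API.
function validateFilter(
  supported: Record<string, FilterCapability>,
  field: string,
  operator: string,
  value: string
): void {
  const cap = supported[field];
  if (!cap) {
    throw new Error(
      `Unknown filter field "${field}". Supported fields: ${Object.keys(supported).join(", ")}`
    );
  }
  if (!cap.operators.includes(operator)) {
    throw new Error(`Operator "${operator}" is not valid for ${cap.type} field "${field}"`);
  }
  if (cap.enum && !cap.enum.includes(value)) {
    throw new Error(`"${value}" is not allowed for "${field}". Valid values: ${cap.enum.join(", ")}`);
  }
}

// e.g. validateFilter(caps, "price", "gte", "100") passes, while
//      validateFilter(caps, "status", "eq", "archived") fails with the valid enum list.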

The bridge also normalizes filter syntax across very different APIs. The assistant can express filters in a consistent, high-level way (implicit equality, JSON:API bracket syntax, or an explicit field:operator:value form), and the server handles the translation. For Elastic Path, that might mean producing filter[name:like]=*shirt*&filter[price:gte]=100. The mapping is inferred from the OpenAPI parameters rather than hard-coded per provider.

In practice, this changes the shape of interactions. Instead of “try a filter, see a 400, adjust, repeat,” the AI can read filter capabilities from the tool description, construct a valid filter set on the first attempt, and rely on the server to enforce correctness. For a typical “filtered product search” flow that previously burned three to five failed calls, this eliminates 67–80% of the wasted tokens just on filter experimentation and pushes success rates for filtered queries into the mid-90% range.

Conceptually, this is the same PCD philosophy applied one level down: filters are part of the API’s context. Rather than dumping ad-hoc docs into descriptions or relying on external knowledge, the bridge derives that context from the spec, exposes it in a compact machine-readable form, and validates usage before any tokens hit the upstream API.

Layer 9: Context-Aware Authentication and Automatic Token Management

All of the above assumes that calls can actually be made. In reality, the bridge fronts multiple APIs, each with its own OAuth2 rules, scopes, and grant types. Admin operations must not be exposed to shopper tokens. Different providers use client_credentials, authorization_code, implicit, or password flows. Token endpoints are rate-limited and add latency.

Authentication needed to be as progressive and context-aware as the tools themselves.

The server classifies every operation into a business context, such as admin-style catalog management versus shopper-facing storefront calls.

Each context maps to an appropriate OAuth2 strategy drawn from the grant types the provider supports: client_credentials, authorization_code, implicit, or password.

At call time, the server doesn’t ask the AI to care about any of that. It reads the tool’s context and selects the right auth strategy automatically. Credentials are pulled from provider-prefixed environment variables (for example, ELASTICPATH_CLIENT_ID), and an internal OAuth2 manager handles token acquisition, caching, and refresh.

Tokens are cached per combination of client, grant type, and scope. Each cache entry tracks expiry and a “refresh ahead” time. When a request comes in, a valid cached token is reused immediately, a token nearing its refresh-ahead window is renewed proactively, and only a true miss triggers a call to the token endpoint.

That design produces 99%+ cache hit rates in practice, cutting token endpoint calls by about 99% and reducing steady-state latency from “token fetch + API call” to just “API call”.
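A sketch of that caching and refresh-ahead logic, with the actual OAuth2 exchange abstracted behind a caller-supplied fetchToken function; the key format and the 80% refresh point are assumptions:

interface CachedToken { accessToken: string; expiresAt: number; refreshAt: number; }

const tokenCache = new Map<string, CachedToken>();

// Tokens are cached per client + grant type + scope; refreshAt sits well before
// expiresAt so refreshes happen ahead of time instead of on failure.
async function getToken(
  clientId: string,
  grantType: string,
  scope: string,
  fetchToken: () => Promise<{ accessToken: string; expiresInSec: number }>
): Promise<string> {
  const key = `${clientId}:${grantType}:${scope}`;
  const cached = tokenCache.get(key);
  const now = Date.now();

  if (cached && now < cached.refreshAt) return cached.accessToken;   // fast path: cache hit

  const fresh = await fetchToken();                                  // miss or near expiry
  tokenCache.set(key, {
    accessToken: fresh.accessToken,
    expiresAt: now + fresh.expiresInSec * 1000,
    refreshAt: now + fresh.expiresInSec * 1000 * 0.8,                // refresh at 80% of lifetime
  });
  return fresh.accessToken;
}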

Security boundaries are enforced using the same context classification. Clients using shopper-oriented grant types never even see admin tools. Admin-only tools and resources are filtered out based on grant type and provider configuration. The result is a kind of zero-trust posture at the MCP layer: even if a client connects, what it can see and do is constrained by its authentication context.

From the model’s perspective, it just calls elasticpath_products with an operation. The server quietly does the rest.

Real-World Impact

It is easy to talk about patterns. It is more useful to look at numbers.

On token efficiency: initialization dropped from roughly 36,000 tokens to about 1,200 (plus ~1,250 tokens of instructions), a reduction of around 96% before the user types a word.

On cost: every conversation now starts tens of thousands of tokens lighter, and far fewer calls are wasted on failed filter attempts, both of which translate directly into lower spend per session.

On AI performance: filtered queries succeed in the mid-90% range on the first attempt instead of burning three to five failed calls, and the model navigates a handful of domain tools rather than scanning hundreds of near-identical names.

On latency and scalability: response optimization runs in about 25 microseconds, cursor handling in microseconds to low milliseconds, and the OAuth2 token cache absorbs 99%+ of token requests, so the bridge adds negligible overhead on top of the upstream APIs.

Viewed together, Progressive Context Disclosure turns “we have a giant OpenAPI spec” from a liability into an asset. The API surface remains large, but the way it is exposed is tuned for how LLMs actually work: starting from intent, discovering a domain, selecting a specific operation, pulling examples or docs on demand, and executing an authenticated call.

Closing Thoughts

The paradox in MCP design is that giving an AI more tools does not automatically make it more capable. At scale, “more” usually means more tokens spent before the first user message, more near-identical names to scan, and more chances to pick the wrong tool.

The Progressive Context Disclosure Pattern is one way to reconcile scale with clarity.

It keeps large API surfaces intact while presenting them in a layered, discoverable structure. It treats context as a finite resource and spends it carefully. It bends the server toward how LLMs actually reason: intent → domain → operation → example → execution.

In this particular implementation, that meant turning a 36,000-token, 300-plus-tool wall of definitions into a ~1,200-token, discovery-first surface that still reaches every operation, every resource, and all 2,907 documentation resources.