People no longer search just to click - they ask to decide: “Where can I get cacio e pepe near the Colosseum right now? Do they have gluten-free?” Answer engines (ChatGPT, Gemini, Perplexity, Bing Copilot, Apple Intelligence) compile a single response from multiple sources using retrieval-augmented generation (RAG), vertical knowledge graphs, and structured snippets. If your site is easy to parse, you get included and cited. If not, you’re invisible.

This isn’t classical SEO theatre. It’s machine readability engineering.


What Actually Happens Under the Hood (High-Level Pipeline)

Real-world AI answer engines vary, but the architecture rhymes:

  1. Crawling & Fetching
    • Respect robots.txt, sitemaps, canonical URLs.
    • Prefer static HTML or server-side rendered (SSR/SSG) content. Hydration-only JS apps (client-side rendered SPAs) get less love unless prerendered.
  2. Content Extraction
    • Boilerplate removal (nav, footers, cookie banners) using DOM heuristics (Readability-like algorithms), visual density, and CSS role/ARIA hints.
    • Main content scoring via tag semantics (<article>, <main>, <h1..h6>, <section>), heading hierarchy, and text-to-link ratios.
  3. Normalization
    • Language detection, de-duplication, canonicalization, currency/unit normalization, timezone resolution (critical for hours/menus/events).
  4. Structuring
    • Parse schema.org JSON-LD/Microdata/RDFa.
    • Promote entity slots (Restaurant, Menu, MenuSection, MenuItem, Offer, Price, HoursSpecification, GeoCoordinates).
    • Build internal knowledge graph edges: (Dish) —servedAt→ (Restaurant) —locatedIn→ (Rome).
  5. Indexing
    • Build sparse and dense indexes per section/chunk (BM25 term index + vector embeddings).
    • Store table-like data separately (menu items, prices) to answer structured queries quickly.
  6. Retrieval at Query Time
    • Query understanding → retrieve top-k chunks via hybrid search (BM25 + cosine sim) + schema filters (e.g., @type=MenuItem).
    • Assemble context windows with citations and attribution-friendly snippets.
  7. Generation
    • LLM answers with citations; when structured slots exist, they’re privileged over plain text. If you have structured data, you win tie-breaks.

The theme: structure beats prose.
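The retrieval step (6) can be sketched as a toy hybrid ranker. The weighting, the stand-in scoring functions, and the two-chunk corpus are illustrative assumptions, not any engine's real formula:

```python
from collections import Counter
import math

def sparse_score(query, doc):
    """Keyword overlap - a crude stand-in for BM25 term matching."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def dense_score(query, doc):
    """Cosine over bag-of-words vectors - a stand-in for embedding similarity."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def hybrid_rank(query, chunks, alpha=0.5):
    """Blend sparse and dense scores, as in the hybrid search of step 6."""
    scored = [(alpha * sparse_score(query, c) + (1 - alpha) * dense_score(query, c), c)
              for c in chunks]
    return [c for _, c in sorted(scored, reverse=True)]

chunks = [
    "Cacio e Pepe - tonnarelli, Pecorino Romano DOP, black pepper. 12 EUR",
    "Our story: a family trattoria near the Colosseum since 1978.",
]
print(hybrid_rank("cacio e pepe price", chunks)[0])
```

The menu chunk wins because it matches both lexically and by vector similarity - which is exactly why dish-level sections with self-contained text retrieve well.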


The Algorithms & Signals That Matter (and How to Feed Them)

1) DOM Semantics → Main Content Scoring

Many extractors score blocks by tag semantics, heading depth, text density, and text-to-link ratio (Readability-style heuristics). Give them unambiguous landmarks.

Do:

<main>
  <article>
    <header>
      <h1>Trattoria Aurelia — Roman Classics Since 1978</h1>
      <p class="lede">Handmade pasta, seasonal produce, and late-night hours near the Colosseum.</p>
    </header>
    <section>
      <h2>Menu</h2>
      <ul>
        <li>
          <h3>Cacio e Pepe</h3>
          <p>Fresh tonnarelli with Pecorino Romano DOP and Tellicherry pepper.</p>
          <p><strong>€12</strong></p>
        </li>
      </ul>
    </section>
  </article>
</main>

Don’t: nest the entire page inside <div class="container"> with custom class names only. Models lose strong hints about what’s important.


2) Structured Data (schema.org) → Entity Extraction

LLM pipelines elevate JSON-LD above free text. Use Restaurant + Menu + MenuItem + Offer + OpeningHoursSpecification. Include sameAs, geo, priceRange, servesCuisine, acceptsReservations.

Full Menu Example (JSON-LD)

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Restaurant",
  "name": "Trattoria Aurelia",
  "servesCuisine": ["Italian", "Roman"],
  "priceRange": "€€",
  "telephone": "+39 06 5555 1234",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Via dei Fori Imperiali 12",
    "addressLocality": "Rome",
    "addressRegion": "RM",
    "postalCode": "00184",
    "addressCountry": "IT"
  },
  "geo": {"@type": "GeoCoordinates", "latitude": 41.8902, "longitude": 12.4922},
  "acceptsReservations": "Yes",
  "openingHoursSpecification": [
    {"@type": "OpeningHoursSpecification", "dayOfWeek": ["Monday","Tuesday","Wednesday","Thursday","Friday"], "opens": "11:30", "closes": "23:00", "validFrom": "2025-01-01"},
    {"@type": "OpeningHoursSpecification", "dayOfWeek": ["Saturday","Sunday"], "opens": "11:30", "closes": "01:00"}
  ],
  "menu": {
    "@type": "Menu",
    "name": "Dinner Menu — Summer",
    "hasMenuSection": [
      {
        "@type": "MenuSection",
        "name": "Pasta",
        "hasMenuItem": [
          {
            "@type": "MenuItem",
            "name": "Cacio e Pepe",
            "description": "Tonnarelli, Pecorino Romano DOP, black pepper",
            "offers": {"@type": "Offer", "priceCurrency": "EUR", "price": 12}
          },
          {
            "@type": "MenuItem",
            "name": "Amatriciana",
            "description": "Guanciale, tomato, Pecorino Romano",
            "offers": {"@type": "Offer", "priceCurrency": "EUR", "price": 13}
          }
        ]
      }
    ]
  },
  "sameAs": [
    "https://maps.google.com/?cid=...",
    "https://www.instagram.com/trattoriaaurelia",
    "https://www.facebook.com/trattoriaaurelia"
  ]
}
</script>

Why it works: Retrieval systems can slot-answer queries like “price of cacio e pepe at Trattoria Aurelia” without re-parsing prose.
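A sketch of why this works: once the JSON-LD is parsed, "price of cacio e pepe" becomes a dictionary walk rather than prose extraction. The data is trimmed from the example above; the helper name is hypothetical:

```python
import json

jsonld = json.loads("""
{
  "@type": "Restaurant",
  "name": "Trattoria Aurelia",
  "menu": {"@type": "Menu", "hasMenuSection": [{
    "@type": "MenuSection", "name": "Pasta", "hasMenuItem": [
      {"@type": "MenuItem", "name": "Cacio e Pepe",
       "offers": {"@type": "Offer", "priceCurrency": "EUR", "price": 12}}
    ]}]}
}
""")

def price_of(dish, restaurant):
    """Slot answering: walk Menu -> MenuSection -> MenuItem -> Offer."""
    for section in restaurant["menu"]["hasMenuSection"]:
        for item in section["hasMenuItem"]:
            if item["name"].lower() == dish.lower():
                offer = item["offers"]
                return f'{offer["price"]} {offer["priceCurrency"]}'
    return None

print(price_of("cacio e pepe", jsonld))  # → 12 EUR
```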


3) Content Chunking → Embedding Recall

Retrieval pipelines embed content in chunks, not whole pages. Oversized chunks dilute recall; too many tiny chunks lose context.

Practice: give each dish its own <section> with a stable id, so chunkers split at heading boundaries and the anchor survives as a citation target:

<section id="cacio-e-pepe">
  <h3>Cacio e Pepe</h3>
  <p>...</p>
</section>
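That practice can be sketched as a splitter that emits one chunk per section and keeps the id as a citable URL fragment. This is a toy regex version; production extractors walk the DOM:

```python
import re

def chunk_sections(html):
    """One chunk per <section id=...>: (anchor, plain text) pairs,
    so a retrieved chunk still carries its citation anchor."""
    chunks = []
    for match in re.finditer(r'<section id="([^"]+)">(.*?)</section>', html, re.S):
        anchor, body = match.group(1), match.group(2)
        text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", body)).strip()
        chunks.append((f"#{anchor}", text))
    return chunks

html = """
<section id="cacio-e-pepe"><h3>Cacio e Pepe</h3>
<p>Tonnarelli, Pecorino Romano DOP, black pepper. 12 EUR</p></section>
<section id="amatriciana"><h3>Amatriciana</h3>
<p>Guanciale, tomato, Pecorino Romano. 13 EUR</p></section>
"""
print(chunk_sections(html)[0])
```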

4) Tabular Data → Parseable Tables, Not Pictures

If prices/hours appear in tables, keep them as semantic tables or lists. Avoid images/PDFs.

<table>
  <caption>Dinner Prices</caption>
  <thead>
    <tr><th>Dish</th><th>Price (EUR)</th><th>Allergens</th></tr>
  </thead>
  <tbody>
    <tr><td>Cacio e Pepe</td><td>12</td><td>Milk, Gluten</td></tr>
    <tr><td>Amatriciana</td><td>13</td><td>Milk, Pork, Gluten</td></tr>
  </tbody>
</table>

Bonus: mirror the data in JSON-LD so both text and structure exist.
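Semantic tables pay off because even a stdlib parser can recover the rows. A minimal sketch against the table above:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <td>/<th> cell text grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

parser = TableRows()
parser.feed("""
<table><caption>Dinner Prices</caption>
<thead><tr><th>Dish</th><th>Price (EUR)</th><th>Allergens</th></tr></thead>
<tbody><tr><td>Cacio e Pepe</td><td>12</td><td>Milk, Gluten</td></tr></tbody>
</table>
""")
print(parser.rows)
```

A screenshot of the same table yields nothing without OCR; the markup yields a queryable list of rows for free.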


5) Multilingual & Locale Signals → Correct Matching

Query: “Where to eat cacio e pepe near the Colosseum, up to 15 euros?” To match it, declare language explicitly (<html lang> plus hreflang alternates), use ISO 4217 currency codes (EUR in priceCurrency, not a bare “€”), and keep opening hours unambiguous in local time.


6) Canonicalization, Sitemaps, and Change Hints

Retrievers prioritize freshness for operational facts (hours, prices, availability). Keep canonical URLs stable, update your sitemap’s lastmod when facts change, and serve accurate Last-Modified/ETag headers so crawlers can detect changes cheaply.


Non‑Trivial, Production‑Level Patterns

A) Disambiguate Similar Entities with @id and Stable Anchors

If you have two venues (Trastevere vs. Monti), give each a stable @id and specific geo.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@id": "https://trattoriaaurelia.it/locations/monti#restaurant",
  "@type": "Restaurant",
  "name": "Trattoria Aurelia — Monti",
  "geo": {"@type": "GeoCoordinates", "latitude": 41.894, "longitude": 12.494},
  "address": {"@type": "PostalAddress", "addressLocality": "Rome"}
}
</script>

Now an AI can answer: “Which branch serves cacio e pepe past midnight?” by joining MenuItem with the Monti branch’s hours.

B) Recipe/Allergen Knowledge → suitableForDiet & RestrictedDiet

{
  "@context": "https://schema.org",
  "@type": "MenuItem",
  "name": "Cacio e Pepe",
  "suitableForDiet": ["https://schema.org/VegetarianDiet"],
  "menuAddOn": [
    {"@type": "MenuItem", "name": "Gluten-free pasta", "offers": {"@type": "Offer", "price": 2, "priceCurrency": "EUR"}}
  ]
}

This lets assistants answer “Is their cacio e pepe vegetarian? Can I get gluten-free?” without hallucinating.

C) Temporal Facts → Validity Windows

Menus change. Model pipelines weight recent structured facts.

{
  "@type": "Offer",
  "price": 12,
  "priceCurrency": "EUR",
  "priceValidUntil": "2025-10-01"
}

Pair with sitemap.xml updates so freshness signals align.

D) FAQ Blocks → Extractive Answer Boosters

LLMs love Q&A pairs. Provide FAQPage with concise answers and link to canonical detail.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Do you accept walk-ins after 11pm?",
      "acceptedAnswer": {"@type": "Answer", "text": "Yes, until 12:30am on weekends."}
    }
  ]
}
</script>

E) Events & Specials → Event Markup with Bookable Offers

{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Truffle Week",
  "startDate": "2025-11-02",
  "endDate": "2025-11-09",
  "location": {"@type": "Place", "name": "Trattoria Aurelia — Monti"},
  "offers": {"@type": "Offer", "url": "https://trattoriaaurelia.it/reserve?event=truffle-week"}
}

Assistants can now answer “Any truffle events in Rome next week?” and deep-link bookings.


How This Maps to RAG & Embeddings

Internal Linking = Graph Strength

Think in graphs: link dish → origin story → ingredient sourcing → allergen policy. Each link with a descriptive anchor boosts entity resolution and gives retrievers semantic hops.
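Those links can be pictured as a tiny entity graph where descriptive anchors become labeled edges a retriever can hop. The entities and relations here are illustrative:

```python
# Edges mirror descriptive internal links: (source, relation, target).
edges = [
    ("Cacio e Pepe", "servedAt", "Trattoria Aurelia — Monti"),
    ("Cacio e Pepe", "madeWith", "Pecorino Romano DOP"),
    ("Pecorino Romano DOP", "containsAllergen", "Milk"),
]

def hop(start, relation):
    """Follow one labeled edge from an entity."""
    return [t for s, r, t in edges if s == start and r == relation]

# Two semantic hops: dish → ingredient → allergen.
ingredient = hop("Cacio e Pepe", "madeWith")[0]
print(hop(ingredient, "containsAllergen"))  # → ['Milk']
```

Generic anchors like “click here” contribute no edge label; “our Pecorino Romano sourcing” does.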


Pitfalls That Break AI Parsing (Seen in Production)

  1. Menus as PDFs/JPEGs - invisible. Provide HTML + JSON-LD mirror.
  2. Prices rendered only client-side - crawlers time out on hydration or block XHR; SSR your critical content.
  3. Infinite scroll for core info - put the essentials above-the-fold in the initial HTML.
  4. Locale confusion - 12,00 vs 12.00, ambiguous timezones. Normalize in JSON-LD.
  5. Over-nested div soup - no semantic hints; extractors miss the main content.
  6. Duplicated pages without canonical - embeddings split authority; citations fragment.
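Pitfall 4 is cheap to prevent at publish time. A minimal normalizer sketch for European vs. US decimal styles - a heuristic, not a substitute for a real locale library:

```python
import re

def normalize_price(text):
    """Heuristically parse '€12', '12,00 €', or 'EUR 12.00' into (amount, currency)."""
    currency = "EUR" if ("€" in text or "EUR" in text.upper()) else None
    match = re.search(r"\d+(?:[.,]\d{1,2})?", text)
    if not match:
        return None
    amount = match.group(0).replace(",", ".")  # treat comma as decimal separator
    return float(amount), currency

print(normalize_price("12,00 €"))    # → (12.0, 'EUR')
print(normalize_price("EUR 12.00"))  # → (12.0, 'EUR')
```

Better still: publish the normalized value once in JSON-LD (price: 12, priceCurrency: "EUR") so nobody has to guess.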

A Concrete Checklist

  1. <html lang> set; localized alternates with hreflang.
  2. SSR/SSG for critical text (menu, hours, address, phone).
  3. <main>, <article>, correct H1/H2 hierarchy.
  4. schema.org JSON-LD for your vertical (Restaurant, Product, Event, FAQ, Review).
  5. Each item (dish, product) gets its own <section id="…"> and anchor link.
  6. Tables are <table>, not screenshots.
  7. Sitemap with lastmod; stable canonical URLs.
  8. Use @id, sameAs, geo, openingHoursSpecification.
  9. Provide allergens and dietary flags.
  10. Test with multiple parsers (Google Rich Results test + a Readability clone + your own micro RAG script).

Bonus: Build a Tiny RAG Parser to “Think Like an Answer Engine”

You can locally simulate how assistants will see your page:

  1. Fetch your rendered HTML (SSR).
  2. Run Readability (or a Node clone) to extract main content.
  3. Parse JSON-LD via a micro schema.org extractor.
  4. Chunk by headings; embed each chunk (any open-source SentenceTransformer).
  5. Ask queries; retrieve top-k; verify the text is sufficient without images/JS.

Once your page answers your own local RAG, you’ve cleared the biggest hurdle.
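Steps 4-5 can be approximated offline with token overlap standing in for embeddings. The corpus and scoring are toy assumptions, but the test is real: if this retrieves the right chunk from your HTML alone, a proper embedding model usually will too:

```python
import re
from collections import Counter

def chunks_from_html(html):
    """Step 4: split content at headings, strip tags, collapse whitespace."""
    parts = re.split(r"(?=<h[1-6][ >])", html)
    cleaned = (re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", p)).strip() for p in parts)
    return [c for c in cleaned if c]

def retrieve(query, chunks, k=1):
    """Step 5: rank chunks by token overlap (stand-in for embedding search)."""
    q = Counter(re.findall(r"\w+", query.lower()))
    def score(chunk):
        d = Counter(re.findall(r"\w+", chunk.lower()))
        return sum(min(q[t], d[t]) for t in q)
    return sorted(chunks, key=score, reverse=True)[:k]

html = """
<h2>Hours</h2><p>Mon-Fri 11:30-23:00, weekends until 01:00.</p>
<h2>Menu</h2><p>Cacio e Pepe, 12 EUR. Amatriciana, 13 EUR.</p>
"""
print(retrieve("how much is cacio e pepe", chunks_from_html(html)))
```

Run your real queries against your real rendered HTML; any question that retrieves the wrong chunk here is a question an answer engine may get wrong too.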



Metrics and Monitoring

If you can’t measure whether assistants picked you as a source, you’re guessing.



Semantic Markup as an AI Contract

Semantic HTML and structured data are not cosmetics; they’re a contract with machine readers. When you honor it - clear semantics, honest structure, fresh metadata - you’re easy to retrieve, easy to cite, and hard to hallucinate.

Good markup today is discoverability tomorrow. Ship it.