People no longer search just to click - they ask to decide: “Where can I get cacio e pepe near the Colosseum right now? Do they have gluten-free?” Answer engines (ChatGPT, Gemini, Perplexity, Bing Copilot, Apple Intelligence) compile a single response from multiple sources using retrieval-augmented generation (RAG), vertical knowledge graphs, and structured snippets. If your site is easy to parse, you get included and cited. If not, you’re invisible.
This isn’t classical SEO theatre. It’s machine readability engineering.
What Actually Happens Under the Hood (High-Level Pipeline)
Real-world AI answer engines vary, but the architecture rhymes:
- Crawling & Fetching
  - Respect `robots.txt`, sitemaps, and canonical URLs.
  - Prefer static HTML or server-side rendered (SSR/SSG) content. Hydration-only JavaScript apps get less love unless prerendered.
- Content Extraction
  - Boilerplate removal (nav, footers, cookie banners) using DOM heuristics (Readability-like algorithms), visual density, and CSS role/ARIA hints.
  - Main content scoring via tag semantics (`<article>`, `<main>`, `<h1>`–`<h6>`, `<section>`), heading hierarchy, and text-to-link ratios.
- Normalization
  - Language detection, de-duplication, canonicalization, currency/unit normalization, timezone resolution (critical for hours/menus/events).
- Structuring
  - Parse schema.org JSON-LD/Microdata/RDFa.
  - Promote entity slots (Restaurant, Menu, MenuSection, MenuItem, Offer, Price, OpeningHoursSpecification, GeoCoordinates).
  - Build internal knowledge graph edges: (Dish) —servedAt→ (Restaurant) —locatedIn→ (Rome).
- Indexing
  - Create sparse/dense embeddings per section/chunk (BM25 + vector embeddings).
  - Store table-like data separately (menu items, prices) to answer structured queries quickly.
- Retrieval at Query Time
  - Query understanding → retrieve top-k chunks via hybrid search (BM25 + cosine similarity) + schema filters (e.g., `@type=MenuItem`).
  - Assemble context windows with citations and attribution-friendly snippets.
- Generation
  - The LLM answers with citations; when structured slots exist, they’re privileged over plain text. If you have structured data, you win tie-breaks.
The theme: structure beats prose.
The Algorithms & Signals That Matter (and How to Feed Them)
1) DOM Semantics → Main Content Scoring
Many extractors score blocks by:
- Heading depth (`<h1>` near the top, reasonable `<h2>`/`<h3>` nesting).
- Semantic containers (`<article>`, `<main>`, `<section role="region">`).
- Density (characters per block, lower link density, fewer repeated patterns).
- ARIA roles (`role="main"`, `role="article"`).
Do:

```html
<main>
  <article>
    <header>
      <h1>Trattoria Aurelia — Roman Classics Since 1978</h1>
      <p class="lede">Handmade pasta, seasonal produce, and late-night hours near the Colosseum.</p>
    </header>
    <section>
      <h2>Menu</h2>
      <ul>
        <li>
          <h3>Cacio e Pepe</h3>
          <p>Fresh tonnarelli with Pecorino Romano DOP and Tellicherry pepper.</p>
          <p><strong>€12</strong></p>
        </li>
      </ul>
    </section>
  </article>
</main>
```
Don’t: nest the entire page inside `<div class="container">` with custom class names only. Models lose strong hints about what’s important.
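You can sanity-check these signals with a toy scorer. This is a sketch: the function, weights, and sample blocks are illustrative, and real extractors walk the full DOM and combine many more features.

```python
def block_score(text: str, link_text: str, heading_bonus: float = 0.0) -> float:
    """Toy main-content score: reward long text with low link density.

    `link_text` is the subset of `text` that sits inside <a> tags;
    nav bars are nearly all links, body copy almost none.
    """
    if not text:
        return 0.0
    link_density = len(link_text) / len(text)
    return len(text) * (1.0 - link_density) + heading_bonus

# A nav block (all link text) vs. a body paragraph (no links, under a heading):
nav_score = block_score("Home About Menu Contact", "Home About Menu Contact")
body_score = block_score(
    "Fresh tonnarelli with Pecorino Romano DOP and Tellicherry pepper.", "",
    heading_bonus=10.0,
)
print(nav_score < body_score)  # True
```

Run this against your own page blocks: if your menu paragraph doesn’t beat your nav bar by a wide margin, extractors will struggle too.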
2) Structured Data (schema.org) → Entity Extraction
LLM pipelines elevate JSON-LD above free text. Use `Restaurant` + `Menu` + `MenuItem` + `Offer` + `OpeningHoursSpecification`. Include `sameAs`, `geo`, `priceRange`, `servesCuisine`, and `acceptsReservations`.
Full Menu Example (JSON-LD)
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Restaurant",
  "name": "Trattoria Aurelia",
  "servesCuisine": ["Italian", "Roman"],
  "priceRange": "€€",
  "telephone": "+39 06 5555 1234",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Via dei Fori Imperiali 12",
    "addressLocality": "Rome",
    "addressRegion": "RM",
    "postalCode": "00184",
    "addressCountry": "IT"
  },
  "geo": {"@type": "GeoCoordinates", "latitude": 41.8902, "longitude": 12.4922},
  "acceptsReservations": "Yes",
  "openingHoursSpecification": [
    {"@type": "OpeningHoursSpecification", "dayOfWeek": ["Monday","Tuesday","Wednesday","Thursday","Friday"], "opens": "11:30", "closes": "23:00", "validFrom": "2025-01-01"},
    {"@type": "OpeningHoursSpecification", "dayOfWeek": ["Saturday","Sunday"], "opens": "11:30", "closes": "01:00"}
  ],
  "menu": {
    "@type": "Menu",
    "name": "Dinner Menu — Summer",
    "hasMenuSection": [
      {
        "@type": "MenuSection",
        "name": "Pasta",
        "hasMenuItem": [
          {
            "@type": "MenuItem",
            "name": "Cacio e Pepe",
            "description": "Tonnarelli, Pecorino Romano DOP, black pepper",
            "offers": {"@type": "Offer", "priceCurrency": "EUR", "price": 12}
          },
          {
            "@type": "MenuItem",
            "name": "Amatriciana",
            "description": "Guanciale, tomato, Pecorino Romano",
            "offers": {"@type": "Offer", "priceCurrency": "EUR", "price": 13}
          }
        ]
      }
    ]
  },
  "sameAs": [
    "https://maps.google.com/?cid=...",
    "https://www.instagram.com/trattoriaaurelia",
    "https://www.facebook.com/trattoriaaurelia"
  ]
}
</script>
```
Why it works: Retrieval systems can slot-answer queries like “price of cacio e pepe at Trattoria Aurelia” without re-parsing prose.
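You can verify this behavior locally with a micro extractor, stdlib only. This is a sketch: the regex is a shortcut (a real pipeline would use an HTML parser), and the embedded markup is a trimmed stand-in for the full example above.

```python
import json
import re

def extract_jsonld(html: str) -> list:
    """Pull every application/ld+json block out of raw HTML (regex sketch)."""
    pattern = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.S)
    return [json.loads(m) for m in pattern.findall(html)]

def menu_item_price(data: dict, dish: str):
    """Walk Restaurant → Menu → MenuSection → MenuItem; return (price, currency)."""
    for section in data.get("menu", {}).get("hasMenuSection", []):
        for item in section.get("hasMenuItem", []):
            if item.get("name", "").lower() == dish.lower():
                offer = item.get("offers", {})
                return offer.get("price"), offer.get("priceCurrency")
    return None

html = '''<script type="application/ld+json">
{"@type": "Restaurant", "name": "Trattoria Aurelia",
 "menu": {"@type": "Menu", "hasMenuSection": [{"@type": "MenuSection", "name": "Pasta",
   "hasMenuItem": [{"@type": "MenuItem", "name": "Cacio e Pepe",
     "offers": {"@type": "Offer", "price": 12, "priceCurrency": "EUR"}}]}]}}
</script>'''

data = extract_jsonld(html)[0]
print(menu_item_price(data, "cacio e pepe"))  # (12, 'EUR')
```

No prose re-parsing, no guessing: the price is a slot lookup.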
3) Content Chunking → Embedding Recall
Retrieval pipelines embed your content chunk by chunk. Oversized pages hurt recall; too many tiny chunks lose context.
Practice:
- Aim for 300–800 tokens per logical section (roughly 1–4 paragraphs plus a table).
- Use headings and `id` anchors; keep each dish/FAQ in its own subsection.
- Provide permalink anchors (e.g., `/menu#cacio-e-pepe`) so retrievers can cite precise spans.
```html
<section id="cacio-e-pepe">
  <h3>Cacio e Pepe</h3>
  <p>...</p>
</section>
```
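To preview where your chunk boundaries fall, here is a minimal heading-based chunker using only the Python standard library. It is a sketch; production pipelines also token-count chunks and merge tiny ones.

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Split HTML into chunks at <h1>-<h3> boundaries, keeping the enclosing section id."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # [{"anchor": str | None, "text": str}]
        self._buf = []
        self._anchor = None    # id of the most recent <section>
        self._start = None     # anchor in effect when the current chunk began

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._anchor = dict(attrs).get("id")
        if tag in ("h1", "h2", "h3"):
            self._flush()      # a heading starts a new chunk

    def handle_data(self, data):
        self._buf.append(data)

    def _flush(self):
        text = " ".join(part.strip() for part in self._buf if part.strip())
        if text:
            self.chunks.append({"anchor": self._start, "text": text})
        self._buf = []
        self._start = self._anchor

    def close(self):
        super().close()
        self._flush()          # emit the trailing chunk

page = """
<section id="cacio-e-pepe"><h3>Cacio e Pepe</h3>
<p>Tonnarelli, Pecorino Romano DOP, black pepper.</p></section>
<section id="amatriciana"><h3>Amatriciana</h3><p>Guanciale, tomato.</p></section>
"""

chunker = HeadingChunker()
chunker.feed(page)
chunker.close()
print([(c["anchor"], c["text"]) for c in chunker.chunks])
```

Each chunk carries its anchor, which is exactly what a retriever needs to cite `/menu#cacio-e-pepe` rather than the whole page.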
4) Tabular Data → Parseable Tables, Not Pictures
If prices/hours appear in tables, keep them as semantic tables or lists. Avoid images/PDFs.
```html
<table>
  <caption>Dinner Prices</caption>
  <thead>
    <tr><th>Dish</th><th>Price (EUR)</th><th>Allergens</th></tr>
  </thead>
  <tbody>
    <tr><td>Cacio e Pepe</td><td>12</td><td>Milk, Gluten</td></tr>
    <tr><td>Amatriciana</td><td>13</td><td>Milk, Pork, Gluten</td></tr>
  </tbody>
</table>
```
Bonus: mirror the data in JSON-LD so both text and structure exist.
5) Multilingual & Locale Signals → Correct Matching
Query: “Where to eat cacio e pepe near the Colosseum, up to 15 euros?” Make sure your page clarifies language, currency, and timezone.
- Use `lang` on `<html>` and `hreflang` for alternates.
- Normalize currency with ISO codes in JSON-LD (`"priceCurrency": "EUR"`).
- Provide local phone formats, and `openingHoursSpecification` with `validFrom`/`validThrough` if seasonal.
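Normalization is cheap to do at publish time. A small helper that canonicalizes European price strings before they go into JSON-LD (assumption: a lone comma is a decimal mark, which holds for EUR menu prices but not for every locale):

```python
from decimal import Decimal

def normalize_price(raw: str) -> Decimal:
    """Canonicalize '€12', '12,00', '1.234,56' into a Decimal amount.

    Assumes a lone comma is a decimal mark (true for EUR menus, ambiguous
    for locales that use commas as thousands separators).
    """
    s = raw.strip().lstrip("€$£").strip()
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):   # 1.234,56 → European style
            s = s.replace(".", "").replace(",", ".")
        else:                             # 1,234.56 → US style
            s = s.replace(",", "")
    elif "," in s:
        s = s.replace(",", ".")
    return Decimal(s)

print(normalize_price("12,00"), normalize_price("1.234,56"))  # 12.00 1234.56
```

Emit the normalized value in JSON-LD and let the display layer re-localize it; never make a crawler guess which character is the decimal mark.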
6) Canonicalization, Sitemaps, and Change Hints
Retrievers prioritize freshness for operational facts (hours, prices, availability).
- Add `lastmod` to sitemaps and keep it honest.
- Use `<link rel="canonical">` to avoid duplicate menus across UTM’d pages.
- Publish invalidation-friendly URLs: `/menu/summer-2025` rather than `/menu?date=1699`.
- Avoid JS-only rendering for price text; SSR it.
Non‑Trivial, Production‑Level Patterns
A) Disambiguate Similar Entities with `@id` and Stable Anchors
If you have two venues (Trastevere vs. Monti), give each a stable `@id` and specific `geo`.
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@id": "https://trattoriaaurelia.it/locations/monti#restaurant",
  "@type": "Restaurant",
  "name": "Trattoria Aurelia — Monti",
  "geo": {"@type": "GeoCoordinates", "latitude": 41.894, "longitude": 12.494},
  "address": {"@type": "PostalAddress", "addressLocality": "Rome"}
}
</script>
```
Now an AI can answer: “Which branch serves cacio e pepe past midnight?” by joining `MenuItem` data with the Monti branch’s hours.
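That join can be sketched in a few lines. The `servedAt` field mirrors this article’s Dish→Restaurant graph edge and serves as a join key here; it is not an official schema.org property, and the data shapes are illustrative.

```python
# Toy in-memory graph: nodes keyed by @id.
nodes = {
    "https://trattoriaaurelia.it/locations/monti#restaurant":
        {"name": "Trattoria Aurelia — Monti", "closes": "01:00"},
    "https://trattoriaaurelia.it/locations/trastevere#restaurant":
        {"name": "Trattoria Aurelia — Trastevere", "closes": "23:00"},
}
dishes = [
    {"name": "Cacio e Pepe",
     "servedAt": {"@id": "https://trattoriaaurelia.it/locations/monti#restaurant"}},
    {"name": "Cacio e Pepe",
     "servedAt": {"@id": "https://trattoriaaurelia.it/locations/trastevere#restaurant"}},
]

def branches_serving_after(dishes, nodes, dish, hour):
    """Join MenuItem → Restaurant via @id; keep branches still open at `hour`."""
    out = []
    for d in dishes:
        if d["name"] != dish:
            continue
        branch = nodes[d["servedAt"]["@id"]]
        closes = int(branch["closes"].split(":")[0])
        if closes < 12 or closes > hour:  # a closes value like "01:00" means past midnight
            out.append(branch["name"])
    return out

print(branches_serving_after(dishes, nodes, "Cacio e Pepe", 23))
# ['Trattoria Aurelia — Monti']
```

Stable `@id`s are what make this join deterministic; without them, an engine has to fuzzy-match venue names and may pick the wrong branch.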
B) Recipe/Allergen Knowledge → `suitableForDiet` & `RestrictedDiet`
```json
{
  "@context": "https://schema.org",
  "@type": "MenuItem",
  "name": "Cacio e Pepe",
  "suitableForDiet": ["https://schema.org/VegetarianDiet"],
  "menuAddOn": [
    {
      "@type": "MenuItem",
      "name": "Gluten-free pasta",
      "offers": {"@type": "Offer", "price": 2, "priceCurrency": "EUR"}
    }
  ]
}
```
This lets assistants answer “Is their cacio e pepe vegetarian? Can I get gluten-free?” without hallucinating.
C) Temporal Facts → Validity Windows
Menus change. Model pipelines weight recent structured facts.
```json
{
  "@type": "Offer",
  "price": 12,
  "priceCurrency": "EUR",
  "priceValidUntil": "2025-10-01"
}
```
Pair with `sitemap.xml` updates so freshness signals align.
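A retriever-side sketch of that weighting (the helper name is illustrative, and real systems typically down-weight stale facts rather than hard-drop them):

```python
from datetime import date

offers = [
    {"price": 12, "priceCurrency": "EUR", "priceValidUntil": "2025-10-01"},
    {"price": 13, "priceCurrency": "EUR", "priceValidUntil": "2026-10-01"},
]

def fresh_offers(offers, today):
    """Keep only offers whose priceValidUntil has not passed."""
    return [o for o in offers
            if date.fromisoformat(o["priceValidUntil"]) >= today]

print(fresh_offers(offers, date(2025, 12, 1)))  # only the 13 EUR offer survives
```

If you never update `priceValidUntil`, every price on your page eventually looks stale to such a filter, and a competitor’s fresher markup wins the answer.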
D) FAQ Blocks → Extractive Answer Boosters
LLMs love Q&A pairs. Provide FAQPage with concise answers and link to canonical detail.
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Do you accept walk-ins after 11pm?",
      "acceptedAnswer": {"@type": "Answer", "text": "Yes, until 12:30am on weekends."}
    }
  ]
}
</script>
```
E) Robust Images → ALT, Figure Captions, and EXIF Hygiene
- Use `<figure><img alt="Cacio e pepe with Pecorino Romano"><figcaption>…</figcaption></figure>`.
- Keep filenames descriptive (`cacio-e-pepe-pecorino.jpg`).
- Don’t bake text into images; assistants can’t extract it reliably.
F) Events & Reservations → `Event` + Deep Links
```json
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Truffle Week",
  "startDate": "2025-11-02",
  "endDate": "2025-11-09",
  "location": {"@type": "Place", "name": "Trattoria Aurelia — Monti"},
  "offers": {"@type": "Offer", "url": "https://trattoriaaurelia.it/reserve?event=truffle-week"}
}
```
Assistants can now answer “Any truffle events in Rome next week?” and deep-link bookings.
How This Maps to RAG & Embeddings
- Hybrid Retrieval: Systems combine classical IR (BM25) with dense vectors. Your job: make exact tokens discoverable (dish names, neighborhoods, hours) and provide rich context for vectors.
- Chunk Boundaries: Use headings and logical grouping; don’t interleave unrelated content in one paragraph (e.g., lunch and dinner prices together).
- Anchor-able Citations: Provide stable anchors so assistants can attribute precisely - this improves the chance you’re cited.
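A toy version of that hybrid blend. BM25 and learned embeddings are stood in for by exact-token overlap and bag-of-words cosine, and `alpha` is an illustrative mixing weight; this is a sketch of the idea, not a production ranker.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, chunk: str, alpha: float = 0.5) -> float:
    """Blend exact-token overlap (sparse stand-in) with cosine (dense stand-in)."""
    q, c = query.lower().split(), chunk.lower().split()
    overlap = len(set(q) & set(c)) / len(set(q))
    return alpha * overlap + (1 - alpha) * cosine(Counter(q), Counter(c))

menu_chunk = "cacio e pepe tonnarelli pecorino romano 12 eur"
nav_chunk = "home about contact careers press"
query = "cacio e pepe price"
print(hybrid_score(query, menu_chunk) > hybrid_score(query, nav_chunk))  # True
```

The practical takeaway: if the exact dish name, price, and neighborhood tokens appear in the chunk, you score on both channels at once.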
Internal Linking = Graph Strength
Think in graphs: link dish → origin story → ingredient sourcing → allergen policy. Each link with a descriptive anchor boosts entity resolution and gives retrievers semantic hops.
Pitfalls That Break AI Parsing (Seen in Production)
- Menus as PDFs/JPEGs - invisible. Provide HTML + JSON-LD mirror.
- Prices rendered only client-side - crawlers time out on hydration or block XHR; SSR your critical content.
- Infinite scroll for core info - put the essentials above-the-fold in the initial HTML.
- Locale confusion - `12,00` vs `12.00`, ambiguous timezones. Normalize in JSON-LD.
- Over-nested div soup - no semantic hints; extractors miss the main content.
- Duplicated pages without canonical - embeddings split authority; citations fragment.
A Concrete Checklist
- `<html lang>` set; localized alternates with `hreflang`.
- SSR/SSG for critical text (menu, hours, address, phone).
- `<main>`, `<article>`, correct H1/H2 hierarchy.
- schema.org JSON-LD for your vertical (Restaurant, Product, Event, FAQ, Review).
- Each item (dish, product) gets its own `<section id="…">` and anchor link.
- Tables are `<table>`, not screenshots.
- Sitemap with `lastmod`; stable canonical URLs.
- Use `@id`, `sameAs`, `geo`, `openingHoursSpecification`.
- Provide allergens and dietary flags.
- Test with multiple parsers (Google Rich Results Test + a Readability clone + your own micro RAG script).
Bonus: Build a Tiny RAG Parser to “Think Like an Answer Engine”
You can locally simulate how assistants will see your page:
- Fetch your rendered HTML (SSR).
- Run Readability (or a Node clone) to extract main content.
- Parse JSON-LD via a micro schema.org extractor.
- Chunk by headings; embed each chunk (any open-source SentenceTransformer).
- Ask queries; retrieve top-k; verify the text is sufficient without images/JS.
Once your page answers your own local RAG, you’ve cleared the biggest hurdle.
Other Vertical Playbooks
- E-commerce: `Product`, `Offer`, `AggregateRating`, `Review`, `ItemAvailability`, `Brand`, `GTIN`, color/size variants as separate `@id` nodes.
- Events: `Event` with `startDate`/`endDate`, `eventAttendanceMode`, `location.geo`, ticket `Offer` with currency and `availabilityEnds`.
- Local Services: `LocalBusiness` with `Service` items (`areaServed`, `hasOfferCatalog`).
- Docs & APIs: `TechArticle`, `HowTo`, stable permalinks per endpoint, labeled code blocks, and `FAQPage` for common errors.
Metrics and Monitoring
- Track assistant referrals (UTMs per assistant, e.g., `?ref=assistant-chatgpt`).
- Measure crawl frequency via log analysis (user agents like “GPTBot”, “CCBot”, “PerplexityBot”).
- Watch index freshness: time from deploy → assistant mentions the updated price.
- A/B test schema richness on a subset of pages.
If you can’t measure whether assistants picked you as a source, you’re guessing.
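Crawl measurement can start from raw access logs. A tallying sketch (substring match on the user-agent field; the log lines below are fabricated examples, and the bot list needs ongoing maintenance):

```python
from collections import Counter

# Known AI crawler user-agent tokens (non-exhaustive; keep this list updated).
AI_BOTS = ("GPTBot", "CCBot", "PerplexityBot", "ClaudeBot", "Google-Extended")

def count_ai_crawls(log_lines):
    """Count hits per AI crawler from raw access-log lines."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

log = [
    '1.2.3.4 - - [01/Jul/2025] "GET /menu HTTP/1.1" 200 "-" "Mozilla/5.0 GPTBot/1.1"',
    '5.6.7.8 - - [01/Jul/2025] "GET /menu HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '9.9.9.9 - - [01/Jul/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0"',
]
print(count_ai_crawls(log))
```

Trend these counts per URL over time: a deploy that never triggers a crawl spike is a deploy the assistants haven’t seen yet.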
Ethical & Practical Notes
- Respect accessibility: what helps screen readers helps machines.
- Avoid deceptive markup - search engines already issue manual actions for spammy structured data, and answer engines will follow.
- Publish privacy-respecting structured data (no dark patterns in JSON-LD).
Semantic Markup as an AI Contract
Semantic HTML and structured data are not cosmetics; they’re a contract with machine readers. When you honor it - clear semantics, honest structure, fresh metadata - you’re easy to retrieve, easy to cite, and hard to hallucinate.
Good markup today is discoverability tomorrow. Ship it.