Chunk-based RAG is broken for structured documents. The fix is simpler than you think - and faster than the original.


A few weeks ago, I came across an article by Agent Native about vectorless RAG. The framing stuck with me: most RAG systems turn documents into “semantic confetti” — chunk everything, embed everything, then hope an ANN search surfaces the right bits. For large document bases, this becomes semantic hide-and-seek across thousands of chunks, burning tokens and confidently hallucinating near the answer.

Digging deeper, I found PageIndex from VectifyAI, which implements exactly this alternative. Instead of embedding chunks, it treats the document’s own heading structure as the retrieval primitive. Represent the document as a hierarchical tree, hand the outline to your LLM, let it navigate to the right section, pull that section’s text. No embeddings. No ANN. Just the document telling you how it’s organized.

I had been building agents over financial documents and hitting exactly this problem. I tried PageIndex, it worked, and then I rewrote it in Rust.

This is the story of what happened.

Why chunk-based RAG fails on structured documents

Take a 10-K filing. It has a section on risk factors, inside which there’s a subsection on liquidity risk, inside which there’s a paragraph about covenant breaches. When you split this into 512-token chunks, those three levels of context get shattered. The chunk about covenant breaches no longer knows it’s inside liquidity risk, which is inside risk factors.
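To make the failure concrete, here is a toy sketch (hypothetical document and chunker, not any particular RAG library) of fixed-size chunking discarding heading context:

```python
# Toy document with the three-level structure described above.
doc = (
    "# Risk Factors\n"
    "## Liquidity Risk\n"
    "The company maintains revolving credit facilities with several banks.\n"
    "A breach of financial covenants would accelerate repayment."
)

def chunk(text, size):
    # Naive fixed-size splitting, blind to headings.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(doc, 60)
# Later chunks carry no trace of the "Risk Factors" / "Liquidity Risk"
# hierarchy they came from.
```

The later chunks mention covenants and repayment but have lost both enclosing headings, which is exactly the context the retriever needs.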

At query time, “what are the company’s covenant breach risks” might surface three chunks from different sections that share vocabulary but don’t form a coherent answer. The retrieval is technically close but contextually wrong. You end up with an LLM that has all the right words and none of the right context.

Structured documents — financial reports, legal filings, technical manuals, research papers — already tell you how they’re organized. Every heading is a natural retrieval boundary. PageIndex just respects that structure.

How PageIndex works

The approach is straightforward. Parse the markdown document into a tree of nodes, one per heading. Each node holds its title, body text, and children. Generate a compact outline of the tree. At query time:

  1. Send the outline to your LLM with the question
  2. Ask it to return the node ID of the most relevant section
  3. Fetch that node directly
  4. Pass the node’s text to your LLM for the final answer

The outline looks like this:

[1] Annual Report 2023
[1.1] Financial Highlights
[1.2] Risk Factors
[1.2.1] Market Risk
[1.2.2] Liquidity Risk
[1.2.3] Regulatory Risk
[1.3] Management Discussion

The LLM reads this and says “1.2.2” — you fetch that node and you’re done. Precise, explainable, and no embedding infrastructure required.

VectifyAI’s Mafin 2.5 system, powered by PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark. That’s the practical proof that the approach works at scale.

Why I rewrote it in Rust

A few reasons. I had already built fastrustrag — a Rust library for document deduplication that achieved 8–121x speedups over Python’s datasketch — so I had the toolchain and the workflow ready. I was also skeptical that the Python implementation would hold up under load, specifically for the index build and node retrieval operations that happen on every query.

Before writing a line of Rust I validated that there was actually a performance problem worth solving. The methodology I’ve been using for these projects: always benchmark the Python implementation first, identify the bottleneck, then build the Rust version. Don’t rewrite things for fun.

For PageIndex specifically, the bottleneck I expected was node retrieval. The Python library stores nodes in a flat list and does a linear scan to find a node by ID. That’s O(n). At 28 nodes it’s fine. At 765 nodes across a large document corpus it becomes measurably slow and, more importantly, wildly inconsistent at the tail.
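The difference is easy to picture with plain Python dicts (a hypothetical stand-in for the real node objects):

```python
# A flat list of 765 fake nodes, mirroring the large-corpus benchmark.
nodes = [{"id": str(i), "title": f"Section {i}"} for i in range(765)]

def find_linear(nodes, node_id):
    # O(n): scan the flat list until the ID matches.
    for node in nodes:
        if node["id"] == node_id:
            return node
    return None

# O(1): build a hash index once, then every lookup is a single probe.
by_id = {node["id"]: node for node in nodes}

assert find_linear(nodes, "764") is by_id["764"]
```

The linear scan's cost grows with corpus size and varies with where the target node sits in the list, which is where the tail-latency inconsistency comes from.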

Building pageindex-rs

The Rust implementation follows the same architecture: parse markdown into a tree, assign dot-notation node IDs (1.2.3 rather than 0012), store nodes in a HashMap for O(1) lookup, expose everything to Python via PyO3.

The dot-notation IDs turned out to matter more than I expected. When you show an LLM an outline with IDs like 1.2.3, it immediately understands the hierarchy — 1.2.3 is a child of 1.2, which is a child of 1. With zero-padded sequential IDs like 0012, the LLM just sees a number with no structural signal. This affected retrieval accuracy in the benchmarks, which I’ll get to.
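A minimal sketch of how such IDs fall out of the tree structure (hypothetical dict-based nodes; the actual assignment lives in the Rust code):

```python
def assign_ids(children, prefix=""):
    # Number siblings 1, 2, 3... and prepend the parent's ID with a dot.
    for i, child in enumerate(children, start=1):
        child["id"] = f"{prefix}.{i}" if prefix else str(i)
        assign_ids(child.get("children", []), child["id"])

tree = [
    {"title": "Annual Report 2023", "children": [
        {"title": "Financial Highlights"},
        {"title": "Risk Factors", "children": [
            {"title": "Market Risk"},
            {"title": "Liquidity Risk"},
        ]},
    ]}
]
assign_ids(tree)
# "Liquidity Risk" gets ID "1.2.2": child 2 of "Risk Factors" (1.2),
# which is child 2 of the root (1).
```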

The Python API looks like this:

import pageindex_rs
index = pageindex_rs.PageIndex.from_file("annual_report", "report.md")
# Feed this to your LLM
print(index.outline())
# [1] Annual Report 2023
# [1.1] Financial Highlights
# [1.2] Risk Factors
# [1.2.1] Market Risk
# [1.2.2] Liquidity Risk
# Fetch the node your LLM returned
node = index.get_node("1.2.2")
print(node.title) # Liquidity Risk
print(node.text) # The company's liquidity position…
print(node.breadcrumb) # ['Risk Factors', 'Liquidity Risk']
# Get a full section with all subsections merged
section = index.get_node_with_children("1.2")

The retrieval loop is a handful of lines:

outline = index.outline()
node_id = llm(f"""
Document outline:
{outline}
Question: {user_query}
Return only the node_id of the most relevant section. Nothing else.
""").strip()
result = index.get_node(node_id)
# Pass result.text to your LLM for the final answer
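In practice, LLMs don’t always return the ID alone. A small hedge (a hypothetical helper, not part of pageindex-rs) is to extract the first dot-notation token from the reply before calling get_node:

```python
import re

def extract_node_id(reply):
    # Pull the first dot-notation ID (e.g. "1.2.2") out of a noisy reply.
    # Note: this grabs the first numeric token, so it assumes the reply
    # doesn't lead with unrelated numbers.
    match = re.search(r"\d+(?:\.\d+)*", reply)
    return match.group(0) if match else None
```

For example, `extract_node_id("The most relevant section is 1.2.2.")` returns `"1.2.2"`, and a reply with no ID at all returns `None` so you can retry.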

The benchmarks

I ran three benchmark suites across three document sizes — a 42KB single article, a 395KB multi-article corpus, and a 1055KB large corpus. 500 iterations per build test, 1000 random lookups per retrieval test. The full notebook is in the repo.
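For reference, the shape of such a harness (a simplified sketch, not the actual notebook code):

```python
import statistics
import time

def bench(fn, iterations=1000):
    # Time repeated calls and report mean / p99 / max in milliseconds.
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "mean": statistics.mean(samples),
        "p99": samples[int(len(samples) * 0.99) - 1],
        "max": samples[-1],
    }
```

Sorting the samples and reading off the 99th-percentile entry is what surfaces the tail-latency spikes that a mean alone would hide.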

Index build speed

Document size    Rust mean    Python mean    Speedup
42 KB            0.207 ms     0.153 ms       0.74x ❌
395 KB           0.873 ms     1.369 ms       1.57x
1055 KB          2.549 ms     4.278 ms       1.68x

Below ~200KB, PyO3 FFI overhead cancels the parsing speedup — Rust actually loses at small scale. I’m reporting this honestly because benchmarks that only show wins aren’t useful. At realistic document sizes the picture flips.

The more important number is consistency. This is what production systems actually care about:

Document size    Rust p99    Python p99    Rust max    Python max
42 KB            1.3 ms      0.2 ms        17.4 ms     0.4 ms
395 KB           1.1 ms      1.5 ms        1.3 ms      1.6 ms
1055 KB          2.8 ms      21.0 ms       3.7 ms      42.9 ms

At 1055KB, Python’s p99 is 21ms and its max is 42ms. Rust’s p99 is 2.8ms and max is 3.7ms. Python’s standard deviation at that size is 2.78ms versus Rust’s 0.10ms — 27x more variable. In a pipeline processing hundreds of documents those spikes accumulate into real latency.

Node retrieval speed

This is where the O(1) vs O(n) gap shows most clearly:

Document size    Nodes    Rust mean    Python mean    Speedup
42 KB            28       0.0072 ms    0.0060 ms      0.83x
395 KB           261      0.0119 ms    0.0272 ms      2.29x
1055 KB          765      0.0216 ms    0.0686 ms      3.18x

At 28 nodes, linear scan is fast enough that the HashMap overhead tips Rust slightly negative. At 765 nodes, Rust is 3.18x faster. The gap keeps widening — at 5000 nodes in a combined corpus it would be around 10x.

Answer accuracy

I tested both implementations on 10 financial questions against a ~3MB document corpus, using the same LLM for each:

Implementation        Correct
pageindex-rs          9 / 10
PageIndex (Python)    7 / 10

The accuracy difference comes down to node ID format. 1.2.3 gives the LLM structural signal for free. 0012 does not. Small design decisions compound.

What I learned

Benchmark before you build. The small document results prove that Rust isn’t automatically faster — FFI overhead is real and it dominates at small scales. If your documents are consistently under 200KB, the Python library is probably fine.

Consistency matters more than mean speed. The headline speedup numbers are nice, but the stdev and p99 tell the real story for production. A system that’s 1.68x faster on average and has 27x lower standard deviation is a much better choice than the mean alone suggests.

Node ID design affects LLM behavior. I didn’t expect the dot-notation change to move accuracy by two questions out of ten, but it did. How you present structure to an LLM matters in ways that are hard to predict without actually running the experiment.

Try it

pip install pageindex-rs

Thanks for reading 😄