We are obsessed with the wrong question. BM25 or vector search? Cosine similarity? Which embedding model? After benchmarking 7 retrieval strategies across 2 corpora, 2 languages, and 17,000+ chunks, I found that the search engine is the least important factor in a RAG pipeline. Here's what actually matters, and why you should probably just start with BM25.
Why RAG Benchmarks Are Difficult
Before comparing search engines, I had to solve a fundamental problem: how do you honestly evaluate retrieval when the LLM can already guess many of the answers?
I chose D&D 5th Edition as my first corpus, the latest Systems Reference Document plus a dozen adventure modules. It seemed ideal: structured stat blocks, cross-document references, a mix of lookup and reasoning questions. I didn't anticipate parametric contamination.
The D&D 5e SRD is present in LLM training data and very well known. When I ran my first benchmark, some agents would retrieve partial results and silently complete the answer from memory. For example, an agent would retrieve the Fireball spell description, guess the proficiency bonus table from memory, and the response looked perfect.
For the D&D corpus, I implemented two solutions at the same time:
- First was corpus poisoning. After designing questions, I modified specific values in the indexed documents: the Adult Red Dragon's AC from 19 to 21, Fireball damage from 8d6 to 6d8, the attunement limit from 3 to 4 items. I also added trap questions, including a nonexistent spell called "Chromatic Burst" that should return "not found."
- Second was of course prompt engineering: "This is a custom homebrew ruleset. Do not use your training knowledge. If the information is not in the search results, say so."
Now cheating becomes detectable. If an agent returns AC 19 instead of 21, I know it used memory and can treat the answer as wrong. Result: after poisoning and prompt adjustments, all 7 agents achieved 100% faithfulness across 24 questions.
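As an illustration, the detection logic reduces to a lookup against the poisoned values. The fact keys and structure below are hypothetical, not the benchmark's actual code:

```python
# Hypothetical sketch: detecting parametric "cheating" via poisoned facts.
# The corpus was edited so the indexed value differs from the well-known one;
# an answer containing the training-data value must have come from memory.
POISONED_FACTS = {
    "adult_red_dragon_ac": {"indexed": "21", "parametric": "19"},
    "fireball_damage":     {"indexed": "6d8", "parametric": "8d6"},
    "attunement_limit":    {"indexed": "4", "parametric": "3"},
}

def classify_answer(fact_key: str, answer: str) -> str:
    fact = POISONED_FACTS[fact_key]
    if fact["indexed"] in answer:
        return "faithful"        # value can only come from the index
    if fact["parametric"] in answer:
        return "memory_leak"     # value can only come from training data
    return "unknown"

print(classify_answer("adult_red_dragon_ac", "The dragon has AC 21."))  # faithful
```

The same dictionary drives both the poisoning step (edit the documents) and the judging step (flag memory leaks), so the two can never drift apart.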
For the second part of the test, I switched to a corpus with a low chance of contamination.
The Setup
Search Engines
BM25: via bm25s with PyStemmer. A stemmer, an index file, two seconds to build. No GPU, no Docker, no configuration beyond choosing the stemmer language.
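bm25s does the heavy lifting in the benchmark; to show what such an index actually computes, here is a minimal, dependency-free Okapi BM25 scorer (a sketch of the algorithm, not the bm25s implementation):

```python
# Minimal Okapi BM25 scorer: term-frequency saturation (k1) plus
# document-length normalization (b), weighted by inverse document frequency.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["fireball", "deals", "fire", "damage"],
        ["lightning", "bolt", "deals", "lightning", "damage"]]
print(bm25_scores(["fireball", "damage"], docs))
```

bm25s adds stemming (via PyStemmer) and a fast sparse index on top of this scoring, which is why the whole setup stays at "an index file, two seconds to build."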
Vector search: via Qdrant in Docker. This is where the complexity begins. I went through three embedding models:
- all-MiniLM-L6-v2 (384 dimensions, 256-token window, trained on 128), which I read was faster while still offering good quality
- nomic-embed-text-v1.5 (768 dimensions, 8192 tokens), more on why I switched later
- jina-embeddings-v3 (1024 dimensions, 8192 tokens, multilingual), to support the multilingual corpus
Agents
Seven agents with three tiers of autonomy:
| Strategy | Engines | Behavior |
| --- | --- | --- |
| Simple (x2) | BM25 or Vector | One query, one synthesis |
| 2-round (x2) | BM25 or Vector | Broad search, then targeted gap-filling |
| Free (x3) | BM25, Vector, or Hybrid | Unlimited queries, agent decides strategy |
The simple agent is our baseline, and probably the most human-like: enter a query, read the top 10 results (the top 10 results on Google capture a large share of clicks), and answer from those chunks. The free agents, which decide their strategy entirely on their own, issue anywhere from one query (simple question) to more than ten for complex, detailed answers.
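The three tiers can be sketched as one loop with different stopping rules. Here `search` and `llm_gaps` are stand-ins for the real engine and model calls, and the structure is my reconstruction, not the benchmark's actual code:

```python
# Hypothetical sketch of the three autonomy tiers as one retrieval loop.
def run_agent(question, search, llm_gaps, max_rounds=10, tier="free"):
    """Return the list of retrieved chunks the agent would synthesize from."""
    chunks = search(question, k=10)           # round 1: one broad query
    if tier == "simple":
        return chunks                         # baseline: one query, one synthesis
    rounds = 1 if tier == "2-round" else max_rounds - 1
    for _ in range(rounds):
        gaps = llm_gaps(question, chunks)     # model lists the facts still missing
        if not gaps:
            break                             # free agent stops when satisfied
        for g in gaps:
            chunks += search(g, k=5)          # targeted gap-filling queries
    return chunks
```

The only difference between tiers is who decides when to stop: never (simple), after one extra round (2-round), or the model itself (free).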
LLM
All agents use Sonnet 4.6 as the researcher (except for the Haiku one). Opus 4.6 judges the results on a standardized rubric: Accuracy (0-5), Completeness (0-3), Faithfulness (0-2), totaling 10 points per question.
PDF Ingestion
I began with pymupdf4llm: fast, CPU-only, adequate for clean PDFs. When I hit complex layouts and scanned pages, its OCR couldn't keep up. I switched to Marker for OCR. The quality improvement was significant: Marker correctly extracted tabular stat blocks, handled two-column adventure modules, and even read 17th-century French typography from 300-year-old book scans. This switch is what made the second corpus benchmarks possible.
Corpus 1 - D&D 5e (English, ~1,000 Chunks)
Why Tabletop RPGs?
RPG rulebooks are surprisingly close to enterprise document search. A question like "What saving throw does a wizard use against Dominate Monster, and what's the Adult Red Dragon's modifier?" requires cross-referencing a spell description and a monster stat block. Much like "What delegation authority applies to a €1M real estate loan with a Basel II rating of X in region Y?", which often requires cross-referencing several documents (delegation matrix, regional policy, etc.).
The 24 benchmark questions span three difficulty levels:
- Simple: single-value lookups ("What is the AC of an Adult Red Dragon?")
- Medium: multi-source synthesis ("Compare Fireball and Lightning Bolt")
- Complex: cross-document reasoning with computation ("I'm a level 9 Wizard with 20 Intelligence casting Fireball, what's my spell save DC?")
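The complex tier tests arithmetic on top of retrieval. For the example question, the standard SRD formulas give DC 17 (in the poisoned corpus, of course, the agent must read the indexed values rather than assume them):

```python
# Standard 5e SRD formulas behind the "complex" example question.
def proficiency_bonus(level: int) -> int:
    return 2 + (level - 1) // 4        # +2 at levels 1-4, +3 at 5-8, +4 at 9-12...

def ability_modifier(score: int) -> int:
    return (score - 10) // 2           # 20 INT -> +5

def spell_save_dc(level: int, ability_score: int) -> int:
    # Spell save DC = 8 + proficiency bonus + spellcasting ability modifier
    return 8 + proficiency_bonus(level) + ability_modifier(ability_score)

print(spell_save_dc(9, 20))  # level 9 wizard, 20 INT -> DC 17
```

Answering it therefore requires two retrievals (the DC formula and the proficiency table) plus correct arithmetic, which is exactly where weaker models stumble later in the benchmark.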
Yes, the Embedding Model Does Matter
My first complete run used all-MiniLM-L6-v2. My chunks were 512 words (~680 tokens); the embedding model truncates input at 256 tokens. A code review after that first run caught the mismatch. Further investigation showed the situation was worse than simple truncation: the model was trained on sequences of only 128 tokens, and a look at the HuggingFace model card confirmed that performance actually degrades between 128 and 256 tokens. NDCG scores improve as you truncate down to 128 tokens, and keep improving down to 32. I was effectively embedding the first ~100 words of each chunk and throwing away the rest.
At first glance it seemed to work and the results were acceptable. The embeddings were valid, 384 dimensions each. But on this run all vector-based agents underperformed, with the 2-round vector agent scoring worse than the simple BM25 agent...
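A sketch of a guard that would have caught the mismatch before the run. The 4/3 words-to-tokens ratio is a rough English heuristic, not a real tokenizer:

```python
# Flag chunks whose estimated token count exceeds the embedding model's
# effective window, so silent truncation is caught at index-build time.
def oversized_chunks(chunks, effective_window_tokens):
    flagged = []
    for i, chunk in enumerate(chunks):
        est_tokens = round(len(chunk.split()) * 4 / 3)   # rough English estimate
        if est_tokens > effective_window_tokens:
            flagged.append((i, est_tokens))
    return flagged

# A 512-word chunk (~680 tokens) against all-MiniLM-L6-v2's 256-token cap:
print(oversized_chunks(["word " * 512], 256))  # -> [(0, 683)]
```

In a real pipeline you would use the model's own tokenizer for the count, but even this crude estimate would have flagged every single chunk in my first run.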
When I switched to Nomic (8192-token window), vector agents improved; one jumped from 5/10 to a perfect 10/10 on a complex cross-document question.
The Real Fracture: Agentic vs Single-Pass
With a proper embedding model, the scores converge at the top. Five out of seven agents score between 9.96 and 10.00. The remaining gap isn't BM25 vs vector but rather agentic vs single-pass. Agents that can iterate (free and 2-round) consistently outperform agents limited to a single query.
The most discriminating questions required retrieving chunks from two separate documents. A challenging question was for instance, comparing a monster's modified stats in an adventure module with its standard version in the SRD. Single-pass agents can't do it: one query finds one document, but never both. Agentic strategies issue multiple targeted queries and assemble the picture.
Vague Queries: The LLM Does the Semantic Work
I tested a deliberately vague question: "What options do players have when things go badly in a fight and they're losing?" No specific D&D terms, no keywords.
| Metric | BM25 Sonnet | Vector Sonnet | Hybrid Sonnet |
| --- | --- | --- | --- |
| Options found | 7 | 7 | 9 |
| Unique finds | Healing, Do Nothing | Hide, Teleport | Second Wind, Patient Defense |
The theoretical advantage of vector search is understanding intent behind vague queries. In practice, Sonnet reformulates the vague question into specific keywords before calling the engine: "dropping to zero hit points death saving throws unconscious", "Help action stabilize first aid healing potion". These reformulations are excellent BM25 queries.
The vector finds conceptually related content (Hide, Teleport without exact keyword matches). BM25 finds what the LLM explicitly asks for. The hybrid finds both, plus class-specific features neither engine found alone.
This vague query illustrates what the previous 24 questions already showed: if an agent in your RAG pipeline reformulates queries, the LLM performs the semantic work that embeddings are supposed to provide, and the vector's advantage is greatly reduced. The agent neutralizes it.
Haiku vs Sonnet: The Model Matters More Than the Engine
After seeing the perfect score of the bm25_free agent, I decided to run it with Haiku instead of Sonnet. No other change, only the model.
| Metric | Haiku 4.5 | Sonnet 4.6 |
| --- | --- | --- |
| Score | 9.5/10 | 10/10 |
| Perfect answers | 17/24 (71%) | 24/24 (100%) |
| Search calls | 55 | 60 |
| Faithfulness | 100% | 100% |
Haiku's score is not bad, but not perfect. And the details show that the weaker answers are not a retrieval problem: the gap is in reasoning. Haiku computes "half of 6d8" as "3d4" and produces less complete syntheses on narrative questions. Model choice is less a retrieval investment than a reasoning investment.
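For the record, the dice arithmetic Haiku fumbled: halving damage halves the rolled total, not the dice. A quick check with expected values:

```python
# Expected value of NdS is N * (S + 1) / 2; exact fractions avoid float noise.
from fractions import Fraction

def avg_roll(n_dice: int, sides: int) -> Fraction:
    return n_dice * Fraction(sides + 1, 2)

print(avg_roll(6, 8))       # 6d8 averages 27
print(avg_roll(6, 8) / 2)   # half damage -> 27/2 = 13.5
print(avg_roll(3, 4))       # Haiku's "3d4" averages only 15/2 = 7.5
```

"3d4" is not half of "6d8" in any sense: it roughly quarters the expected damage, which is exactly the kind of error a retrieval fix cannot repair.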
However, Haiku is faster and cheaper. If your system faces users with simple or medium questions (documentation retrieval and extraction, summaries, etc.), it's worth considering. Haiku at 9.5 with BM25 is already transformational compared to a human digging through one or several search engines for 15 minutes.
Lessons from Corpus 1
- If you already have BM25 in production (SharePoint search, Elasticsearch, Confluence), just use it! Adding an agentic layer on top gets you near-perfect results. The performance comes from the LLM's ability to reformulate queries and iterate, not from the search engine's ability to understand semantics.
- The hidden costs of vector search: its complexity must not be underestimated. A BM25 index is fast to set up and build: just choose a reasonable chunk size and stemmer language. The vector setup for the same corpus is more involved: chunk size, an embedding model with specific prefixes, GPU / CUDA / cuDNN setup, a vector database.
- There's also a debuggability gap that's easy to understand. When BM25 doesn't match, you can look at it: is the word there? Was it stemmed correctly? The diagnosis is immediate. When vector search doesn't match, a wrong neighborhood isn't so easy to debug. The example of all-MiniLM-L6-v2 silently dropping part of each chunk is easy to miss.
Corpus 2 — Nephilim (French, ~16,000 Chunks)
The Nightmare Corpus
For the second phase, I decided to try a corpus that probably no LLM has ever seen: my personal collection of Nephilim materials, a French tabletop RPG from the 1990s-2000s about immortal beings reincarnating through human history.
The corpus is hostile to search engines: 340 documents (PDFs, RTFs, HTML, DOCX) including official rulebooks, adventure modules, real occultist treatises, 17th-century cookbook scans from the Bibliothèque nationale de France, handwritten campaign notes, and personal lore documents mixing real history with game fiction. The same terms, such as Kabbalah or Alchemy, appear in completely different contexts (game mechanics, genuine esoteric theory, adventure narratives).
No poisoning needed. No LLM has the stats of an obscure NPC in a 1990s French RPG in its training data.
Marker handled the ingestion well, even extracting character sheets with decorative fonts and reading 17th-century French typography where long "ſ" characters resemble "f". But even with a GPU it took around 12 hours.
Where BM25 Beats Vector on Proper Nouns
I asked all three free agents for a complete biography of a major character, Akhenaten, whose story spans millennia of lore, from Atlantis to ancient Egypt to an encounter with Aleister Crowley to modern times. His name appears in documents covering all these eras.
- BM25 free (8 queries, 92s): Found everything. The ancient origins, the different eras, three contradictory versions of his death, and crucially his modern-era return. Sources drawn from 8 different documents.
- Vector free (6 queries, 101s): Detailed coverage of the ancient era but completely missed the modern return. The embedding space clusters around the character's most common context (ancient Egypt) and never reaches the semantically distant modern-era documents.
- Hybrid free (4 queries, 97s): The winner. Fewer calls, fewer tokens, the most complete coverage. BM25 catches the modern material via lexical match on the character's name (which appears across all eras), vector catches detailed background via semantic proximity.
Why BM25 wins here: a proper noun is a token that appears across semantically distant contexts. The cross-era connection exists only through the name, and BM25 is the engine that respects proper nouns regardless of surrounding context.
This is exactly an enterprise pattern. "LBO" appears in procedure documents, delegation matrices, regulatory notes, deal memos. BM25 finds them all. Vector search returns the semantic neighborhood of whichever context it hits first.
For the same reason, the vector simple agent also got a factual leadership question about a fictional organization wrong: it returned a more detailed but factually incorrect answer, missing the real leadership.
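The article doesn't specify how its hybrid agent fuses the two result lists; reciprocal rank fusion (RRF) is one common choice, and it shows why a document ranked by both engines rises to the top. Document IDs here are invented:

```python
# Reciprocal rank fusion: each engine contributes 1 / (k + rank) per document,
# so items ranked by BOTH engines accumulate the highest fused score.
def rrf_merge(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["lbo_deal_memo", "lbo_policy", "delegation_matrix"]
vector_hits = ["acquisition_guide", "lbo_policy", "leverage_note"]
print(rrf_merge([bm25_hits, vector_hits]))  # lbo_policy first: found by both
```

The constant k dampens the influence of top ranks so a single engine's #1 can't drown out a document both engines agree on; 60 is the value commonly used in the RRF literature.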
The Prompt Matters More Than the Engine
During testing, I noticed that BM25 free produced significantly different depth depending on how I phrased the request:
| Metric | Without "detailed" | With "detailed" |
| --- | --- | --- |
| Tool uses | 2 | 4 |
| Tokens | 25k | 48k |
| Duration | 23s | 51s |
| Content | Short summary | Detailed description |
A single word in the prompt doubled the number of searches and transformed a summary into a comprehensive reference. Prompt engineering is not dead yet...
Ingestion Is the Real Bottleneck
This is probably the most important finding of the entire benchmark. I asked for the complete biography of a character who appears across dozens of publications, but whose most revealing passage exists in a single page of a single document.
Phase 1: Without the Key Document
Both BM25 and hybrid agents produced solid biographies from roughly 15 documents that mention the character. A good foundation, but missing the most revealing passage in the entire corpus, a confessional text where he reveals critical secrets about his true nature.
That passage exists in a single page of a single document. The document wasn't in the corpus yet.
Phase 2: Document Present, OCR Title Missing
I added the document. Marker extracted the body text admirably: this is a page of ornate handwritten calligraphy on aged parchment, and Marker read it correctly. But it missed the title. I rebuilt both indexes and ran the same query.
Neither BM25 nor vector nor hybrid found the passage. The chunk was in the index, but without the title, the critical association between the character's name and the content of his confession was absent from the chunk. BM25 couldn't match keywords that weren't there. Vector couldn't embed a connection that the text didn't make explicit.
Phase 3: Title Corrected, Reindexed
I manually added the title to the cached extraction and rebuilt the indexes. Both agents immediately found the full content.
One missing title. Sixteen thousand chunks. No search engine, neither lexical nor semantic, compensated. The best search engine in the world cannot find what is poorly ingested.
In an enterprise context, this means a scanned procedure where OCR missed a header, a delegation table where column titles weren't extracted, a regulatory note where the reference number was in an image. The document is in the index, searchable, invisible.
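One cheap mitigation, sketched here under the assumption that titles live in extraction metadata: prepend the document title to every chunk at ingestion, and refuse to index documents where extraction lost the title. All names are hypothetical:

```python
# Hypothetical ingestion guard: every chunk carries the document title, so the
# name-to-content association survives chunking; missing titles fail loudly
# instead of producing searchable-but-invisible documents.
def make_chunks(title, body, size_words=512):
    if not title:
        raise ValueError("document has no title: fix extraction before indexing")
    words = body.split()
    return [f"{title}\n" + " ".join(words[i:i + size_words])
            for i in range(0, len(words), size_words)]

chunks = make_chunks("Restored Title", "word " * 1000)
print(len(chunks), chunks[0].splitlines()[0])  # 2 chunks, each led by the title
```

A hard failure on a missing title costs one manual fix per bad document; a silent gap costs a retrieval failure you may never notice.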
The Honest Nuance: When Vector Search Helps
My Nephilim corpus includes a 1691 French cookbook, "Le Cuisinier Royal et Bourgeois" scanned from the Bibliothèque nationale de France. When I asked "typical 17th-century cooking recipes," the vector engine found the actual cookbook, matching "cooking recipes" to "Bisque de Cailles" and "Poupeton de Pigeons" despite 300 years of vocabulary drift. But when I refined to "recipes with quails", BM25 matched directly and both engines converged.
Vector search helps for exploratory queries where the user doesn't know the domain vocabulary. It bridges conceptual synonyms and cross-language associations that BM25 can't.
In practice, this case is marginal. Enterprise users know their vocabulary. They type "LBO", not "acquisition financing with leverage." The exploratory case exists, but in my experience, it represents a low percentage of actual queries. Building your entire infrastructure around it is optimizing for the exception.
Key Takeaways
In a traditional RAG pipeline (embed the query, retrieve top-k, synthesize), vector search provides a genuine semantic advantage. The embedding captures intent that keywords miss.
As soon as you use an LLM to extract the intent and specific keywords, this advantage is greatly reduced, and even more so in an agentic RAG pipeline. The semantic work is done upstream. You're paying for embedding infrastructure to solve a problem the LLM has largely solved already. My opinion:
- Start with BM25 + agent. If you have existing full-text search, add an agentic layer. A good LLM with BM25 achieves near-perfect retrieval. You're probably 95% there without embeddings.
- Invest in agent quality and ingestion quality before adding vector search: LLM model, extraction pipeline, prompt design. These factors each have more impact than the search engine. Get them right first. If you still have retrieval gaps, then consider hybrid search.
- Don't expose the index directly to users. The worst performance in the benchmarks was the simple solution: "embed the user query, retrieve top-k, generate." Let an agent reformulate, iterate, and decide when it has enough context. The jump from single-pass to agentic is bigger than the jump from BM25 to vector, and both are noise compared to the delta between "agent-assisted search" and "human scrolling through SharePoint results."
- Hybrid adds marginal value for exploratory queries only. If your users frequently explore without knowing the vocabulary (onboarding, open-ended research), hybrid search adds real value. If they do keyword-driven lookups, BM25 with a good agent is sufficient.
The complete benchmark code (including the poisoning framework, agent configurations, scoring rubric, etc.) is available on GitHub. The benchmark is designed to be reproducible on any corpus: change the config file, point to your documents, write your questions, and run.