RAG went from research paper acronym to “every slide deck ever” in about a year. The pattern is simple on paper:

Embed documents → embed question → find similar chunks → feed to LLM.

But if you’ve tried building something non-trivial (say, a domain expert assistant or an internal knowledge bot), you already know the bad news: naive similarity search retrieves the wrong chunks, misses answers that are clearly in the docs, and hallucinates when the knowledge base has nothing to say.

Spring AI actually ships a pretty solid set of primitives for doing RAG properly. The goal of this article is to walk through those pieces as an end-to-end workflow, and show where you can twist the knobs in real projects.

We’ll follow the real life-cycle of a RAG system:

  1. Indexing / ETL – document ingestion, cleaning, chunking, metadata
  2. Vectorization & storage – embeddings, vector DBs, batching
  3. Retrieval – pre-retrieval query shaping, semantic search, filters, merging
  4. Generation – query + context orchestration, error handling, and advisors
  5. Tuning & advanced patterns – thresholds, chunk sizes, hybrid retrieval, and more

All examples are in Java/Spring, but the ideas carry over to any stack.


Step 1: ETL – Turning Messy Docs into AI-Native Knowledge

A lot of RAG failures are caused before the first token hits the LLM: the documents themselves are a mess.

Make “AI-native” documents first

If you’re serious about RAG, stop thinking of your PDFs as ground truth. Instead, think in terms of AI-native documents: clean, well-structured text (Markdown or HTML), one topic per section, explicit headings, and metadata captured up front.

You can absolutely use an LLM offline to normalize docs into a clean Markdown or HTML format before they ever hit Spring AI.

Spring AI’s document model

Spring AI wraps content as a Document:
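A Document is essentially text plus a metadata map plus an ID. A minimal sketch (constructor details vary slightly across versions, so treat it as illustrative):

Document doc = new Document(
    "Refund policy: items can be returned within 30 days of delivery...",
    Map.of("source", "faq", "category", "returns", "year", 2025));

String id = doc.getId();                          // auto-generated if you don't supply one
Map<String, Object> metadata = doc.getMetadata(); // filterable key/value pairs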

The ETL pipeline is built around three interfaces: DocumentReader (extract), DocumentTransformer (transform), and DocumentWriter (load).
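Roughly, they are thin wrappers over the JDK functional interfaces (the real ones add convenience defaults like read(), transform(), and write()), which is why they compose so cleanly:

public interface DocumentReader extends Supplier<List<Document>> { }

public interface DocumentTransformer extends Function<List<Document>, List<Document>> { }

public interface DocumentWriter extends Consumer<List<Document>> { }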

Extract: DocumentReader

Spring AI ships readers for JSON, text, Markdown, PDF, HTML, and more. For example, a JSON reader that pulls specific paths with JSON Pointer:

@Component
public class ProductJsonReader {

    private final Resource resource;

    public ProductJsonReader(@Value("classpath:products.json") Resource resource) {
        this.resource = resource;
    }

    public List<Document> read() {
        JsonDocumentReaderConfig config = JsonDocumentReaderConfig.builder()
            .withPointers("/products/*/description")
            .withAdditionalMetadata("source", "product-catalog")
            .build();

        JsonDocumentReader reader = new JsonDocumentReader(resource, config);
        return reader.get();
    }
}

The same pattern applies for Markdown, PDFs, emails, videos, GitHub docs, databases, etc. Each reader turns your “source world” into List<Document>.

Transform: chunking, enrichment, formatting

This is where most of the interesting tuning happens.

1. Chunking with TokenTextSplitter

TokenTextSplitter is the workhorse splitter based on token counts + simple heuristics (sentence boundaries, newlines, etc.). It’s a DocumentTransformer, so you can stack it with others:

@Component
public class SmartChunkTransformer {

    public List<Document> split(List<Document> docs) {
        // slightly smaller chunks than the defaults
        TokenTextSplitter splitter = new TokenTextSplitter(
            700,   // target tokens per chunk
            280,   // min chars before we try to break
            8,     // min chunk length to embed
            8000,  // max number of chunks
            true   // keep separators like newlines
        );
        return splitter.apply(docs);
    }
}

Play with chunk size per use-case: smaller chunks give sharper matches for short factual questions, larger chunks preserve enough context for summarization-style answers. Step 5 digs into this trade-off.

2. Metadata enrichment with an LLM

Don’t rely only on raw text similarity. Use an LLM once during ingestion to extract semantic features into metadata, then filter on them cheaply at query time.

Spring AI has KeywordMetadataEnricher and SummaryMetadataEnricher, both built on a ChatModel:

@Component
public class KeywordEnricher {

    private final ChatModel chatModel;

    public KeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<Document> enrich(List<Document> docs) {
        // second argument = number of keywords to extract per chunk;
        // they are stored in metadata under the "excerpt_keywords" key
        KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);
        return enricher.apply(docs);
    }
}

You can chain it right after chunking: split → add keywords → maybe add summaries.
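A sketch of that chain (the no-arg splitter defaults and the summary step are illustrative; SummaryMetadataEnricher writes previous/current/next section summaries into each chunk’s metadata):

List<Document> chunks = new TokenTextSplitter().apply(rawDocs);

List<Document> withKeywords = new KeywordMetadataEnricher(chatModel, 5).apply(chunks);

SummaryMetadataEnricher summarizer = new SummaryMetadataEnricher(chatModel,
    List.of(SummaryMetadataEnricher.SummaryType.PREVIOUS,
            SummaryMetadataEnricher.SummaryType.CURRENT,
            SummaryMetadataEnricher.SummaryType.NEXT));
List<Document> ready = summarizer.apply(withKeywords);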

3. DefaultContentFormatter – underused but powerful

DefaultContentFormatter lets you control exactly how metadata + content are stitched into the final string that goes into the embedding or the prompt.

Example:

DefaultContentFormatter formatter = DefaultContentFormatter.builder()
    .withMetadataTemplate("{key}: {value}")
    .withMetadataSeparator("\n")
    .withTextTemplate("{metadata_string}\n\n{content}")
    // Don’t leak embedding IDs into prompts
    .withExcludedInferenceMetadataKeys("embedding_id", "vector_id")
    .build();

Use it when you want the LLM to see things like:

type: love_advice
status: single
year: 2025

How can I become more attractive in dating? ...

instead of a bare blob of text.

Load: DocumentWriter & ETL wiring

DocumentWriter is just Consumer<List<Document>>. Spring AI ships FileDocumentWriter (for plain files), and every VectorStore is itself a DocumentWriter (embedding + vector DB).

A minimal ETL could look like this:

@Component
public class KnowledgeBaseIndexer {

    private final ProductJsonReader reader;
    private final SmartChunkTransformer splitter;
    private final KeywordEnricher enricher;
    private final VectorStore vectorStore;

    public KnowledgeBaseIndexer(ProductJsonReader reader,
                                SmartChunkTransformer splitter,
                                KeywordEnricher enricher,
                                VectorStore vectorStore) {
        this.reader = reader;
        this.splitter = splitter;
        this.enricher = enricher;
        this.vectorStore = vectorStore;
    }

    public void rebuildIndex() {
        List<Document> raw = reader.read();
        List<Document> chunks = splitter.split(raw);
        List<Document> enriched = enricher.enrich(chunks);
        vectorStore.add(enriched);
    }
}

This pipeline alone already puts you ahead of most “we just embedded everything once” demos.


Step 2: Vectorization & Storage – Choosing Your Retrieval Backbone

Once you have clean, chunked, enriched documents, you need a place to put them.

Spring AI’s VectorStore interface is intentionally simple:

public interface VectorStore extends DocumentWriter {

    void add(List<Document> documents);

    void delete(List<String> ids);

    void delete(Filter.Expression filterExpression);

    List<Document> similaritySearch(SearchRequest request);

    default String getName() {
        return getClass().getSimpleName();
    }
}

The important bit for tuning is SearchRequest:

SearchRequest request = SearchRequest.builder()
    .query("How does Spring AI handle RAG?")
    .topK(5)
    .similarityThreshold(0.75)
    .filterExpression("category == 'spring-ai' && year >= '2024'")
    .build();

List<Document> docs = vectorStore.similaritySearch(request);

Which vector store?

Spring AI ships starters for many backends: in-memory, Redis, Elasticsearch, PGVector, Qdrant, etc. Spring AI Alibaba adds cloud-native ones via DashScope (DashScopeCloudStore).

For backend-heavy Java shops, PGVector on PostgreSQL is incredibly pragmatic: you keep the database you already run, store vectors next to your relational data, and your ops story doesn’t change.

PGVector with Spring AI (hand-rolled config)

Instead of relying on auto-config, you can wire PgVectorStore yourself and pick exactly which EmbeddingModel you want:

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store</artifactId>
    <version>1.0.0-M7</version>
</dependency>

# application.yml
spring:
  datasource:
    url: jdbc:postgresql://YOUR_HOST:5432/rag_demo
    username: rag_user
    password: super_secret
  ai:
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        # dimensions: 1536  # omit to let it derive from the embedding model

Now the config class:

@Configuration
public class PgVectorConfig {

    @Bean
    public VectorStore pgVectorStore(JdbcTemplate jdbcTemplate,
                                     @Qualifier("dashscopeEmbeddingModel")
                                     EmbeddingModel embeddingModel) {

        return PgVectorStore.builder(jdbcTemplate, embeddingModel)
            .dimensions(1536) // match your embedding model
            .distanceType(PgDistanceType.COSINE_DISTANCE)
            .build();
    }
}

A common gotcha: if you use multiple EmbeddingModel beans (e.g., Ollama + DashScope), make sure you qualify the one you actually want for this store.

BatchingStrategy – don’t blow up your embed API

Embedding thousands of chunks in a single call will eventually hit context window or rate limits. Spring AI’s BatchingStrategy lets you split documents into sane batches before embedding:

@Configuration
public class EmbeddingBatchConfig {

    @Bean
    public BatchingStrategy batchingStrategy() {
        return new TokenCountBatchingStrategy(
            EncodingType.CL100K_BASE,
            8192,  // max tokens per batch
            0.15   // leave some safety headroom
        );
    }
}

You can also implement your own BatchingStrategy if your vector DB has, for example, hard throughput limits and you want to throttle inserts explicitly.
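The interface is a single method, List<List<Document>> batch(List<Document>), so a custom strategy stays tiny. A minimal sketch (fixed-size batches are an illustrative choice, not something Spring AI prescribes):

public class FixedSizeBatchingStrategy implements BatchingStrategy {

    private static final int BATCH_SIZE = 64;

    @Override
    public List<List<Document>> batch(List<Document> documents) {
        // group documents into batches of at most BATCH_SIZE
        List<List<Document>> batches = new ArrayList<>();
        for (int i = 0; i < documents.size(); i += BATCH_SIZE) {
            batches.add(documents.subList(i, Math.min(i + BATCH_SIZE, documents.size())));
        }
        return batches;
    }
}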


Step 3: Retrieval – Query Shaping, Filters, and Result Merging

Once your store is populated, you still can’t just do similaritySearch(userText) and call it a day.

Spring AI breaks retrieval into pre-retrieval, retrieval, and post-retrieval stages.

Pre-retrieval: shaping the query

RewriteQueryTransformer – clean up messy user queries

Users don’t speak like search queries. RewriteQueryTransformer uses an LLM to rewrite a noisy query into something more explicit and model-friendly.

@Component
public class QueryRewriter {

    private final QueryTransformer transformer;

    public QueryRewriter(ChatModel chatModel) {
        ChatClient.Builder builder = ChatClient.builder(chatModel);
        this.transformer = RewriteQueryTransformer.builder()
            .chatClientBuilder(builder)
            .build();
    }

    public Query rewrite(String text) {
        return transformer.transform(new Query(text));
    }
}

Plug this into your RAG pipeline right before you call the retriever.

TranslationQueryTransformer – cross-language users, single-language embeddings

If your embedding model is English-only but your users speak Chinese, Spanish, etc., you can stick a TranslationQueryTransformer in front. It’s literally “LLM-as-translation-layer” – simple but not cheap. For production, many teams prefer a dedicated translation API + custom transformer.
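A sketch of the LLM-based variant, assuming the builder exposes a targetLanguage option (check the exact name against your Spring AI version):

QueryTransformer translator = TranslationQueryTransformer.builder()
    .chatClientBuilder(ChatClient.builder(chatModel))
    .targetLanguage("english")
    .build();

// "How can I make myself more attractive?"
Query translated = translator.transform(new Query("怎么提升自己的吸引力?"));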

CompressionQueryTransformer – distill long chat history

Multi-turn chats tend to accumulate context. CompressionQueryTransformer compresses history + latest user message into one standalone query. Perfect when you use conversation history but your vector search only sees the final “intent”.
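A sketch, assuming you build the Query with its history explicitly (the builder’s history(...) option is how the conversation reaches the transformer):

Query withHistory = Query.builder()
    .text("And what about long-distance relationships?")
    .history(new UserMessage("How do I keep a new relationship healthy?"),
             new AssistantMessage("Focus on honest communication and shared routines."))
    .build();

QueryTransformer compressor = CompressionQueryTransformer.builder()
    .chatClientBuilder(ChatClient.builder(chatModel))
    .build();

// e.g. "How do I keep a long-distance relationship healthy?"
Query standalone = compressor.transform(withHistory);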

Retrieval: VectorStoreDocumentRetriever

The main entry point for document retrieval is DocumentRetriever. For vector-based RAG, you usually use VectorStoreDocumentRetriever:

FilterExpressionBuilder b = new FilterExpressionBuilder();

DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
    .vectorStore(vectorStore)
    .similarityThreshold(0.6)
    .topK(4)
    .filterExpression(b.and(
            b.eq("type", "love_advice"),
            b.eq("status", "single"))
        .build())
    .build();

List<Document> docs = retriever.retrieve(new Query("How can I make myself more attractive?"));

Filters here are metadata filters, not semantic. This is why earlier enrichment pays off: you can reduce the search space to just “love_advice + single” before even running similarity.

You can also pass the filter expression dynamically via Query.context if you need per-request logic.
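A sketch of the per-request variant, assuming the FILTER_EXPRESSION context key that VectorStoreDocumentRetriever looks up:

Query query = Query.builder()
    .text("How can I make myself more attractive?")
    .context(Map.of(VectorStoreDocumentRetriever.FILTER_EXPRESSION,
        "type == 'love_advice' && status == 'single'"))
    .build();

List<Document> docs = retriever.retrieve(query);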

Document merging: ConcatenationDocumentJoiner

If you use multiple retrievers (multi-query, multi-source, hybrid search), you’ll end up with many document lists. ConcatenationDocumentJoiner deduplicates and flattens them:

Map<Query, List<List<Document>>> docsPerQuery = ...;

DocumentJoiner joiner = new ConcatenationDocumentJoiner();
List<Document> merged = joiner.join(docsPerQuery);

Under the hood it’s basically: flatten all the nested lists, de-duplicate by document ID, and keep the original order.

It’s simple but exactly what you want right before handing everything to the generation stage.


Step 4: Generation – Advisors, Context, and Error Handling

Spring AI’s RAG story really becomes ergonomic when you wire it into ChatClient advisors.

QuestionAnswerAdvisor – good default, minimal ceremony

QuestionAnswerAdvisor is the fastest way to get RAG working:

Advisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
    .searchRequest(SearchRequest.builder()
        .similarityThreshold(0.7)
        .topK(5)
        .build())
    .build();

String answer = chatClient.prompt()
    .user("如何在三个月内提升社交魅力?")
    .advisors(qaAdvisor)
    .call()
    .content();

The advisor:

  1. Takes the user message
  2. Runs a vector search
  3. Stitches docs + question into a prompt
  4. Calls the model

You can override the prompt template if you want strict instructions (“only answer using the context above, otherwise say you don’t know”).
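A sketch of a stricter template. Recent versions expose a promptTemplate hook on the builder and expect query and question_answer_context placeholders; older milestones use a different hook, so verify against your version:

PromptTemplate strictTemplate = new PromptTemplate("""
    {query}

    Answer ONLY from the context below. If the context does not contain the answer,
    say you don't know.

    ---------------------
    {question_answer_context}
    ---------------------
    """);

Advisor strictQa = QuestionAnswerAdvisor.builder(vectorStore)
    .searchRequest(SearchRequest.builder().similarityThreshold(0.7).topK(5).build())
    .promptTemplate(strictTemplate)
    .build();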

RetrievalAugmentationAdvisor – full modular RAG graph

For more control you switch to RetrievalAugmentationAdvisor. It lets you explicitly plug in query transformers, a query expander, the document retriever, a document joiner, and the query augmenter.

Example with a query rewriter + vector retriever:

Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
    .queryTransformers(RewriteQueryTransformer.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .build())
    .documentRetriever(VectorStoreDocumentRetriever.builder()
        .vectorStore(vectorStore)
        .similarityThreshold(0.55)
        .topK(6)
        .build())
    .build();

Then:

String reply = chatClient.prompt()
    .user("what's the advice of career?")
    .advisors(ragAdvisor)
    .call()
    .content();

ContextualQueryAugmenter – what to do when retrieval finds nothing

By default, RetrievalAugmentationAdvisor is conservative. If retrieval returns no docs, it swaps your user query for an “outside the knowledge base, please refuse” prompt.

You can customize this using ContextualQueryAugmenter:

PromptTemplate emptyContextTemplate = new PromptTemplate("""
You are a relationship advice assistant.
The current question is outside your knowledge base. 
Please respond briefly and politely in English, telling the user:
You can only answer relationship-related questions, 
and invite them to describe their situation more specifically.
""");

ContextualQueryAugmenter augmenter = ContextualQueryAugmenter.builder()
    // allowEmptyContext(false) is the default: when no docs are found, the custom
    // empty-context template below is used instead of the built-in refusal.
    // Set it to true only if you want the original question passed through untouched.
    .allowEmptyContext(false)
    .emptyContextPromptTemplate(emptyContextTemplate)
    .build();

Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
    .documentRetriever(retriever)
    .queryAugmenter(augmenter)
    .build();

This gives you graceful degradation instead of raw hallucination.

A custom advisor factory for a “love coach” bot

Putting it all together, you can hide the complexity behind a small factory:

public final class LoveCoachAdvisorFactory {

    private LoveCoachAdvisorFactory() {}

    public static Advisor forStatus(VectorStore store, String status) {
        DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
            .vectorStore(store)
            .similarityThreshold(0.55)
            .topK(4)
            .filterExpression(new FilterExpressionBuilder()
                .eq("type", "love_advice")
                .eq("status", status)
                .build())
            .build();

        ContextualQueryAugmenter augmenter =
            LoveAppContextualQueryAugmenterFactory.createInstance();

        return RetrievalAugmentationAdvisor.builder()
            .documentRetriever(retriever)
            .queryAugmenter(augmenter)
            .build();
    }
}

Your chat layer doesn’t need to know anything about vector stores anymore; it just picks an advisor based on the user profile.


Step 5: Tuning Playbook – Making RAG Not Suck

Now to the parts you usually end up rediscovering the hard way.

1. Document strategy first, everything else later

If your knowledge base is incomplete or badly structured, no amount of thresholds or LLM trickery will save you.

Checklist:

  1. Does every question you expect users to ask have a document that actually answers it?
  2. One topic per section, with explicit headings the splitter can respect
  3. No stale or duplicated versions of the same document
  4. Metadata (type, status, year, product line) captured at the source

When in doubt, run offline retrieval tests: generate 50–100 realistic questions and see what the retriever actually surfaces.
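A minimal harness for that (the question list and the retriever bean are whatever fits your domain):

List<String> probeQuestions = List.of(
    "How do I ask someone out without being awkward?",
    "What are green flags in a first conversation?");

for (String question : probeQuestions) {
    List<Document> hits = retriever.retrieve(new Query(question));
    System.out.printf("%s -> %d hits%n", question, hits.size());
    hits.forEach(d -> System.out.println("   " + d.getMetadata()));
}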

2. Chunking: avoid both over- and under-splitting

Bad chunking shows up as answers that cut off mid-thought, chunks that carry a heading but not its body, or near-duplicate chunks crowding out the one you actually need.

Practical patterns:

  1. Split on structural boundaries (headings, list items) before falling back to token counts
  2. Keep a modest overlap between adjacent chunks so answers don’t straddle a boundary
  3. Keep tables and code blocks intact instead of slicing through them

If you use Alibaba Cloud Model Studio, enabling intelligent chunking on the knowledge base will apply a similar strategy: first split by sentence markers, then adapt chunk boundaries by semantic coherence instead of length alone. You still need to manually fix any obvious mis-splits in the console.

3. Metadata: design it like an index, not an afterthought

Good metadata makes filtering trivial: fields like type, status, and year let you shrink the candidate set with a cheap boolean filter before any similarity math runs.

Implement metadata as close to the source of truth as possible (e.g. in your CMS or docs repo), then enrich with AI-only fields such as keywords or summary during ingestion.

4. Tuning similarityThreshold and topK

This is where a lot of “RAG feels off” comes from.

Rules of thumb:

  1. Start around similarityThreshold 0.5–0.6 and topK 4–6, then adjust based on real queries
  2. Raise the threshold if irrelevant chunks leak into answers; lower it if obvious matches get dropped
  3. Increase topK only if your prompt budget (and the model’s attention) can absorb the extra context

Always test with a fixed set of labeled queries so you can see whether tuning helps or hurts.

5. Hallucination and refusal behavior

Even with perfect retrieval you’ll get edge cases. Mitigation options:

  1. Strict prompt instructions (“answer only from the provided context”)
  2. A ContextualQueryAugmenter with a polite refusal template for empty results
  3. Citing source documents in the answer so users can verify claims
  4. A higher similarity threshold so borderline chunks never reach the model

6. Multi-query expansion – use, but don’t abuse

MultiQueryExpander can boost recall by generating paraphrased queries, but every variant is an extra LLM call plus an extra vector search, so cost and latency multiply, and the paraphrases often return overlapping hits that crowd the context window.

If you use it, limit to 3–5 variants, deduplicate aggressively, and monitor cost and latency.
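A sketch with a conservative setting (three paraphrases; the original query is included by default):

MultiQueryExpander expander = MultiQueryExpander.builder()
    .chatClientBuilder(ChatClient.builder(chatModel))
    .numberOfQueries(3)
    .build();

List<Query> variants = expander.expand(new Query("How do I become more confident on first dates?"));
// run each variant through the retriever, then merge with ConcatenationDocumentJoiner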


Step 6: Beyond Basics – Hybrid Retrieval & Higher-Level Architectures

Once the basics are solid, you can start layering in more advanced patterns.

Hybrid retrieval: vector + lexical + structured

No single retrieval method is perfect: vector search captures paraphrase and fuzzy meaning but misses exact identifiers, lexical search nails exact terms but not synonyms, and structured filters are precise but only as good as your metadata.

A robust system typically combines them:

  1. Filter by metadata (type, status, year)
  2. Run vector search on the reduced candidate set
  3. Optionally mix in keyword search for exact matches on IDs, names, etc.
  4. Merge and re-rank

Spring AI doesn’t force a single pattern—DocumentRetriever is just an interface. You can write your own “hybrid retriever” that fans out to both vector store and, say, Elasticsearch, then uses ConcatenationDocumentJoiner + custom scoring.
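A sketch of such a retriever (the lexical side is a placeholder for your own Elasticsearch/OpenSearch client; only the vector path and the joiner are Spring AI pieces):

public class HybridDocumentRetriever implements DocumentRetriever {

    private final DocumentRetriever vectorRetriever;
    private final DocumentJoiner joiner = new ConcatenationDocumentJoiner();

    public HybridDocumentRetriever(DocumentRetriever vectorRetriever) {
        this.vectorRetriever = vectorRetriever;
    }

    @Override
    public List<Document> retrieve(Query query) {
        List<Document> vectorHits = vectorRetriever.retrieve(query);
        List<Document> keywordHits = keywordSearch(query.text()); // your lexical search here
        return joiner.join(Map.of(query, List.of(vectorHits, keywordHits)));
    }

    private List<Document> keywordSearch(String text) {
        return List.of(); // placeholder: call Elasticsearch/OpenSearch and map results to Document
    }
}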

Re-ranking and multi-stage pipelines

For large corpora, you may want a two-stage retrieval:

  1. Fast, approximate search (HNSW index, low-dimensional embeddings)
  2. Slow, precise re-ranking with a cross-encoder model (e.g., a reranker that scores each (query, chunk) pair)

The first stage optimizes recall, the second optimizes precision. Spring AI’s modular design makes it straightforward to put the reranker into the post-retrieval step before sending docs to the LLM.
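A sketch of that second stage, with RerankClient standing in for whatever cross-encoder or rerank API you use (it is not a Spring AI class):

public List<Document> rerank(Query query, List<Document> candidates,
                             RerankClient reranker, int keep) {
    return candidates.stream()
        // score each (query, chunk) pair with the cross-encoder;
        // getText() is getContent() on older Spring AI versions
        .map(doc -> Map.entry(doc, reranker.score(query.text(), doc.getText())))
        .sorted(Map.Entry.<Document, Double>comparingByValue().reversed())
        .limit(keep)
        .map(Map.Entry::getKey)
        .toList();
}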

Architecture patterns: central knowledge service

At system level, a neat pattern is to isolate RAG into its own service: it owns ingestion, the vector store, and retrieval tuning, and exposes a narrow API (“index this”, “answer that”) to the rest of the platform.

This gives you: a single place to iterate on chunking, thresholds, and prompts; independent scaling for embedding-heavy workloads; and consumers that never need to know which vector store sits underneath.


Wrap-up

RAG is not just “add embeddings, stir with LLM, ship.” It’s a pipeline: ingest and clean, chunk and enrich, embed and store, shape the query, retrieve and merge, then generate with guardrails.

Spring AI gives Java developers real building blocks instead of a giant black box: DocumentReader / DocumentTransformer / DocumentWriter, VectorStore, DocumentRetriever, Advisor, and a bunch of utilities around them.

If you treat these as serious, tunable components instead of “just configuration,” your RAG system will stop feeling like a fragile demo and start feeling like a real product.