Retrieval Augmented Generation (RAG) looks deceptively simple when diagrams collapse it into “documents → embeddings → LLM.”

The operational reality is more opinionated: latency budgets shrink, corpus size grows, vectors evolve, and retrieval becomes a product workload rather than an academic demo.

That is exactly where the question shows up:

Can Amazon S3 function as a vector store and eliminate the need for a dedicated vector database?

I asked myself that question earlier this year when I built a project for a large enterprise engineering team.

The goal was straightforward: ingest Confluence content, generate embeddings, store everything in AWS-native services, and expose a RAG endpoint for customer support agents.

To stay fully inside their compliance boundary, we used Amazon Titan Embeddings G1 - Text for vector generation rather than calling external APIs. The customer also did not want to operate new services or clusters, so the initial design was aggressively minimal: S3 for document and vector storage, Lambda for embedding fan-out, and a retrieval Lambda to run cosine similarity over the corpus. No OpenSearch cluster, no Aurora pgvector, no Pinecone; just what already existed in their IAM and networking controls.
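For context, here is a minimal sketch of that ingestion path, assuming the Bedrock runtime API for Titan Embeddings G1 - Text (model ID `amazon.titan-embed-text-v1`); the bucket name and key layout are illustrative, not the customer's actual naming.

```python
import json
import boto3

# Clients stay inside the customer's AWS account (VPC endpoints, IAM-scoped roles)
bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

def embed_text(text: str) -> list[float]:
    """Generate a 1536-dimensional embedding with Titan Embeddings G1 - Text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

def store_vector(doc_id: str, text: str, bucket: str = "rag-vectors") -> None:
    """Persist the embedding as a small JSON object next to the source document."""
    vector = embed_text(text)
    s3.put_object(
        Bucket=bucket,
        Key=f"vectors/{doc_id}.json",
        Body=json.dumps({"doc_id": doc_id, "embedding": vector}),
    )
```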

At a small scale, the architecture behaved well. With roughly 8,000 embeddings generated by Titan G1, Lambda could hydrate the vector set into memory, run a brute-force NumPy cosine kernel, and respond in ~350–500 ms end-to-end. The latency was already higher than ideal, but acceptable for a POC. The business users were excited because it “worked without standing up anything new,” and security teams were satisfied because everything stayed inside the AWS trust boundary.
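The retrieval path at that stage was essentially the following, shown as a hedged sketch that assumes the corpus is hydrated from S3 as JSON rows into a single float32 matrix; object layout and names are illustrative.

```python
import json
import boto3
import numpy as np

s3 = boto3.client("s3")

def load_corpus(bucket: str, key: str) -> tuple[np.ndarray, list[str]]:
    """Hydrate the full embedding matrix (N x 1536, float32) from one S3 object."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = json.loads(body)
    ids = [row["doc_id"] for row in rows]
    matrix = np.asarray([row["embedding"] for row in rows], dtype=np.float32)
    return matrix, ids

def top_k_cosine(query: np.ndarray, matrix: np.ndarray, ids: list[str], k: int = 10):
    """Brute-force cosine similarity over the whole corpus: O(N x D) per query."""
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = matrix_norm @ query_norm
    best = np.argsort(scores)[::-1][:k]
    return [(ids[i], float(scores[i])) for i in best]
```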

The architecture began collapsing as we expanded the corpus.

When we crossed ~40k embeddings, the brute-force scan inside Lambda became an I/O and cold-start tax. Lambda needed more memory just to pull objects, concurrency spikes triggered throttling, and similarity computation pushed response time into seconds.

Engineers attempted a micro-optimization: sharding S3 keys and fanning out parallel GETs. That reduced wall-clock time for downloads but amplified cost and still lacked an index. At ~80k embeddings, the entire search path became unstable.
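The fan-out looked roughly like the sketch below (shard naming and worker counts are illustrative). Parallelism hides some per-object latency, but every query still moves the entire corpus.

```python
from concurrent.futures import ThreadPoolExecutor
import json
import boto3

s3 = boto3.client("s3")

def fetch_shard(bucket: str, key: str) -> list[dict]:
    """Download and parse one shard of embeddings."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)

def fetch_all_shards(bucket: str, prefix: str = "vectors/shard-", shards: int = 32) -> list[dict]:
    """Parallel GETs cut wall-clock download time, but total bytes per query are unchanged."""
    keys = [f"{prefix}{i:04d}.json" for i in range(shards)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = pool.map(lambda k: fetch_shard(bucket, k), keys)
    return [row for shard in results for row in shard]
```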

That was the inflection point where the limitations of S3’s access pattern were impossible to ignore.

Using S3 to store vectors requires pulling objects through a high-latency network path, deserializing them into local arrays, and executing similarity computation outside the storage layer. Because S3 has no approximate-nearest-neighbor index (HNSW, PQ, IVF, ScaNN) and no locality-aware layouts, every lookup degenerates into O(N × D) cost.

For large Titan G1 embeddings (1536-dimensional) and 100k+ vectors, no amount of BLAS optimization compensates for network overhead and full-scan execution.
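To put a number on that, using the assumptions spelled out at the end of this post (1536 dimensions, float32, roughly 6 KB per vector): a single query over 100,000 vectors has to move about 100,000 × 1536 × 4 bytes ≈ 614 MB before any similarity math runs. The latency figures below follow directly from that arithmetic.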

Embeddings | P50 Latency  | P95 Latency   | P99 Latency   | Notes
-----------|--------------|---------------|---------------|-------------------------------------------
10         | ~200 ms      | ~350 ms       | ~550 ms       | Cold start + S3 round-trip
100        | ~220 ms      | ~420 ms       | ~700 ms       | CPU mostly idle
1,000      | ~280 ms      | ~520 ms       | ~900 ms       | First noticeable cost from vector compute
8,000      | 350–500 ms   | 650–1,100 ms  | 900–1,700 ms  | Lambda memory pressure + payload hydration
40,000     | 1.2–2.0 s    | 2.5–4.5 s     | 3.5–6.5 s     | Full-scan and S3 I/O > compute
80,000     | 2.8–5.5 s    | 6–10 s        | 8–14 s        | GC pauses, throttling
120,000    | 4.5–9.0 s    | 10–18 s       | 14–25 s       | Lambda timeouts in burst traffic
200,000    | 8–16 s       | 18–35 s       | 25–50 s       | Unusable for interactive RAG


S3 Full-Scan Vector Retrieval - Latency vs Corpus Size (Single Query)

Latency was the first visible failure mode. Production RAG workloads aim for <150 ms vector retrieval and <300 ms total completion time. S3 GET operations alone contributed 30–80 ms each. Pulling tens of megabytes per user query pushed the entire request into multi-second latency, making the system unacceptable for customer-facing use cases. At that point, the architecture resembled delayed batch analytics rather than online semantic search.

Concurrency was the second failure mode. The enterprise wanted multi-tenant RAG access: five business units, parallel usage, and bursty agent traffic. Vector databases solve that by sharding indexes, pinning vectors in RAM, memory-mapping ANN graphs, and routing concurrent queries.

S3 provides none of that.


Under load testing, Lambda concurrency throttles surfaced before the model even evaluated a single top-k result.

The third failure mode was content dynamism. The customer updated documentation daily. That required regenerating embeddings, replacing outdated vectors, tracking deltas, and supporting version isolation for compliance review (an explicit requirement).

S3 has no indexing semantics and no incremental update path. Every update required re-hydrating large batches, rebuilding local search structures, and eventually re-publishing Lambdas.
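For illustration, the kind of delta tracking this forces you to hand-roll might look like the sketch below: hash each Confluence page, compare against a manifest object in S3, and re-embed only what changed. The bucket, manifest key, and helper names are hypothetical.

```python
import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "rag-vectors"           # illustrative bucket name
MANIFEST_KEY = "manifest.json"   # maps doc_id -> content hash from the last run

def load_manifest() -> dict:
    try:
        body = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
        return json.loads(body)
    except s3.exceptions.NoSuchKey:
        return {}

def changed_pages(pages: dict[str, str]) -> list[str]:
    """Return the doc_ids whose Confluence content changed since the last run."""
    manifest = load_manifest()
    stale = []
    for doc_id, text in pages.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if manifest.get(doc_id) != digest:
            stale.append(doc_id)
            manifest[doc_id] = digest
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY, Body=json.dumps(manifest))
    return stale
```

Even with the deltas identified, the query path still has to re-hydrate and re-scan the full corpus, which is the part no amount of ingestion tooling fixes.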

At ~100k embeddings, this became operationally infeasible.

Metadata filtering introduced another hard stop. The business wanted to scope retrieval by geography, compliance level, and product classification, and eventually combine BM25 keyword signals with ANN similarity.

That required predicate pushdown, hybrid lexical-ANN scoring, and reranking, none of which S3 supports. Even object-tag metadata cannot support low-latency predicate-based filtering.

Ultimately, we migrated to OpenSearch Vector Engine.

Embeddings | P50 Latency | P95 Latency | P99 Latency | Notes
-----------|-------------|-------------|-------------|---------------------------------------
10         | ~15 ms      | ~25 ms      | ~35 ms      | Index stays fully in memory
100        | ~18 ms      | ~30 ms      | ~45 ms      | ANN traversal cost negligible
1,000      | ~22 ms      | ~38 ms      | ~55 ms      | Cache locality improves consistency
8,000      | ~30 ms      | ~55 ms      | ~75 ms      | Stable tail latency; no full scan
40,000     | ~42 ms      | ~70 ms      | ~95 ms      | HNSW graph traversal; sub-100 ms
80,000     | ~55 ms      | ~88 ms      | ~120 ms     | Slight growth under burst traffic
120,000    | ~63 ms      | ~104 ms     | ~140 ms     | Memory-mapped index keeps latency in check
200,000    | ~75 ms      | ~130 ms     | ~180 ms     | Still viable for interactive RAG


AWS OpenSearch (HNSW) - Latency vs Corpus Size (Single Query)
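A minimal sketch of the index definition behind those numbers, assuming the opensearch-py client and the k-NN plugin's knn_vector field type; the domain endpoint, index name, and HNSW parameters are illustrative rather than the production values.

```python
from opensearchpy import OpenSearch

# Auth/SigV4 wiring omitted; endpoint is illustrative
client = OpenSearch(hosts=[{"host": "search-rag-domain.example.com", "port": 443}],
                    use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},   # enable k-NN search on this index
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},
            "geo": {"type": "keyword"},              # metadata used for filtering
            "compliance_level": {"type": "keyword"},
            "content": {"type": "text"},             # BM25 side of hybrid ranking
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,                   # Titan Embeddings G1 - Text
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 256, "m": 16},
                },
            },
        }
    },
}

client.indices.create(index="confluence-rag", body=index_body)
```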

The move unlocked HNSW indexing, metadata filters, BM25 + ANN hybrid ranking, and sub-200-ms retrieval.
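Retrieval scoped by metadata, sketched against that mapping, looks roughly like this. With the nmslib engine the bool filter behaves as a post-filter; the lucene or faiss engines can push the predicate into the ANN traversal itself. Hybrid BM25 + ANN ranking is left out here; OpenSearch's hybrid query plus a score-normalization pipeline is the usual mechanism.

```python
def search_scoped(client, query_vector: list[float], geo: str, k: int = 10):
    """Top-k ANN search restricted by a metadata predicate (geography)."""
    body = {
        "size": k,
        "query": {
            "bool": {
                "filter": [{"term": {"geo": geo}}],   # predicate applied server-side
                "must": [{"knn": {"embedding": {"vector": query_vector, "k": k}}}],
            }
        },
    }
    return client.search(index="confluence-rag", body=body)
```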

Yes, operational cost increased due to maintaining a search cluster, but overall compute cost dropped, Lambda execution disappeared, and the platform finally behaved like a proper retrieval system.

This project reinforced a simple rule: S3 is excellent as an archival layer and a durable data-lake substrate, not as a real-time retrieval engine. It works well only when:

- the corpus stays small (tens of thousands of vectors at most)
- latency expectations are POC-grade rather than interactive
- content changes infrequently, so there is no pressure for incremental updates
- queries need no metadata filtering, hybrid ranking, or multi-tenant concurrency

These constraints map very closely to AWS Bedrock Knowledge Bases, which prioritize ease of entry instead of high-density semantic search.

The architecture breaks down the moment RAG becomes a platform. When you require:

- sub-second retrieval under concurrent, multi-tenant load
- a corpus that grows past roughly 100k embeddings
- daily content updates with delta tracking and version isolation
- metadata filtering, predicate pushdown, or hybrid BM25 + ANN ranking

You need a vector database.

On AWS this could be OpenSearch Vector Engine or Aurora PostgreSQL with pgvector, though Aurora becomes compute-bound as cardinality scales.

Conclusion

The production architecture that finally stabilized consisted of: S3 for immutable document storage, Step Functions for ingestion + embedding, a vector index for semantic search, and Bedrock for answer synthesis. That allowed rollback, version control, operational scaling, and latency guarantees.

Embeddings | S3 Full Scan  | OpenSearch HNSW  | Why This Happens
-----------|---------------|------------------|------------------------------------------
10         | ~60 KB        | ~10–50 KB        | Fixed overhead dominates
100        | ~600 KB       | ~10–50 KB        | ANN traverses graph, not corpus
1,000      | ~6 MB         | ~10–50 KB        | S3 still manageable
8,000      | ~48 MB        | ~10–50 KB        | Payload hydration begins to dominate
40,000     | ~240 MB       | ~10–50 KB        | Network-bound retrieval
80,000     | ~480 MB       | ~10–50 KB        | Lambda memory + GC pressure
120,000    | ~720 MB       | ~10–50 KB        | Approaches Lambda limits
200,000    | ~1.2 GB       | ~10–50 KB        | Fundamentally non-viable


Data Transferred Per Query, S3 Full Scan vs ANN (OpenSearch HNSW)

Assumptions (stated explicitly):
- Embedding dimension: 1536 (Titan Embeddings G1 - Text)
- Data type: float32 (4 bytes)
- Vector size: 1536 × 4 bytes = 6,144 bytes ≈ 6 KB per vector
- Single query, top-k = 10

S3 full scan: data per query = N × 1536 × 4 bytes

OpenSearch HNSW: data per query ≈ O(log N) × vector size

S3-based vector retrieval requires hydrating the full embedding corpus for every query. With 1536-dimensional float32 embeddings, this results in hundreds of megabytes transferred per query at modest corpus sizes. In contrast, ANN-backed engines such as OpenSearch HNSW traverse a bounded in-memory graph and touch only a small subset of vectors, keeping per-query data movement in the tens of kilobytes. This difference directly explains the observed gap in latency, tail behavior, and cost.
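The same arithmetic, as a quick script against those assumptions, so the table above is easy to audit:

```python
# Back-of-envelope check of the data-per-query table above.
DIM = 1536                             # Titan Embeddings G1 - Text
BYTES_PER_FLOAT = 4                    # float32
VECTOR_BYTES = DIM * BYTES_PER_FLOAT   # 6,144 bytes ≈ 6 KB per vector

def s3_full_scan_bytes(n_vectors: int) -> int:
    """A brute-force scan hydrates every stored vector for a single query."""
    return n_vectors * VECTOR_BYTES

for n in (1_000, 8_000, 40_000, 120_000, 200_000):
    print(f"{n:>7} vectors -> ~{s3_full_scan_bytes(n) / 1e6:,.0f} MB hydrated per query")

# The ANN side stays roughly flat: the HNSW graph is traversed in memory on the
# cluster, so only the query vector and the top-k hits cross the wire.
```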

The lesson is simple:

RAG is a latency-constrained retrieval problem, not a storage problem. Once semantic search becomes part of an online user surface rather than a toy lookup utility, a vector database is no longer optional.