RAG Architecture Blueprint
for Enterprises
From proof of concept to production: the complete technical blueprint covering every design decision that separates enterprise-grade RAG from a weekend project.
End-to-End RAG Architecture
Two parallel pipelines — Ingestion and Query — that must be designed together from day one.
The Gap Between a RAG Demo and a RAG System
Every developer has built a RAG demo. You chunk a PDF, embed chunks, store them in a vector database, retrieve top-k on a query, and inject into a prompt. The demo works. Then you put it in front of real users with real documents — and it starts failing in ways that are difficult to diagnose and harder to fix.
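That demo loop fits in a few lines. The sketch below is a toy version of exactly that pipeline; the `embed` and `llm` callables are pluggable stand-ins (no specific model or library is assumed), and an in-memory list stands in for the vector database:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity, guarding against zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_answer(query, documents, embed, llm, chunk_size=512, top_k=5):
    """The weekend-project pipeline: chunk, embed, store, retrieve, inject."""
    # 1. Chunk every document at a fixed size.
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]
    # 2. Embed and "store" -- an in-memory list standing in for a vector DB.
    index = [(embed(chunk), chunk) for chunk in chunks]
    # 3. Retrieve top-k chunks by cosine similarity to the query embedding.
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(entry[0], query_vec),
                    reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    # 4. Inject the retrieved context into the prompt.
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```

Every failure mode discussed in the rest of this article hides inside one of those four steps.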
Demo
Generic parser, fixed-size chunks, single embedding model, top-5 retrieval
Production System
Layout-aware parsing, semantic chunking, hybrid search, reranking, evaluation framework
Chunking Strategy — The Most Consequential Decision
Chunking determines what the retriever sees. Fixed-size chunking — split every 512 tokens with 64-token overlap — is the default and the worst production choice for most enterprise content.
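Both the baseline and its main alternative fit in a few lines. The sketch below pairs the fixed-size split described above with a semantic-boundary variant that starts a new chunk whenever consecutive-sentence similarity drops; the `embed` callable and the 0.5 threshold are illustrative assumptions, not a prescribed model or value:

```python
import re
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fixed_size_chunks(tokens, size=512, overlap=64):
    """The baseline: split every `size` tokens with `overlap` tokens of overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def semantic_chunks(text, embed, threshold=0.5):
    """Start a new chunk where consecutive-sentence similarity drops below threshold.

    `embed` maps a sentence to a vector; in production it would be a
    sentence-embedding model (which model is left open here).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:  # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```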
Fixed-size
Split at N tokens with overlap. Simple to implement, poor semantic coherence. Use only as a baseline to beat.
❌ Avoid in production

Semantic Chunking
Detect topic shifts by measuring embedding similarity between consecutive sentences. Splits at semantic boundaries. Best for long-form reports and policy documents.
✅ Recommended

Hierarchical
Generate chunks at multiple granularities (paragraph, section, document). Retrieve on the fine-grained chunks, then return the coarser parent for context. Highest precision with the richest context.
✅ Best for enterprise

Structure-aware
Use document headings, section breaks, and list items as natural boundaries. Best for consistently formatted documents like SOPs, contracts, and manuals.
✅ Recommended

Vector Database Selection
The differences between mature vector databases are smaller than the differences between good and bad chunking. Select on scale, filtering, hybrid search support, and deployment model.
| Database | Deployment | Hybrid Search | Best For |
|---|---|---|---|
| pgvector | Self-hosted | BM25 via pg extension | Teams avoiding new infra; existing Postgres users |
| Weaviate | Managed / Self | Native hybrid (BM25 + dense) | Full-featured, rich schema, GraphQL API |
| Qdrant | Managed / Self | Native hybrid | High-speed, Rust performance, filtering |
| Pinecone | Managed only | Sparse-dense via namespaces | Simplest managed experience, lowest ops overhead |
| OpenSearch | Self-hosted | kNN + BM25 | Teams with existing Elastic/OpenSearch stack |
Hybrid Search + Reranking
Why Hybrid Search
Pure semantic search misses exact matches for specific codes and proper nouns. Pure keyword search misses paraphrases. Hybrid — combining both with Reciprocal Rank Fusion — consistently outperforms either approach alone on heterogeneous enterprise content.
Implement hybrid search at the database level, not the application level: it is lower latency and does not require manual score normalization.
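Reciprocal Rank Fusion itself is small: each document earns 1/(k + rank) in every ranking it appears in, with k = 60 the commonly used constant. A minimal application-level sketch for intuition (the function name is mine; database-level implementations do the same math internally):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists (e.g. one dense, one BM25) into a single ranking.

    Only ranks matter, not raw scores -- which is why RRF combines
    heterogeneous retrievers without any score normalization.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers rises above one ranked first by only one of them, which is the behavior you want on heterogeneous content.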
Two-Stage Reranking
Stage 1 (Retrieval): Fast bi-encoder similarity retrieves a broad candidate set. Optimizes for recall; runs at corpus scale.
Stage 2 (Reranking): Accurate cross-encoder scoring of query-document pairs jointly. Too slow for retrieval scale, but runs on the small candidate set. Options: CohereRerank, BGE Reranker, Jina Reranker.
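The two stages compose as a simple funnel. In the sketch below, `bi_score` and `cross_score` are hypothetical callables standing in for a bi-encoder similarity and a cross-encoder such as the rerankers named above; the funnel shape, broad then narrow, is the point:

```python
def two_stage_retrieve(query, corpus, bi_score, cross_score,
                       n_candidates=100, top_k=5):
    """Stage 1 optimizes recall cheaply; stage 2 optimizes precision expensively.

    bi_score runs over the whole corpus (in production: an ANN index lookup);
    cross_score runs only over the small candidate set it survives into.
    """
    # Stage 1: broad, fast candidate retrieval.
    candidates = sorted(corpus, key=lambda doc: bi_score(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2: slow, accurate reranking of the survivors.
    return sorted(candidates, key=lambda doc: cross_score(query, doc),
                  reverse=True)[:top_k]
```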
Building or Scaling a RAG System?
Indigloo builds enterprise RAG on Vertex AI Embeddings, AlloyDB, and Vertex AI Vector Search — with hierarchical chunking, hybrid search, reranking, and a built-in evaluation framework as standard.
Discuss Your RAG Architecture