
RAG Architecture Blueprint for Enterprises

From proof of concept to production: the complete technical blueprint covering every design decision that separates enterprise-grade RAG from a weekend project.


End-to-End RAG Architecture

Two parallel pipelines — Ingestion and Query — that must be designed together from day one.

[Architecture diagram: two parallel pipelines feeding a shared evaluation framework.]

Ingestion pipeline: Sources (PDF, Word/PPT, web pages, databases, email/CRM) → Parser (layout-aware, table extraction, OCR for scans) → Chunker (semantic split, hierarchical, structure-aware) → Embedder (dense vectors, domain-tuned, batch ingestion) → Vector store (dense index, sparse index, metadata, access control; e.g. pgvector, Weaviate), backed by a Metadata store (source, date, author; access permissions; chunk parent links).

Query pipeline: User query (natural language, any complexity) → Query transform (HyDE, decomposition, expansion) → Hybrid retriever (dense search, sparse BM25, RRF fusion) → Reranker (cross-encoder, top-k precision, score normalization) → LLM generator (context-grounded, faithfulness check, citation-aware) → Grounded answer (with citations, faithfulness verified).

Evaluation framework (RAGAS metrics), run automatically on a golden dataset at every pipeline change, before any production deployment:
Faithfulness: claims in the answer vs. the retrieved context.
Answer relevancy: does the answer address the actual question?
Context precision: signal-to-noise ratio in retrieved chunks.
Context recall: ground-truth coverage by retrieval.
Latency / cost: p95 response time; tokens per answer.
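That evaluation box deserves code, because it gates every pipeline change. A minimal sketch using the ragas library; the dataset fields and metric imports follow ragas 0.1-era conventions, and the metrics call an LLM judge under the hood, so verify against the version you pin:

```python
# Minimal ragas evaluation sketch (0.1-era API; the metrics invoke an LLM
# judge, so an API key for your judge model must be configured).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Golden dataset: one row per question, holding the pipeline's answer, the
# retrieved chunks, and a human-written ground truth (values illustrative).
golden = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Policy 4.2: customers may request a refund within 30 days."]],
    "ground_truth": ["30 days from the date of purchase."],
})

report = evaluate(
    golden,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(report)  # wire this into CI and fail the gate on any metric regression
```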
🎯 The Gap Between a RAG Demo and a RAG System

Every developer has built a RAG demo. You chunk a PDF, embed chunks, store them in a vector database, retrieve top-k on a query, and inject into a prompt. The demo works. Then you put it in front of real users with real documents — and it starts failing in ways that are difficult to diagnose and harder to fix.

Demo: generic parser, fixed-size chunks, single embedding model, top-5 retrieval.

Production system: layout-aware parsing, semantic chunking, hybrid search, reranking, evaluation framework.

✂️ Chunking Strategy — The Most Consequential Decision

Chunking determines what the retriever sees. Fixed-size chunking — split every 512 tokens with 64-token overlap — is the default and the worst production choice for most enterprise content.

Fixed-size

Split at N tokens with overlap. Simple to implement, poor semantic coherence. Use only as a baseline to beat.

❌ Avoid in production
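For reference, the baseline is only a few lines. A minimal sketch, assuming tiktoken as the tokenizer (any tokenizer works):

```python
# Fixed-size baseline: split every `size` tokens with `overlap` tokens of
# overlap between consecutive chunks.
import tiktoken

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap  # advance 448 tokens per chunk at the defaults
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```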
Semantic chunking

Detect topic shifts by measuring embedding similarity between consecutive sentences. Splits at semantic boundaries. Best for long-form reports and policy documents.

✅ Recommended
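A minimal sketch of the core idea, assuming sentence-transformers; the model name and the 0.75 threshold are illustrative and should be tuned on your corpus:

```python
# Semantic chunking sketch: embed consecutive sentences and split wherever
# cosine similarity drops below a threshold (a likely topic shift).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    embeddings = model.encode(sentences, convert_to_tensor=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Low similarity between adjacent sentences signals a boundary.
        sim = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```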
Hierarchical

Generate chunks at multiple granularities (paragraph, section, document). Retrieve on fine-grained, return coarser parent for context. Highest precision + context richness.

✅ Best for enterprise
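A minimal sketch of the retrieve-fine, return-coarse step; `Chunk` and the parent lookup are hypothetical stand-ins for your vector and metadata stores:

```python
# Hierarchical (parent-child) sketch: index fine-grained child chunks for
# precise retrieval, then hand the LLM the coarser parent section.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: str | None  # paragraph chunks point at their parent section

def expand_to_parents(child_hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Deduplicate retrieved children up to their parent sections."""
    seen, contexts = set(), []
    for child in child_hits:
        pid = child.parent_id
        if pid and pid not in seen:
            seen.add(pid)
            contexts.append(parents[pid])  # full section, not just the hit
        elif pid is None:
            contexts.append(child.text)    # top-level chunk, return as-is
    return contexts
```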
Structure-aware

Use document headings, section breaks, and list items as natural boundaries. Best for consistently formatted documents like SOPs, contracts, and manuals.

✅ Recommended
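A minimal sketch for Markdown-like sources, assuming `#`-style headings mark the section boundaries:

```python
# Structure-aware sketch: split a document on its headings so each chunk is
# a complete section, heading included.
import re

def structure_chunks(doc: str) -> list[str]:
    # Split immediately before every heading line; the lookahead keeps the
    # heading attached to the section body that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", doc)
    return [s.strip() for s in sections if s.strip()]
```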
🗄️ Vector Database Selection

The differences between mature vector databases are smaller than the differences between good and bad chunking. Select on scale, filtering, hybrid search support, and deployment model.

Database     Deployment        Hybrid Search                   Best For
pgvector     Self-hosted       BM25 via pg extension           Teams avoiding new infra; existing Postgres users
Weaviate     Managed / self    Native hybrid (BM25 + dense)    Full-featured; rich schema; GraphQL API
Qdrant       Managed / self    Native hybrid                   High-speed Rust core; strong filtering
Pinecone     Managed only      Sparse-dense via namespaces     Simplest managed experience; lowest ops overhead
OpenSearch   Self-hosted       kNN + BM25                      Teams with an existing Elastic/OpenSearch stack
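To make the pgvector row concrete, a minimal retrieval query, assuming a hypothetical `chunks` table with `content` and `embedding` columns:

```python
# pgvector retrieval sketch (schema is an assumption: chunks(content text,
# embedding vector(1536))).
import psycopg2

def retrieve(query_embedding: list[float], top_k: int = 5) -> list[str]:
    conn = psycopg2.connect("dbname=rag")
    with conn, conn.cursor() as cur:
        # `<=>` is pgvector's cosine-distance operator; lower is closer.
        cur.execute(
            """
            SELECT content
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(query_embedding), top_k),
        )
        return [row[0] for row in cur.fetchall()]
```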
🔀 Hybrid Search + Reranking

Why Hybrid Search

Pure semantic search misses exact matches for specific codes and proper nouns. Pure keyword search misses paraphrases. Hybrid — combining both with Reciprocal Rank Fusion — consistently outperforms either approach alone on heterogeneous enterprise content.

Implement hybrid search at the database level, not the application level: it is lower latency and avoids manual score normalization.
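For intuition, here is what the fusion step computes, sketched at the application level even though production fusion should live in the database:

```python
# Reciprocal Rank Fusion over two ranked lists of document IDs.
def rrf_fuse(dense_ranking: list[str], sparse_ranking: list[str], k: int = 60) -> list[str]:
    """Fuse dense and sparse rankings; k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked highly
            # in either list accumulate a large fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, the dense and sparse scores never need to share a scale, which is exactly why no manual normalization is required.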

Two-Stage Reranking

Stage 1 (Retrieval): Fast bi-encoder similarity — retrieves a broad candidate set. Optimizes for recall. Runs at scale.

Stage 2 (Reranking): Accurate cross-encoder scoring of query-document pairs jointly. Too slow for retrieval scale, but runs on the small candidate set. Options: CohereRerank, BGE Reranker, Jina Reranker.
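A minimal two-stage sketch with sentence-transformers; the model names are illustrative, and in production you would embed the corpus once at ingestion rather than per query:

```python
# Stage 1 (bi-encoder, recall) + Stage 2 (cross-encoder, precision).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("BAAI/bge-reranker-base")

def retrieve_and_rerank(query: str, corpus: list[str],
                        fetch_k: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1: fast bi-encoder similarity over the whole corpus.
    corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=fetch_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    # Stage 2: cross-encoder scores each (query, candidate) pair jointly,
    # which is too slow for the full corpus but fine for fetch_k candidates.
    scores = cross_encoder.predict([(query, c) for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:top_k]]
```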

Building or Scaling a RAG System?

Indigloo builds enterprise RAG on Vertex AI Embeddings, AlloyDB, and Vertex AI Vector Search — with hierarchical chunking, hybrid search, reranking, and a built-in evaluation framework as standard.

Discuss Your RAG Architecture