
Production RAG Over a Million Documents: Architecture That Actually Works

What changes when your corpus is 1M+ documents instead of 1K. Embedding choices, retrieval strategy, infrastructure cost, and the corner cases that bite you at scale.

By LLMDex Editorial

Most RAG tutorials assume a corpus of 1,000 documents. The architecture they describe (chunk, embed, top-k retrieve, generate) works fine at that scale. It also works fine at 10,000 documents. It starts breaking around 100,000. By 1 million documents, almost every part of the standard RAG pipeline needs to change: the chunking strategy, the embedding model, the vector store, the retrieval algorithm, the reranking layer, the synthesis step, all of them.

This article is a working architecture for production RAG over a million-plus documents. It's based on real deployments we've worked on (legal corpora, customer support knowledge bases, internal documentation) and grounded in 2026's model landscape.

What changes at million-document scale

Three practical shifts.

Storage and indexing become real engineering

A million documents averaging 5KB each is 5GB of raw text. After embedding (typically ~4KB per chunk for a 1024-dim float32 embedding, a few chunks per document), you're at 5-50GB of vector data. This is no longer something that fits comfortably in a single Postgres database with pgvector; you need a real vector database (Pinecone, Weaviate, Qdrant) or a serious self-hosted setup.
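
A quick back-of-envelope sizing check. The chunks-per-document count and embedding dimension below are illustrative assumptions, not measurements:

```python
# Back-of-envelope index sizing; every constant here is an assumption to adjust.
NUM_DOCS = 1_000_000
AVG_DOC_BYTES = 5 * 1024      # ~5KB of raw text per document
CHUNKS_PER_DOC = 3            # ~1500-token chunks over a short document
EMBEDDING_DIM = 1024
BYTES_PER_FLOAT = 4           # float32

raw_text_gb = NUM_DOCS * AVG_DOC_BYTES / 1e9
vector_gb = NUM_DOCS * CHUNKS_PER_DOC * EMBEDDING_DIM * BYTES_PER_FLOAT / 1e9

print(f"raw text: ~{raw_text_gb:.1f} GB")   # ~5 GB
print(f"vectors:  ~{vector_gb:.1f} GB")     # ~12 GB, before index overhead
```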

Indexing time matters. Re-embedding the entire corpus when you change embedding models takes hours to days at this scale. Plan for it.

Retrieval quality dominates everything

At 1K documents, top-25 retrieval almost always contains the right answer somewhere; the synthesis step does the heavy lifting. At 1M documents, top-25 retrieval often misses the right document entirely. The retrieval layer becomes the binding quality constraint.

Two consequences: better embeddings matter more, and reranking matters more. Both add cost.

Per-query economics matter more

At 1K documents, you can afford to send 25 retrieved chunks to Claude Opus 4.7 for synthesis. At 1M documents with high query volume, the synthesis cost dominates your monthly bill. You need to think harder about per-query token economics.

The architecture

Here's a working pipeline for 1M-document scale:

Document processing

For each document:

  1. Parse to clean text. PDF, HTML, Markdown, Word, etc. all become Unicode text with structural cues preserved (headings, lists, tables).
  2. Identify document-level metadata. Title, author, date, source URL, document type, etc. This metadata is critical for filtering at retrieval time.
  3. Chunk thoughtfully. Not by fixed-size token windows, but by semantic units: paragraphs, sections, list items. Larger chunks (1000-2000 tokens) work better at scale than smaller ones because they reduce the chunk-boundary error rate (a sketch follows this list).
  4. Embed each chunk with a current embedding model. We'd recommend OpenAI's text-embedding-3-large, Cohere's embed-v3, or Voyage AI's voyage-3; all are competitive. Avoid older or smaller embeddings; the quality gap is real at scale.
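
A minimal sketch of step 3, splitting on structural boundaries rather than fixed token windows. The token budget, the chars-per-token heuristic, and the heading regex are illustrative assumptions:

```python
import re

MAX_CHUNK_TOKENS = 1500                 # assumed budget
MAX_CHUNK_CHARS = MAX_CHUNK_TOKENS * 4  # rough ~4 chars/token heuristic

def chunk_document(text: str) -> list[str]:
    """Split on blank lines and headings, then pack units greedily up to the budget."""
    # Treat blank lines and Markdown-style headings as semantic boundaries.
    units = re.split(r"\n\s*\n|\n(?=#{1,6} )", text)
    chunks, current = [], ""
    for unit in units:
        unit = unit.strip()
        if not unit:
            continue
        # Close the current chunk if adding this unit would blow the budget.
        if current and len(current) + len(unit) > MAX_CHUNK_CHARS:
            chunks.append(current.strip())
            current = ""
        current += unit + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```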

The embedding model choice matters more than the vector database choice. Pinecone vs Weaviate vs Qdrant is a decision about operations and cost; embedding model is a decision about quality.

Storage

Use a real vector database. Pinecone serverless is the lowest-friction option for most teams. Self-hosted Qdrant and Weaviate are competitive on cost at scale (>10M vectors).

Three storage tips:

Filter inside the vector database, not after retrieval. Most vector databases support metadata filters; use them aggressively. If a query is "what does our legal team say about X," you don't want to retrieve from engineering docs. Document-level filters cut retrieval ambiguity dramatically.
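
Using the Qdrant Python client as one example (the collection name, the "team" metadata field, and the placeholder query vector are assumptions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")  # assumed self-hosted instance

# Placeholder: in practice this is the embedded user query.
query_vector = [0.0] * 1024

# "team" is a hypothetical metadata field written at indexing time.
hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="team", match=MatchValue(value="legal"))]
    ),
    limit=50,
)
```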

Reindex on embedding model upgrades. When a new generation of embedding models comes out (every 12-18 months), reindex. The quality gain is meaningful and your competitors will reindex too.

Sample evals before changing things. Maintain a held-out set of 100-500 query/expected-answer pairs. Run it whenever you change anything in the pipeline. Don't deploy changes that regress.
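
A minimal sketch of that loop. The file format, the answer_query callable, and the substring check standing in for real judging are all assumptions:

```python
import json

def run_evals(eval_path: str, answer_query) -> float:
    """Run the held-out query set through the pipeline and report the hit rate.

    Each line of eval_path is JSON like {"query": ..., "expected": ...}.
    The substring check is a crude stand-in for whatever judging you actually use.
    """
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]
    hits = sum(
        1 for case in cases
        if case["expected"].lower() in answer_query(case["query"]).lower()
    )
    print(f"{hits}/{len(cases)} correct ({hits / len(cases):.1%})")
    return hits / len(cases)
```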

Retrieval

The retrieval algorithm matters at scale. Three patterns that work:

Pattern 1: Top-K dense retrieval + reranking. Retrieve the top 50-100 chunks from the vector database, then rerank with a cross-encoder (Cohere Rerank v3, Voyage rerank, or BGE-Reranker-V2 for self-hosting) to produce the top 5-10 for synthesis. This is the default architecture and works well for most workloads.
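
A sketch of Pattern 1's rerank step using the self-hosted option named above, BGE-Reranker-V2 via sentence-transformers (the top_k default is an assumption, and the candidates are assumed to come from first-stage dense retrieval):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, chunk) pair jointly, which is more
# accurate than the bi-encoder similarity used for first-stage retrieval.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```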

Pattern 2: Hybrid (dense + keyword) retrieval. Combine dense vector retrieval with BM25 keyword retrieval. Merge the result sets, rerank to top-K. Helpful when queries contain specific identifiers (product names, error codes, legal citations) that pure dense retrieval can miss.
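
One common way to do Pattern 2's merge is reciprocal rank fusion; a small sketch (the inputs are chunk IDs ranked by each retriever, and k=60 is the conventional constant):

```python
def reciprocal_rank_fusion(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranked):
            # Items ranked highly by either retriever get the largest boosts.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```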

Pattern 3: Multi-step retrieval. First retrieve broadly, then use a small model to identify which retrieved documents are most relevant, then retrieve a deeper sample from those. Adds latency and cost but improves recall on multi-hop queries.

For most workloads, Pattern 1 is fine. Reach for Pattern 2 if you have queries that contain specific identifiers. Reach for Pattern 3 only if you have multi-hop questions ("compare X across Y") that Pattern 1 fails on.

Synthesis

This is where 2026 model choice matters. Three options:

Option A: Long-context whole-document synthesis with Gemini 3 Pro. Retrieve top-5 documents, load each in full into Gemini's 1M-token window, generate. This works well for analyst-style questions where the synthesis quality matters more than per-query cost. We've written about our own RAG rebuild on this pattern.

Option B: Chunk-level synthesis with Claude Sonnet 4.6 or GPT-5 mini. Retrieve top-15 chunks, send all of them to the model in a 30-50K-token context, generate. This is the lower-cost path and works well for chatbot-style use cases.

Option C: Hierarchical synthesis for very long-form answers. Retrieve top-50 chunks, group by document, synthesize per-document summaries first, then synthesize the final answer from the per-document summaries. Multi-step, more expensive, but produces better long-form output.

For most production RAG, Option B is the right default. Option A for analyst-style use cases. Option C for genuinely long-form output (multi-page reports, etc.).
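
As a sketch of Option B's synthesis call, here is the shape of the prompt assembly with the Anthropic Python SDK. The model id string, prompt wording, and citation convention are assumptions, not a canonical implementation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def synthesize(query: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite sources as [n].
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    message = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model id; substitute whatever you deploy
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Answer using only the sources below. Cite them as [n].\n\n"
                f"Sources:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return message.content[0].text
```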

Cost at scale

Order of magnitude per query:

  • Embedding the query: $0.00001 (negligible)
  • Vector retrieval (Pinecone serverless): $0.00005
  • Reranking (Cohere): $0.002 per query
  • Synthesis (Claude Sonnet 4.6, ~10K input + 1K output tokens): $0.04
  • Total: ~$0.04 per query.

At 100K queries/month, that's $4,000/month. At 1M queries/month, $40,000/month. Synthesis dominates the cost; rerank is the second-biggest line item.
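
The same arithmetic as a sketch, so you can swap in your own volumes and unit prices (the figures mirror the order-of-magnitude list above; they're illustrations, not quotes):

```python
# Per-query cost model; unit costs are the order-of-magnitude figures above.
COSTS = {
    "query_embedding":  0.00001,
    "vector_retrieval": 0.00005,
    "rerank":           0.002,
    "synthesis":        0.04,   # ~10K input + 1K output tokens on a mid-tier model
}

per_query = sum(COSTS.values())
for volume in (100_000, 1_000_000):
    print(f"{volume:>9,} queries/month -> ${per_query * volume:,.0f}/month")
# Roughly $4K and $42K with these figures; synthesis is ~95% of the total.
```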

The two biggest cost levers:

  • Cheaper synthesis model. GPT-5 mini drops the synthesis cost by 4-5x vs Sonnet 4.6 with usually-acceptable quality. DeepSeek-V3 drops it by another 2-3x.
  • Skip reranking on simple queries. Use a small model to classify "does this query need rerank?" If no, retrieve top-10 and synthesize directly.

The two biggest cost traps:

  • Long context on every query. Gemini's 1M-token RAG is great but expensive at high query volume. Use it only on queries that benefit.
  • Re-running expensive embeddings. If you embed the same document repeatedly (say, on document edits), batch updates rather than re-embedding live.

The corner cases that bite

Three things that always come up at scale:

Stale documents. A million-document corpus has documents that were correct in 2022 and aren't anymore. Without metadata filters by date, your bot will confidently cite outdated info. Solution: always include lastUpdated in document metadata; weight retrieval toward more-recent documents; surface dates in the synthesized answer so users can sanity-check.
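
One simple way to implement the recency weighting, blending vector similarity with an exponential freshness decay (the half-life and blend weight are assumptions to tune against your evals):

```python
import math
from datetime import datetime, timezone

HALF_LIFE_DAYS = 365   # assumed: a document loses half its freshness weight per year
RECENCY_WEIGHT = 0.2   # assumed blend between similarity and freshness

def recency_adjusted_score(similarity: float, last_updated: datetime) -> float:
    """last_updated is the document's lastUpdated metadata, timezone-aware."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    freshness = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    return (1 - RECENCY_WEIGHT) * similarity + RECENCY_WEIGHT * freshness
```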

Duplicate documents. Most large corpora have duplicates and near-duplicates. The same FAQ exists in three knowledge bases. Without dedup, top-K retrieval surfaces the same document three times and the synthesis is lossy. Solution: dedup at indexing time (content hash), or at retrieval time (cosine-similarity threshold).
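
A sketch of the retrieval-time variant, dropping near-duplicates among the retrieved chunks by cosine similarity (the 0.95 threshold is an assumption; indexing-time dedup by content hash is cheaper still):

```python
import numpy as np

def drop_near_duplicates(chunks: list[str], embeddings: np.ndarray,
                         threshold: float = 0.95) -> list[str]:
    """Keep a retrieved chunk only if it isn't near-identical to one already kept."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_idx: list[int] = []
    for i in range(len(chunks)):
        if all(float(normed[i] @ normed[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [chunks[i] for i in kept_idx]
```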

Adversarial / private content. A million-document corpus often contains private user data, tickets, internal policy docs. Make sure your retrieval pipeline doesn't accidentally surface private content to unauthorized users. Solution: per-document access-control metadata, enforced at retrieval time. Don't trust the LLM to filter; trust the database.

Concrete recommendation

If you're building production RAG over 1M+ documents, start here:

  1. Document processing: Semantic chunking, ~1500-token chunks, document-level metadata (title, date, type, access).
  2. Embeddings: OpenAI text-embedding-3-large or Voyage AI voyage-3.
  3. Vector store: Pinecone serverless (low ops) or Qdrant (lower long-term cost).
  4. Retrieval: Top-50 dense retrieval → Cohere Rerank v3 → top-5 for synthesis.
  5. Synthesis: Claude Sonnet 4.6 default; Gemini 3 Pro for long-form analyst queries; GPT-5 mini for cost-sensitive workloads.
  6. Evals: A held-out set of 200+ query/expected-answer pairs, run on every change.

The biggest mistake teams make is over-engineering before they have evals. Build the eval suite first, then tune the architecture against it. Don't tune blind.
