
Why Gemini 3 Pro Changed Our RAG Stack

2M-token context shifts the architecture. We rebuilt the pipeline. Here's what we kept, what we threw away, and what we learned.

By LLMDex Editorial

RAG architectures in 2024 were fundamentally constrained by context windows. You chunked documents, embedded them, retrieved the top-k, and stuffed them into a 32K or 128K window. Every step of that pipeline was a workaround for an underlying limitation: the model couldn't read your whole corpus.

Gemini 3 Pro's 1M-token native context (with 2M in preview) changed the constraint. We rebuilt our internal documentation-search pipeline from the ground up over three weeks. Here's the architectural diff and what we learned.

The old architecture

Standard RAG, circa 2024:

  • Documents chunked at ~500 tokens with 100-token overlap
  • Each chunk embedded with text-embedding-3-large
  • Vector store: Pinecone (later Qdrant)
  • At query time: embed the query, retrieve the top-25 chunks, rerank to the top-5 with Cohere Rerank, stuff them into a 128K context, and generate the answer with GPT-4o (sketched in code below)
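To make the shape of that pipeline concrete, here is a minimal sketch. The embed, search, rerank, and generate callables are hypothetical stand-ins for the SDK calls named above, not our production code:

```python
from typing import Callable, Sequence

def answer_query_2024(
    query: str,
    embed: Callable[[str], list[float]],                         # text-embedding-3-large
    search: Callable[[list[float], int], Sequence[str]],         # vector store over ~500-token chunks
    rerank: Callable[[str, Sequence[str], int], Sequence[str]],  # Cohere Rerank
    generate: Callable[[str], str],                               # GPT-4o
) -> str:
    """Classic chunk -> retrieve -> rerank -> generate RAG, circa 2024."""
    query_vector = embed(query)
    candidates = search(query_vector, 25)       # retrieve top-25 chunks
    top_chunks = rerank(query, candidates, 5)   # rerank down to top-5
    context = "\n\n".join(top_chunks)           # a few thousand tokens, well inside 128K
    return generate(f"Answer using only this context:\n\n{context}\n\nQuestion: {query}")
```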

The pipeline worked, but it had four chronic pain points:

  1. Chunk-boundary errors. A relevant fact spans two chunks; retrieval returns only one, and the answer comes out wrong.
  2. Multi-hop questions. "Compare X across our top three customers" requires fetching three relevant passages and synthesizing them; retrieval ranking would often miss one.
  3. Tone and style drift. The model couldn't see the surrounding document context, so summaries felt disconnected from the source.
  4. Maintenance. The chunk size and rerank model were levers that needed tuning per corpus. Quality drifted as content grew.

The new architecture

With Gemini 3 Pro's context window, we collapsed three of those pain points by deleting infrastructure:

  • Fewer, larger chunks for retrieval. We still embed and retrieve, because pulling all 50,000 documents into the context for every query is wasteful. But chunks are now 2,000 tokens (4x larger), and we retrieve top-5 instead of top-25. The full flow is sketched after this list.
  • No rerank step. Gemini's long-context recall is good enough that the ranking quality of vector search alone is sufficient when we feed it 5 candidate chunks at 2K each.
  • Whole documents, not chunks, for synthesis. When the retrieval surfaces a chunk, we now load the whole document (median ~30K tokens, max ~200K) into the context window. The model sees the surrounding context naturally.
  • Multi-doc reasoning. For multi-hop questions, we pass 5 full documents, up to ~1M tokens, and let the model do the synthesis. No more chunk-boundary errors. No more retrieval misses on relevant supporting context.
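The rebuilt flow, in the same sketch style. The parent_doc and load_document helpers are hypothetical names for the chunk-to-document mapping and the document store, not the verbatim implementation:

```python
from typing import Callable, Sequence

def answer_query_long_context(
    query: str,
    embed: Callable[[str], list[float]],                  # text-embedding-004
    search: Callable[[list[float], int], Sequence[str]],  # returns chunk IDs over 2K-token chunks
    parent_doc: Callable[[str], str],                     # chunk ID -> parent document ID
    load_document: Callable[[str], str],                  # full document text, median ~30K tokens
    generate: Callable[[str], str],                       # Gemini 3 Pro, 1M-token window
) -> str:
    """Retrieve a handful of chunks, then hand the model whole parent documents."""
    query_vector = embed(query)
    chunk_ids = search(query_vector, 5)                   # top-5 chunks, no rerank step
    doc_ids = list(dict.fromkeys(parent_doc(c) for c in chunk_ids))  # dedupe, keep order
    documents = [load_document(d) for d in doc_ids]       # up to ~1M tokens total
    context = "\n\n---\n\n".join(documents)
    prompt = (
        "Answer the question using the documents below, and cite the source document "
        f"for every claim.\n\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

The only step that really changed relative to the 2024 version is what happens after retrieval: instead of stuffing reranked chunks, each hit is expanded to its parent document before generation.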

The pipeline is simpler, the answers are better, and the cost actually went down on input-heavy workloads, because Gemini 3 Pro's input pricing is aggressive ($2.50 per 1M input tokens).

What we kept

  • Vector search for retrieval. Still essential. We don't dump 50K documents into the context per query; we use embeddings to find the relevant 5-10 documents, then load their full versions.
  • Embeddings. We switched from text-embedding-3-large to Gemini's text-embedding-004 for consistency, but the role is unchanged.
  • Citations in output. We require the model to cite source documents for every claim. This was already standard practice; long context made it easier, not harder. A minimal tagging scheme is sketched below.
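One way to keep citations enforceable at this scale is to tag each full document with a stable ID before it enters the context. A minimal sketch, assuming a hypothetical mapping of document IDs to text rather than our actual schema:

```python
def build_cited_context(documents: dict[str, str]) -> str:
    """Wrap each document in a [doc:<id>] tag so the model can cite sources per claim."""
    blocks = [
        f"[doc:{doc_id}]\n{text}\n[/doc:{doc_id}]"
        for doc_id, text in documents.items()
    ]
    instructions = (
        "Cite the supporting [doc:<id>] tag after every claim. "
        "If no document supports a claim, say so rather than guessing."
    )
    return instructions + "\n\n" + "\n\n".join(blocks)
```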

What we threw away

  • Reranking. Cohere Rerank was a high-ROI line item in 2024. With 5 candidate documents in the context, ranking quality matters less than recall. We saved $200/month and a whole prompting layer.
  • Aggressive chunking. Smaller chunks meant more vector store rows, more embedding cost, and more chunk-boundary failures. Larger chunks plus full-document load eliminated all three.
  • Multi-step retrieval. We had a "first pass for entities, second pass for facts" pipeline. With long context, the model handles it natively in one prompt.

What surprised us

Three things we didn't expect:

1. Citations got better

We expected citation quality to degrade with more context, since more material in the window means more chances for the model to confuse sources. Instead, the opposite happened. With the full source document in context, the model could cite specific paragraphs accurately, and hallucinated citations dropped.

2. Latency increased less than expected

Loading a 200K-token context isn't free, but Gemini 3 Pro's first-token latency on long inputs is genuinely fast (sub-3 seconds in our setup). The time-to-answer for our users went up by maybe 1-2 seconds on heavy queries, which is well below the irritation threshold.

3. Cost went down

This was the biggest surprise. Our 2024 pipeline cost roughly $0.012 per query (vector search + rerank + GPT-4o generation). The new pipeline costs roughly $0.008 per query (vector search + Gemini 3 Pro generation), because Gemini's input pricing is so favorable that loading whole documents is cheaper than the rerank pass we eliminated.

Where long-context RAG still loses

Three failure modes, with a rough routing heuristic sketched after the list:

  1. Truly massive corpora. If your corpus is 100M+ tokens, you can't load all candidate documents. Standard RAG with chunking still wins. We're a 50K-document org; YMMV.
  2. Latency-critical responses. If you need sub-1-second responses (chatbots, real-time interfaces), the long-context path is too slow. Use small-context with aggressive caching.
  3. Cost-bound free-tier products. Gemini 3 Pro's $2.50 / 1M input is competitive but not the cheapest tier. If you're serving anonymous free traffic, Gemini 3 Flash or GPT-5 nano are better picks.
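Those failure modes reduce to a simple routing decision. The thresholds below are illustrative assumptions, not tuned numbers from our deployment:

```python
def choose_pipeline(corpus_tokens: int, latency_budget_s: float, free_tier: bool) -> str:
    """Rough routing between long-context RAG and the older approaches."""
    if corpus_tokens > 100_000_000:    # corpus too large to load candidate docs whole
        return "classic-chunked-rag"
    if latency_budget_s < 1.0:         # real-time chat: the long-context path is too slow
        return "small-context-with-aggressive-caching"
    if free_tier:                      # anonymous free traffic: route to a cheaper model tier
        return "long-context-rag-on-a-flash-tier-model"
    return "long-context-rag"
```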

The deeper takeaway

Long context isn't a marginal feature; it's an architectural shift. The RAG-pipeline complexity that grew up between 2022 and 2024 was a response to context-window scarcity. As that scarcity disappears, the complexity becomes dead weight.

Three principles for RAG architectures in 2026:

  1. Retrieve broadly, generate with full context. Don't pre-chunk what the model can read in full.
  2. Trust the model with multi-doc synthesis. Long-context models are genuinely good at reasoning across multiple documents. The old wisdom of "small chunks, top-3 retrieval, single-doc generation" is an artifact of older constraints.
  3. Pick on input pricing, not output pricing. Long-context workloads are input-heavy. A model with cheap input ($2.50/1M) beats one that competes on cheap output ($1.25/1M) for almost every RAG workload; a worked comparison follows below.
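To see why input pricing dominates, a back-of-the-envelope cost function helps. The token counts in the comment are illustrative of a document-heavy query, not our measured averages:

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Per-query cost in dollars, given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# A document-heavy RAG query might push ~100K tokens in and only ~1K tokens out,
# so the input term is roughly 100x the output term at comparable per-token prices.
# Halving the output price barely moves the total; halving the input price does.
```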

Gemini 3 Pro is the model that flipped these principles for us. Claude Opus 4.7 (200K), GPT-5/-5.5 (400K), and DeepSeek-V3 (128K) are all credible alternatives at smaller windows. Pick on your actual corpus size and per-query economics, not on a benchmark leaderboard.
