Building a Research Agent That Actually Researches
Deep Research, Perplexity Pro Search, and the homebrew alternatives. What architecture works in 2026, what models to use, and the pitfalls that produce confident-sounding nonsense.
A research agent is the most ambitious shape of AI assistant: take a research question, search the web (and possibly internal data), gather sources, synthesize, and produce a structured output. OpenAI's Deep Research, Anthropic's research workflows, Perplexity Pro Search, and Google Gemini's Deep Research are all production examples. They mostly work. They occasionally produce confident, plausible nonsense.
This piece is a working architecture for shipping a research agent that produces output worth reading. It walks through the architecture, the model choices, and the failure modes that distinguish "useful research tool" from "confident hallucination machine."
What's hard about research agents
Three things separate research agents from simpler RAG pipelines:
Open-ended scope. A RAG pipeline answers questions whose answers exist in your corpus. A research agent answers questions whose answers are scattered across the web in unpredictable places. The retrieval problem is fundamentally harder.
Source quality variance. Web sources include New York Times articles, Reddit threads, marketing fluff, deliberate disinformation, and AI-generated SEO spam. The agent has to distinguish authoritative sources from junk and weight accordingly.
Synthesis depth. Output is typically a 3-15 page report, not a paragraph. Maintaining coherence, avoiding contradictions, and producing genuinely insightful synthesis across many sources is hard.
The architecture has to handle all three.
The architecture
A research agent that actually works has roughly six stages:
- Question decomposition. Take the user's open-ended question and break it into 5-15 specific sub-questions.
- Per-sub-question search. For each sub-question, run a web search and retrieve candidate sources.
- Source evaluation. Filter candidates by source authority, recency, and relevance.
- Per-source extraction. Read each authoritative source and extract relevant facts.
- Cross-source synthesis. Combine extracted facts into a coherent answer to each sub-question.
- Final synthesis. Combine sub-question answers into a final report.
Each stage has its own model choices and pitfalls.
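In code, the pipeline is a fan-out over sub-questions followed by a fan-in at synthesis. The sketch below uses placeholder function names for the six stages (not a specific library's API); each stage is fleshed out in the sections that follow.

```python
# A minimal skeleton of the six-stage pipeline. Function names are
# placeholders for the stages described below, not a real library's API.

def run_research_agent(question: str) -> str:
    sub_questions = decompose(question)                        # 1. flagship model
    sub_answers = []
    for sq in sub_questions:
        candidates = search_web(sq)                            # 2. search API, 10-20 hits
        sources = evaluate_sources(sq, candidates)             # 3. authority/recency filter
        facts = [extract_facts(sq, src) for src in sources]    # 4. mid-tier model per source
        sub_answers.append(synthesize_sub_answer(sq, facts))   # 5. flagship model
    return synthesize_report(question, sub_answers)            # 6. flagship model
```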
Question decomposition
Use a flagship model (Claude Opus 4.7, GPT-5.5) for this. The decomposition quality determines everything downstream.
The output should be 5-15 specific, searchable sub-questions. Bad decomposition turns "Tell me about X" into "What is X?" and "Why is X important?", which are too generic to search well. Good decomposition turns it into "Who is the leading research group on X as of 2025?", "What was published on X at NeurIPS 2024?", "What companies are commercializing X?", and so on.
Tune the prompt to produce specific, factual sub-questions. Add a few-shot example or two of well-decomposed questions.
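A minimal decomposition call might look like the sketch below, assuming the OpenAI Python client. The model name, prompt, and JSON-array output convention are placeholders you'd adapt to your stack; a production version would add the few-shot examples mentioned above and validate the parsed output.

```python
import json
from openai import OpenAI

client = OpenAI()

DECOMPOSE_PROMPT = """Break the research question below into 5-15 specific,
searchable sub-questions. Prefer dated, named-entity questions over generic
ones. Return only a JSON array of strings.

Question: {question}"""

def decompose(question: str) -> list[str]:
    # Model name is a placeholder for whichever flagship model you use.
    resp = client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(question=question)}],
    )
    return json.loads(resp.choices[0].message.content)
```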
Per-sub-question search
For each sub-question, run a web search. The big choices:
Search engine. Brave Search API, Bing Search API, or specialized search APIs like Tavily are the standard options. Tavily is the most widely used because it's tuned specifically for AI agents: it returns relevant snippets and often pre-extracts page content.
Number of results. Retrieve 10-20 candidates per sub-question. You'll filter aggressively in the next step.
Date range. Most research questions benefit from recent sources. Filter to last 12-24 months unless the question is historical.
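With Tavily, the search step reduces to a single API call per sub-question. The sketch below assumes the tavily-python client; verify the response field names against the current API docs, and apply the date filter either via the API's options or on the returned metadata.

```python
from tavily import TavilyClient  # assumes the tavily-python package

tavily = TavilyClient(api_key="YOUR_API_KEY")

def search_web(sub_question: str, max_results: int = 15) -> list[dict]:
    # Expected response shape: {"results": [{"url": ..., "title": ..., "content": ...}, ...]}
    # -- check against Tavily's current documentation.
    resp = tavily.search(query=sub_question, max_results=max_results)
    return resp.get("results", [])
```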
Source evaluation
This is the under-invested part of most agent stacks. Bad source evaluation produces confident-sounding nonsense.
Three filters that matter:
Domain authority. Maintain a list of "high-trust" domains (NYT, BBC, academic institutions, official documentation, established trade publications) and a list of "always-discard" domains (known content farms, AI-generated spam sites). Score sources accordingly.
Source type. A blog post, a research paper, a press release, and a news article are all different. Prefer primary sources. Don't treat press releases as factual sources; use them only to identify the primary sources they describe.
Recency for fast-moving topics. A 2023 article on LLM benchmarks is likely outdated. Maintain a "topic decay" model: for AI/tech topics, prefer sources from the last 12 months. For historical topics, this filter is unnecessary.
The output of this stage is a filtered list of 3-7 high-quality sources per sub-question. If the filtered list is empty, surface that to the user: "I couldn't find authoritative sources for X" is a valid output.
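A minimal version of the authority filter is just a scored allow/deny list over domains, as sketched below. The domain lists are illustrative; a production version would also score source type and publish date (the "topic decay" model above).

```python
from urllib.parse import urlparse

# Illustrative lists -- maintain and expand these over time.
HIGH_TRUST = {"nytimes.com", "bbc.com", "arxiv.org", "nature.com"}
ALWAYS_DISCARD = {"known-content-farm.example"}

def evaluate_sources(sub_question: str, candidates: list[dict], keep: int = 7) -> list[dict]:
    scored = []
    for c in candidates:
        domain = urlparse(c["url"]).netloc.removeprefix("www.")
        if domain in ALWAYS_DISCARD:
            continue  # junk never makes it past this stage
        score = 2.0 if domain in HIGH_TRUST else 1.0
        scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```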
Per-source extraction
For each filtered source, extract the relevant facts. Use a mid-tier model (GPT-5 mini, Claude Sonnet 4.6) for this; a flagship model is overkill for extraction.
The output is a structured fact list per source: claim, supporting evidence (quote), confidence score.
Two pitfalls:
- Hallucinating extractions. The model invents a quote that's not actually in the source. Mitigation: have the model cite specific text from the source, then verify the citation against the actual source text (a minimal check is sketched after this list).
- Missing nuance. A source says "X is true under conditions Y." The extraction drops the conditions. Mitigation: longer extraction prompts that ask for caveats and context.
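A cheap guard against the first pitfall is to check every extracted quote against the source text before keeping the claim. The sketch below assumes the extraction prompt asks for verbatim quotes and that each fact carries a "quote" field; a fuzzy-match threshold would be more robust to minor paraphrase.

```python
import re

def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_in_source(quote: str, source_text: str) -> bool:
    # Whitespace- and case-insensitive containment check.
    return _normalize(quote) in _normalize(source_text)

def verify_extractions(facts: list[dict], source_text: str) -> list[dict]:
    # Each fact is assumed to look like {"claim": ..., "quote": ..., "confidence": ...}.
    return [f for f in facts if quote_in_source(f.get("quote", ""), source_text)]
```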
Cross-source synthesis
For each sub-question, combine the extracted facts from multiple sources into a single answer. Note conflicts between sources explicitly.
The output should include:
- The answer to the sub-question
- Citations to specific sources
- Notes on disagreements between sources
- Confidence rating
Use a flagship model (Claude Opus 4.7, GPT-5.5) for synthesis. The reasoning across sources is hard.
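One way to keep this stage honest is to force its output into an explicit structure, so a missing citation or an absent confidence rating is detectable rather than silently dropped. An illustrative schema (not a fixed standard):

```python
from dataclasses import dataclass, field

@dataclass
class SubAnswer:
    sub_question: str
    answer: str                                  # prose answer, with inline [n] markers
    citations: list[str]                         # source URLs, indexed by the [n] markers
    disagreements: list[str] = field(default_factory=list)  # explicit cross-source conflicts
    confidence: str = "medium"                   # "low" | "medium" | "high"
```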
Final synthesis
Combine sub-question answers into a final report. Structure matters: a report with clear sections, internal cross-references, and consistent tone is dramatically more useful than a wall of text.
Output format options:
Markdown report. Universal, easy to render anywhere, supports basic structure.
Structured JSON + rendering. More complex but enables better downstream tooling. Returns structured sections, references, and metadata.
Multi-modal. For research questions that benefit from charts, diagrams, or images, generate or include them. This adds architectural complexity but is sometimes worth it.
Use a flagship model for final synthesis. The coherence and insight at this stage determines whether the report is useful.
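If you go the structured-JSON route, the renderer is trivial. A minimal sketch, assuming each section carries a heading, body, and citation list:

```python
def render_report(title: str, sections: list[dict]) -> str:
    # sections: [{"heading": ..., "body": ..., "citations": [...]}] -- an
    # illustrative structure, not a fixed schema.
    lines = [f"# {title}", ""]
    references: list[str] = []
    for sec in sections:
        lines += [f"## {sec['heading']}", "", sec["body"], ""]
        references += sec.get("citations", [])
    deduped = list(dict.fromkeys(references))    # preserve order, drop duplicates
    lines += ["## Sources", ""] + [f"- {url}" for url in deduped]
    return "\n".join(lines)
```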
The model choices in 2026
Three production-viable stacks:
OpenAI-native: GPT-5.5 for decomposition and synthesis, GPT-5 mini for extraction, Tavily for search, OpenAI's web tool integration. Smoothest end-to-end experience. This is what OpenAI's own Deep Research uses.
Anthropic-native: Claude Opus 4.7 for decomposition and synthesis, Claude Sonnet 4.6 for extraction, Tavily or Brave Search for web. Slightly higher synthesis quality; more complex to integrate web search.
Cost-optimized: Gemini 3 Pro for everything (using its 1M-token context to load full retrieved articles), Tavily for search. Cheaper than the others, especially on long synthesis. Some quality compromise on the hardest reasoning steps.
For most teams, the OpenAI-native stack is the right starting point. Reach for Anthropic if synthesis quality is the binding constraint. Reach for Gemini if cost is.
Common failure modes
Three failure modes that destroy research-agent trust:
Hallucinated citations. The agent cites "Smith et al. 2023" with a plausible-sounding quote. The paper doesn't exist or doesn't say that. Mitigation: enforce that every claim cites a specific source URL; verify URLs resolve; verify quoted text exists in the source.
Source-blindness to bias. The agent treats a marketing post and an academic paper as equivalently authoritative. Output is biased toward whichever sources happened to rank in search. Mitigation: explicit source-authority scoring; weight synthesis toward high-authority sources.
Confident incoherence. The agent produces 10 pages that sound plausible but contain internal contradictions. Mitigation: a post-synthesis self-review pass; have a separate inference call check the final report for contradictions and flag them.
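The self-review pass is just one more inference call over the finished report. A minimal sketch, reusing the same OpenAI-client pattern as the decomposition step; the model name and the empty-array convention are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPT = """You are reviewing a research report for internal consistency.
List every pair of claims that contradict each other, naming the sections
involved. If there are none, return an empty JSON array.

Report:
{report}"""

def review_report(report: str) -> list[str]:
    # Model name is a placeholder for whichever flagship model you use.
    resp = client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(report=report)}],
    )
    return json.loads(resp.choices[0].message.content)
```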
What "good" output looks like
A useful research-agent report has three properties:
Cited. Every factual claim has a specific source citation, and the citations resolve.
Caveated. Areas of uncertainty are flagged ("Sources disagree on X"; "Most recent data is from 2023"). Confident assertions are reserved for well-supported facts.
Actionable. The report ends with explicit conclusions or recommendations, not just a summary of sources. The user shouldn't have to re-read the whole report to extract the takeaway.
A bot that produces all three on most queries is genuinely useful. A bot that misses any one is mostly noise.
Where research agents are headed
Three trends through 2026-2027:
Multi-modal sources. Agents that can read PDFs, watch videos, and analyze images will dominate research workflows in domains where those formats matter (medical research, legal precedent, market analysis).
Custom corpora integration. Hybrid agents that combine web research with the user's private documents (Notion, Google Drive, internal databases) are increasingly common. The integration layer is non-trivial but the value is real.
Realtime research. Agents that monitor sources continuously and surface insights as they arrive (vs being asked one-off questions) are an emerging product category. Expect mature products by end of 2026.
Concrete recommendation
If you're shipping a research agent in 2026, start here:
- Architecture: Decomposition → search → source evaluation → extraction → cross-source synthesis → final synthesis.
- Models: GPT-5.5 for decomposition and final synthesis, GPT-5 mini for extraction, Tavily for search.
- Source evaluation: Maintain a domain-authority list. Filter aggressively. Discard junk early.
- Output structure: Markdown report with explicit citations, caveats, and actionable conclusions.
- Verification: Self-review pass on final output. Verify citations resolve.
The hardest part of a research agent isn't building the pipeline. It's tuning the source evaluation and verification stages so the bot doesn't produce confident nonsense. Invest there. The rest is plumbing.