The Five LLM Myths That Won't Die
Reasoning models hallucinate too. Open-weight is not always cheaper. And three more myths the AI Twitter consensus needs to retire.
The AI tooling space moves fast, and the conventional wisdom doesn't always keep up. Five claims circulate as common knowledge despite being wrong, outdated, or so heavily caveated that the simple version misleads.
This article is a list of five myths we hear from technical buyers and engineers who should know better. Some of them used to be true. Some never were. All of them deserve to die.
Myth 1: Reasoning models don't hallucinate
The thinking-model paradigm (o3, o4, DeepSeek-R1) is sometimes presented as a hallucination fix. The pitch: a model that reasons step-by-step before answering is necessarily more accurate.
The reality is more nuanced. Reasoning models hallucinate too; they just hallucinate differently. Their chains of thought sometimes invent an intermediate claim that doesn't follow from the previous step, then conclude confidently from the invented claim. The visible reasoning is comforting; it's not always correct.
What's true: reasoning models are dramatically better on hard math and structured logic problems, where the right answer can be checked step-by-step. They're not magically resistant to hallucination on factual questions. For factual workloads, RAG with grounded citations typically beats a raw reasoning model.
If you're building a knowledge product, don't pick a reasoning model assuming hallucination is solved. Pick the right tool (RAG for facts, reasoning for math and logic) and verify either way.
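A minimal sketch of that routing decision, with everything deliberately stubbed: classify_query is a crude keyword heuristic (a real router would use a trained classifier or a cheap LLM call), and the two answer functions stand in for real model and retrieval calls.

```python
def classify_query(query: str) -> str:
    """Crude keyword routing; a real router would be a trained
    classifier or a cheap LLM call."""
    math_markers = ("prove", "solve", "derive", "calculate", "how many")
    if any(m in query.lower() for m in math_markers):
        return "reasoning"
    return "factual"

def answer_with_reasoning(query: str) -> str:
    return f"[reasoning model] {query}"  # stand-in for a real API call

def answer_with_rag(query: str, corpus: list[str]) -> str:
    # Stand-in retrieval; in practice this is your RAG layer, and the
    # prompt tells the model to answer only from the cited passages.
    hits = [d for d in corpus if any(w in d.lower() for w in query.lower().split())]
    return f"[grounded answer from {len(hits)} passages]"

def answer(query: str, corpus: list[str]) -> str:
    if classify_query(query) == "reasoning":
        return answer_with_reasoning(query)
    return answer_with_rag(query, corpus)
```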
Myth 2: Open-weight models are always cheaper
Open-weight models like DeepSeek-V3, Llama 4, and Qwen 3 advertise dramatic per-token cost savings vs closed frontier. The arithmetic on a single API call looks decisive: $0.27/$1.10 per million tokens (input/output) for DeepSeek vs $1.25/$10 for GPT-5.
But "cheaper per token" isn't "cheaper to deploy." Three asterisks:
- Self-hosting open-weight models is expensive. A 671B MoE needs 8+ H100s. The hardware doesn't pay back unless you're running serious volume.
- Hosted open-weight via Together/Fireworks/OpenRouter has provider markups. The "cheap" pricing is usually on the model lab's own API, which has different compliance constraints than enterprise-friendly Western providers.
- Quality gaps cost engineer time. A 5% lower resolution rate on a coding agent translates to engineer review time that often exceeds the savings on inference.
For high-volume, narrow workloads, open-weight is genuinely cheaper. For typical mid-volume, general-purpose use, the cost difference shrinks once all costs are counted. Don't assume "open" = "cheap" without doing the full math (sketched below).
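Here's what "the full math" can look like as a back-of-envelope sketch. The per-token prices are the ones above; the monthly token volume, GPU rental rate, task count, review time, and engineer rate are all labeled assumptions, so substitute your own.

```python
# Back-of-envelope "full math" for Myth 2. Per-token prices are from the
# article; everything marked "assumption" is illustrative only.

M = 1_000_000
open_in, open_out = 0.27, 1.10      # $/M tokens (DeepSeek-V3, lab API)
closed_in, closed_out = 1.25, 10.0  # $/M tokens (GPT-5)

# Workload: assumption -- 200M input + 50M output tokens per month.
tok_in, tok_out = 200 * M, 50 * M

api_open = (tok_in / M) * open_in + (tok_out / M) * open_out
api_closed = (tok_in / M) * closed_in + (tok_out / M) * closed_out

# Self-hosting the 671B MoE: 8 H100s; assumption -- ~$2/GPU-hour rented,
# running 24/7 whether or not traffic fills it.
self_host = 8 * 2.0 * 24 * 30

# Quality gap: assumption -- a 5% lower resolution rate on 2,000 agent
# tasks/month, 15 extra review minutes per failure, $100/hr engineer.
review_cost = 2_000 * 0.05 * (15 / 60) * 100

print(f"open-weight via lab API : ${api_open:,.0f}/mo")   # ~$109
print(f"closed frontier API     : ${api_closed:,.0f}/mo") # ~$750
print(f"self-hosted GPU floor   : ${self_host:,.0f}/mo")  # ~$11,520
print(f"quality-gap review cost : ${review_cost:,.0f}/mo")# ~$2,500
```

On these assumptions, the per-token gap is about $640/month, while the GPU floor and the quality-gap review cost each run to thousands. The per-token savings is the smallest number on the page.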
Myth 3: Bigger context windows always help
The 1M-token context window is impressive. It's not always useful. We wrote a whole article on this: for most workloads, 128K-200K is plenty, and the cost of a 1M-token query is real.
The myth that survived through 2025: bigger is always better. The reality: context-window decisions should be made per workload. Most queries are well under 50K tokens, and the queries that genuinely benefit from a 1M-token context are a small fraction of typical traffic.
Pick context size by your actual usage distribution, not by the marketing leaderboard.
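One way to do that: take your query logs, look at the token-length distribution, and pick the smallest tier that covers, say, the 99th percentile. A sketch follows; the chars/4 token estimate and the tier list are rough assumptions, so swap in your real tokenizer and your provider's actual tiers.

```python
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4-chars-per-token estimate

def context_tier(queries: list[str], percentile: float = 0.99) -> int:
    """Smallest standard tier covering `percentile` of observed traffic."""
    sizes = sorted(rough_tokens(q) for q in queries)
    cutoff = sizes[min(len(sizes) - 1, int(percentile * len(sizes)))]
    for tier in (8_000, 32_000, 128_000, 200_000, 1_000_000):
        if tier >= cutoff:
            return tier
    return 1_000_000
```

Route the rare over-cutoff queries to a long-context model separately rather than paying long-context pricing on every call.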
Myth 4: GPT-X dominates everything
There's a residual narrative, visible especially in mainstream tech press, that OpenAI's GPT models are unambiguously the leaders, end of story. This was true through GPT-4 in 2023. It hasn't been clearly true since.
In 2026, the frontier is a three-horse race:
- GPT-5.5 leads on tool-use ergonomics, ecosystem maturity, and agent product polish (Cursor, Cline, and Operator are GPT-friendly first).
- Claude Opus 4.7 leads on writing, code review, and SWE-bench Verified.
- Gemini 3 Pro leads on long-context, vision, and document AI.
For any specific workload, the right answer is benchmarking on your own data, not picking the consensus "best" model. If your workload is RAG-heavy, Gemini wins. If it's coding-agent-shaped, Claude wins. If it's tool-use-heavy, GPT wins. The buyer who picks one model for "everything" ends up suboptimal at most of them.
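Benchmarking on your own data doesn't need a framework. A minimal harness, assuming you have labeled (prompt, expected) pairs: call_model is a placeholder for whatever API client you use, and exact-match scoring only suits tasks with one right answer, so use a task-appropriate grader for anything open-ended.

```python
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your API client here")

def score(output: str, expected: str) -> float:
    # Exact match; replace with a task-appropriate grader.
    return 1.0 if output.strip() == expected.strip() else 0.0

def benchmark(models: list[str], cases: list[tuple[str, str]]) -> dict[str, float]:
    """Average score per model over your own labeled cases."""
    return {
        m: sum(score(call_model(m, p), exp) for p, exp in cases) / len(cases)
        for m in models
    }
```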
Myth 5: Vector databases are essential for RAG
Pinecone, Weaviate, Chroma, Qdrant: vector databases are deeply baked into the modern RAG stack, so deeply that the default first question for "I want to do RAG" is "which vector DB?"
The honest answer in 2026: not every RAG workload needs a vector DB. Three alternatives:
- Keyword + small reranker. For corpora under 10K documents, BM25 + Cohere Rerank is often as accurate as embeddings + vector search and dramatically simpler to operate (see the sketch after this list).
- Long-context whole-document load. With Gemini 3 Pro's 1M window, you can load 10-20 full documents per query and let the model do the synthesis. No vector DB at all.
- In-memory embeddings. For corpora under 10K rows, an in-memory FAISS index in your application server is fine. You don't need Pinecone's cluster overhead.
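A sketch of the first alternative, using the open-source rank_bm25 package (pip install rank-bm25). The rerank step is left as a stub: plug in Cohere Rerank or any cross-encoder there.

```python
from rank_bm25 import BM25Okapi

corpus = ["your documents here", "one string per document"]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def retrieve(query: str, k: int = 20) -> list[str]:
    # BM25 keyword retrieval over the in-memory corpus.
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Stub: call your reranker of choice (e.g. Cohere Rerank or a
    # cross-encoder) and return the top_n candidates by relevance.
    return candidates[:top_n]

def search(query: str) -> list[str]:
    return rerank(query, retrieve(query))
```

The same shape, retrieve wide and rerank narrow, works if you later swap the BM25 step for an in-memory embedding index.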
Vector databases are genuinely the right answer for large-scale RAG (1M+ documents, multi-tenant deployments, sub-100ms retrieval). For everything else, simpler often beats fancier. Question the default.
A pattern: outdated heuristics survive their context
Each of these myths used to be true (or close to true) at some point and stuck around after the underlying reality changed. Reasoning models did hallucinate less when they first arrived (because they were better at math). Open-weight models were dramatically cheaper before frontier providers competed on price. Bigger context did help when the alternative was 4K. GPT-X did dominate before Claude 3 caught up. Vector DBs were essential when context windows were small.
The lesson: when a heuristic feels old, it probably is. Stress-test the conventional wisdom against the current state of the field every six months. The field is moving fast; the conversation needs to keep up.
How to test a heuristic
Three quick checks for any received wisdom in this space:
1. When was this last true?
If you can't trace the claim to a specific time period when it was unambiguously true, it might never have been clean. "Reasoning models don't hallucinate" never had a clean window.
2. What's the counter-evidence?
A heuristic that has plausible counter-examples is at best a useful approximation, at worst a myth. "Bigger context always helps" has counter-examples (cost, latency, multi-needle quality). The heuristic is partial.
3. Does the answer depend on workload?
Most "always" claims in AI tooling fall apart under workload-specific testing. A model that's "best" on average is probably worse than the workload-specific best on any specific use case. Generalize cautiously.
What we'd add to the list
We considered adding three more, but the list got long:
- "AI agents don't really work yet": they do for many workloads. Pick your workload carefully.
- "You need a frontier model for any serious work": increasingly false; mid-tier covers more than people credit.
- "Local models are still toys": six months ago this was true; today it's overstated. See "On-device LLMs are finally here."
The pattern holds: if you've been working in this field for two years and your priors haven't updated, your priors are wrong.
Further reading
- Why We Built LLMDex
A short story about how an internal model-tracking spreadsheet became a public site, and what we learned along the way.
- The Two Rules of Honest AI Data
Don't fabricate. Don't omit context. The full editorial standard behind LLMDex's data and how to apply it to your own work.
- How to Read an AI Benchmark: A Skeptical Reader's Guide
MMLU, HumanEval, SWE-bench, GPQA: what they actually measure, how providers game them, and how to think about benchmark numbers in 2026.