The LLMDex blog
Hand-written essays and analyses on the AI tooling landscape, model launches, benchmark drama, pricing economics, and things we've learned shipping LLMs to production.
AI Safety in Production: A Builder's Checklist
Prompt injection, data leakage, hallucination, and the operational practices that keep AI products from blowing up in your face.
Apr 30, 2026
Are Reasoning Models Worth the Cost?
o3, o4, DeepSeek-R1, GPT-5 thinking. They're slower and 5-20x more expensive per query. When does the quality bump pay back?
Apr 30, 2026
Building a Code-Review Bot in 2026: Architecture, Models, Pitfalls
A working playbook for shipping an AI code-review bot that engineers actually want. Models, prompts, latency, false-positive control, and the integration patterns that work.
Apr 30, 2026
Building a Research Agent That Actually Researches
Deep Research, Perplexity Pro Search, and the homebrew alternatives. What architecture works in 2026, what models to use, and the pitfalls that produce confident-sounding nonsense.
Apr 30, 2026
The Customer Support AI Playbook: Architecture, Models, KPIs
What actually works for AI customer support in 2026: triage routing, RAG over your knowledge base, escalation patterns, model picks, and the metrics that matter.
Apr 30, 2026
Google's Gemini Gambit: How Long Context Became a Strategic Moat
Why Google bet the Gemini line on 1M-2M token context windows when no other lab thought it mattered, and how that bet is reshaping RAG architectures across the industry.
Apr 30, 2026
GPT-5.5 Deep Dive: Everything OpenAI's Mid-Cycle Refresh Brings
Pricing, benchmarks, agent performance, and migration notes for OpenAI's GPT-5.5. What changed from GPT-5, what didn't, and when you should upgrade.
Apr 30, 2026
GPT-5 mini vs Claude Haiku 4: Which Mid-Tier Wins in 2026?
The two models that define the production sweet spot. We benchmarked, priced, and stress-tested both. Verdict by workload.
Apr 30, 2026
How OpenAI Does Pricing: A Tour Through Five Years of Per-Token Economics
From $0.06 / 1K tokens on GPT-3 to $0.05 / 1M tokens on GPT-5 nano. The full pricing history, the architectural shifts behind the cuts, and what they tell us about 2026.
Apr 30, 2026
How to Read an AI Benchmark: A Skeptical Reader's Guide
MMLU, HumanEval, SWE-bench, GPQA: what they actually measure, how providers game them, and how to think about benchmark numbers in 2026.
Apr 30, 2026
How We'd Hire an AI Engineer in 2026
What 'AI engineer' actually means, what to test for in interviews, what to pay, and the red flags that distinguish real engineers from prompt-tinkerers.
Apr 30, 2026
Inside Anthropic: How Claude Got Built and Where It's Going
A working engineer's read on Anthropic's research priorities, business model, and product roadmap. Why Claude wins on writing and code, and what it means for the rest of 2026.
Apr 30, 2026
Meta's Open-Weight Strategy: How Llama Reshaped the Frontier
Meta gave away frontier-quality model weights for two years straight. We unpack the strategic logic, what Llama 4 actually changed, and what's next for the open-weight ecosystem.
Apr 30, 2026
Mistral's European Angle: Why It Still Matters in 2026
Mistral has been the European AI lab nobody expects to keep up, and yet it does. We unpack the technical, regulatory, and commercial reasons it remains relevant against bigger US labs.
Apr 30, 2026
Production RAG Over a Million Documents: Architecture That Actually Works
What changes when your corpus is 1M+ documents instead of 1K. Embedding choices, retrieval strategy, infrastructure cost, and the corner cases that bite you at scale.
Apr 30, 2026
Self-Hosting a 70B Model on a Single H100: A 2026 Playbook
Yes, you can serve Llama 4 70B on one H100 at production speed. Quantization, serving stack, throughput tuning, and the operational realities.
Apr 30, 2026
The State of Open-Source LLM Tooling in 2026
What's actually production-ready vs research-grade across the open-weight serving, training, fine-tuning, and observability stack.
Apr 30, 2026
The Complete History of GPT: From GPT-1 to GPT-5.5 (2018–2026)
How OpenAI's GPT line evolved from a 117M-parameter research model in 2018 to GPT-5.5: pricing, parameters, capabilities, and the moments that mattered.
Apr 30, 2026
The Real Economics of Self-Hosting LLMs in 2026
When self-hosting beats commercial APIs on cost, when it doesn't, and the operational realities most teams underweight.
Apr 30, 2026
Voice Agents That Don't Feel Slow: A 2026 Architecture
Sub-800ms end-to-end voice agents are achievable in 2026. STT, LLM routing, TTS, latency budgets, and the architectural moves that make voice feel natural.
Apr 30, 2026
What 'AI-Native' Actually Means in 2026
Every SaaS company claims to be AI-native. Most aren't. Here's how to tell: for hiring, for product strategy, for buying decisions.
Apr 30, 2026
What GPT-6 Will Probably Look Like (and When)
OpenAI hasn't announced GPT-6 yet. Based on the patterns from GPT-3 → 4 → 5, here's what to expect: capabilities, timing, pricing, and what it means for builders.
Apr 30, 2026
What We Got Wrong Building LLMDex (and What We'd Do Differently)
An honest postmortem from 18 months of building a programmatic SEO site for AI tools. The architectural mistakes, the editorial misjudgments, and what we'd do differently.
Apr 30, 2026
When AI Agents Replace SaaS
Half of the SaaS layer is replaceable by an LLM with the right tools. Which half, and what the timeline looks like.
Apr 30, 2026
Why DeepSeek's MoE Architecture Matters More Than You Think
DeepSeek-V3 is 671B parameters that activate 37B per token. We unpack what that means for inference cost, training economics, and why every Western lab quietly switched architectures in 2024.
Apr 30, 2026
Why Most LLM Latency Optimizations Don't Work
P50 fast, P99 awful. Why the standard latency-optimization advice fails for production AI products, and what actually moves the needle.
Apr 30, 2026
xAI: Real-Time Data as a Moat
Grok's most distinctive feature isn't model quality. It's the X firehose. We unpack what that means competitively, and where xAI sits in the broader AI landscape in 2026.
Apr 30, 2026
MCP Servers Actually Changed Things
Six months in, the Model Context Protocol is what every editor agent uses. Here's what's working, what isn't, and how to write your own.
Apr 29, 2026
The State of Coding Agents in 2026
Cursor, Cline, Aider, Claude Code, GitHub Copilot Agent: six months of dogfooding, side-by-side. What works, what doesn't, what's next.
Apr 22, 2026
Red-Teaming as a Day Job
What it looks like to run adversarial-prompt evaluations professionally, and why it pays well.
Apr 19, 2026
The Real Cost of Running a Coding Agent in Production
We instrumented a real codebase agent for a quarter. Here's what each model actually costs, and why per-token rates lie.
Apr 15, 2026
Choosing an Eval Framework in 2026
Inspect, OpenAI Evals, LangSmith, Ragas. Pick correctly the first time. A working engineer's comparison.
Apr 8, 2026
How LLMDex Tracks Benchmark Honesty
The internal process for sourcing, verifying, and de-listing benchmark numbers, and why we leave fields blank.
Apr 1, 2026
The AI Tool Procurement Checklist
If you're a CTO buying an AI assistant for the company, here's the list to run before you sign. License, security, data handling, vendor stability.
Mar 26, 2026
Why Gemini 3 Pro Changed Our RAG Stack
2M-token context shifts the architecture. We rebuilt the pipeline. Here's what we kept, what we threw away, and what we learned.
Mar 25, 2026
DeepSeek-V3 Is Actually Good (and Cheap)
We avoided open-weight frontier models for two years. DeepSeek-V3 ended that. A blunt evaluation of what V3 does, where it loses, and when to pick it.
Mar 18, 2026
The Five LLM Myths That Won't Die
Reasoning models hallucinate too. Open-weight is not always cheaper. And three more myths the AI Twitter consensus needs to retire.
Mar 11, 2026
Should You Fine-Tune or Just Prompt?
A decision framework, with cost numbers, for when fine-tuning beats clever prompting in 2026.
Mar 4, 2026
LLM Pricing Doesn't Make Sense (and Maybe Shouldn't)
The economics of frontier inference are weirder than they look. A primer on tokens, MoEs, and why prices keep dropping by 90% per year.
Feb 26, 2026
Prompt Libraries Are Overrated
Curated prompts age fast. Here's a more durable pattern for building production-grade prompts that survive model upgrades.
Feb 19, 2026
The Quiet Rise of Tool-Use Benchmarks
BFCL, t-bench, Tau-bench. Why function-calling evals matter more than MMLU now, and how to read them.
Feb 12, 2026
The Context Window Arms Race Is Over
2M tokens is enough. The frontier moved to needle-in-a-haystack retrieval and reasoning over the haystack. Why bigger context isn't the next big thing.
Feb 4, 2026
On-Device LLMs Are Finally Here
Phi-4, Llama-4-8B, Qwen-2.5-7B running on a MacBook. What works, what doesn't, what's next for local inference.
Jan 28, 2026
Beehiiv vs Substack for an AI Newsletter
We migrated. Here's what each platform got right and where they fell short for technical content. Why we ended up where we did.
Jan 21, 2026
Why We Built LLMDex
A short story about how an internal model-tracking spreadsheet became a public site, and what we learned along the way.
Jan 14, 2026
The Two Rules of Honest AI Data
Don't fabricate. Don't omit context. The full editorial standard behind LLMDex's data and how to apply it to your own work.
Jan 7, 2026
The week's AI launches, in your inbox.
One short email every Friday: new models, leaks, and quietly-shipped APIs you missed.