
LLM Pricing Doesn't Make Sense (and Maybe Shouldn't)

The economics of frontier inference are weirder than they look. A primer on tokens, MoEs, and why prices keep dropping by 90% per year.

By LLMDex Editorial

LLM pricing in 2026 is a strange market. The same task that cost $30 in API fees on GPT-4 in mid-2023 costs around $1 on equivalent-quality models today, and the cheapest tier, GPT-5 nano at $0.40 per million output tokens, would have been free-tier-only economics three years ago.

If you're trying to plan an annual AI infrastructure budget, this is whiplash. If you're trying to understand why prices keep dropping, this is the article. We'll walk through the cost structure of inference, the architectural shifts that drove the price collapse, and what to expect through the rest of 2026.

What you actually pay for

When you're billed $10 per million tokens on GPT-5, that single figure bundles a chain of underlying costs:

  1. Compute (GPU-hours): the raw cost of running the model on H100/H200 hardware. Frontier models need 8+ GPUs for a single inference replica.
  2. Memory: bigger context windows mean bigger KV caches, which demand more VRAM capacity and memory bandwidth, both expensive and constrained.
  3. Operational overhead: engineering, ops, security, datacenter space, electricity. This is where economies of scale matter.
  4. R&D amortization: the model cost hundreds of millions of dollars to train, and that cost is spread across every billed token it ever serves.
  5. Profit margin: real at frontier prices, but smaller than you'd think.

Closed-frontier providers (OpenAI, Anthropic, Google) bundle all five into a single per-token price. Open-weight providers (Together, Fireworks, OpenRouter) pay nothing for #4, since the weights are free, and price competitively on #1-3 plus a thin margin.
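To see why raw compute (#1) puts a hard floor under per-token prices, here's a back-of-the-envelope sketch. The GPU rental rate is an assumption, not any provider's actual figure; the throughput number matches the per-H100 figure cited later in this piece.

```python
# Back-of-the-envelope floor for cost item #1 (compute).
# The rental rate is an assumption; everything here is illustrative.

GPU_HOURLY_RATE = 2.50            # assumed H100 rental cost, $ per GPU-hour
TOKENS_PER_SEC_PER_GPU = 4_000    # aggregate batched throughput per GPU

def compute_floor_per_million(hourly_rate: float, tok_per_sec: float) -> float:
    """Raw GPU cost to generate one million tokens, ignoring items #2-#5."""
    tokens_per_gpu_hour = tok_per_sec * 3600
    return hourly_rate / tokens_per_gpu_hour * 1_000_000

floor = compute_floor_per_million(GPU_HOURLY_RATE, TOKENS_PER_SEC_PER_GPU)
print(f"Compute-only floor: ${floor:.2f} per 1M tokens")
# ~$0.17/1M under these assumptions; everything billed above that floor
# covers memory, ops, R&D amortization, and margin.
```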

Why prices keep dropping

Three structural shifts:

1. MoE architectures

Mixture-of-experts models activate only a subset of their parameters per token. DeepSeek-V3 has 671B parameters but activates only 37B per token. From the user's perspective the model is huge; from the provider's perspective the inference cost is closer to that of a 37B model.

Every major lab has shipped MoE architectures since 2024. The economics of frontier inference are now closer to those of mid-tier inference, which is why pricing across tiers has compressed dramatically.
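A rough way to see the effect: per-token compute scales with active parameters, not total parameters. The sketch below uses the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; the comparison to a dense 671B model is hypothetical and illustrative, not a provider's actual accounting.

```python
# Rough per-token compute for dense vs. MoE, using the ~2 FLOPs per
# active parameter per token rule of thumb. Illustrative only.

def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_671b = flops_per_token(671e9)       # hypothetical dense model of that size
moe_deepseek_v3 = flops_per_token(37e9)   # DeepSeek-V3: 671B total, 37B active

print(f"Dense 671B:       {dense_671b:.2e} FLOPs/token")
print(f"MoE (37B active): {moe_deepseek_v3:.2e} FLOPs/token")
print(f"Compute ratio:    {moe_deepseek_v3 / dense_671b:.1%}")   # ~5.5%
```

One caveat: all 671B parameters still have to sit in GPU memory so tokens can be routed to any expert, which is why MoE shrinks the compute bill much more than the memory bill.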

2. Quantization and serving improvements

vLLM, SGLang, and TensorRT-LLM all shipped major throughput improvements in 2024-2025. Mid-2023 inference stacks managed maybe 1,000 tokens/sec per H100 on a frontier model. 2026 stacks routinely hit 4,000+ on the same hardware via continuous batching, paged attention, and INT8/FP8 quantization.

Throughput-per-GPU is the lever that compounds: doubling throughput halves cost.
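A quick illustration of that lever, reusing the assumed $2.50/GPU-hour rate from the earlier sketch:

```python
# The throughput lever: same GPU, same assumed hourly price, different stack.

GPU_HOURLY_RATE = 2.50  # $ per GPU-hour, assumed

def gpu_cost_per_million(tok_per_sec: float) -> float:
    return GPU_HOURLY_RATE / (tok_per_sec * 3600) * 1_000_000

print(f"2023-era stack (1,000 tok/s): ${gpu_cost_per_million(1_000):.3f} per 1M tokens")
print(f"2026-era stack (4,000 tok/s): ${gpu_cost_per_million(4_000):.3f} per 1M tokens")
# 4x throughput on the same hardware means one quarter the GPU cost per token.
```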

3. Competition

OpenAI's near-monopoly through 2023 sustained higher prices. By 2026, four labs (OpenAI, Anthropic, Google, Meta) ship top-tier closed and open models, plus DeepSeek and several Chinese labs at the frontier. Customers can switch cheaply (especially via OpenRouter's universal API), so providers price competitively.

The result: input pricing on flagship models compressed from $30/1M (GPT-4 launch) to roughly $2-5/1M today.

Why the cheap tier is so cheap

Sub-$0.05/1M-input-token tiers (GPT-5 nano, Gemini 3 Flash, Amazon Nova Micro) are made possible by three more shifts:

  • Model distillation. A small model trained to imitate a frontier teacher can capture much of the quality at a fraction of the size.
  • Token economics. Routing-style use cases ("classify this into one of five categories") need maybe 30 output tokens. At $0.40/1M, that's $0.000012 per call; turning prices like that into real revenue requires aggressive volume (see the sketch after this list).
  • Bundled cloud economics. Amazon Nova Micro is cheap partly because AWS bundles inference into broader cloud spend and is willing to operate at razor margins.
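Here's the volume math, as a sketch. The $0.40/1M output price is the one quoted above; the input price, prompt size, and call volume are assumptions for illustration.

```python
# What "aggressive volume" looks like at nano-tier prices.

INPUT_PRICE_PER_M = 0.05    # $/1M input tokens, assumed
OUTPUT_PRICE_PER_M = 0.40   # $/1M output tokens, as quoted above
INPUT_TOKENS = 200          # assumed prompt size for a classification call
OUTPUT_TOKENS = 30          # "classify this into one of five categories"

def cost_per_call() -> float:
    return (INPUT_TOKENS * INPUT_PRICE_PER_M
            + OUTPUT_TOKENS * OUTPUT_PRICE_PER_M) / 1_000_000

CALLS_PER_DAY = 10_000_000  # assumed high-volume routing layer
print(f"Per call:  ${cost_per_call():.8f}")                        # $0.00002200
print(f"Per day:   ${cost_per_call() * CALLS_PER_DAY:,.2f}")       # $220.00
print(f"Per month: ${cost_per_call() * CALLS_PER_DAY * 30:,.2f}")  # $6,600.00
```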

If you're building a high-volume routing layer, the cheap tier is genuinely usable in 2026. We track the leaders at Cheapest LLMs.

Why pricing is still confusing

Three things make LLM pricing harder than other infrastructure pricing:

1. Input vs. output asymmetry

Output tokens are 4-10× more expensive than input tokens on most models. This is because output requires sequential generation (each token depends on the previous one), while input can be processed in parallel.

Practical implication: input-heavy workloads (RAG, summarization) are dramatically cheaper than output-heavy workloads (long-form generation, agent loops). Same model, completely different cost profile.
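A minimal sketch of the asymmetry, assuming $3/1M input and $15/1M output (placeholder prices, not any specific provider's) and illustrative token counts:

```python
# Same model, same assumed prices, very different cost per call depending
# on the input/output split.

INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # $ per 1M tokens, assumed

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

rag_call = call_cost(input_tokens=10_000, output_tokens=300)     # RAG / summarization
agent_call = call_cost(input_tokens=2_000, output_tokens=8_000)  # agent loop / long-form

print(f"Input-heavy call:  ${rag_call:.4f}")    # ~$0.0345, dominated by input
print(f"Output-heavy call: ${agent_call:.4f}")  # ~$0.1260, dominated by output
```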

2. Cached input pricing

OpenAI, Anthropic, and Google all offer cached-input pricing (typically 50% off) when the same prefix is reused within a time window. For chatbots with long system prompts or RAG pipelines with stable contexts, this can cut input costs nearly in half. Most users don't take advantage of it because it's not on by default.
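A rough sense of what caching is worth for a prompt-heavy chatbot, using the 50%-off figure above; the prices, token counts, and call volume are assumptions, and the sketch ignores the initial cache-miss call and any cache-write surcharge.

```python
# Effect of cached-input pricing on a chatbot with a large, stable system prompt.

INPUT_PRICE = 3.00      # $/1M input tokens, assumed
CACHE_DISCOUNT = 0.50   # "typically 50% off"
SYSTEM_PROMPT = 4_000   # stable prefix, cache-eligible on repeat calls
USER_TURN = 500         # fresh tokens per call

def input_cost(calls: int, cached: bool) -> float:
    prefix_price = INPUT_PRICE * (1 - CACHE_DISCOUNT) if cached else INPUT_PRICE
    per_call = (SYSTEM_PROMPT * prefix_price + USER_TURN * INPUT_PRICE) / 1_000_000
    return per_call * calls

print(f"1M calls, no caching:   ${input_cost(1_000_000, cached=False):,.0f}")  # $13,500
print(f"1M calls, with caching: ${input_cost(1_000_000, cached=True):,.0f}")   # $7,500
```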

3. Reasoning tokens

Reasoning models (o3, o4, DeepSeek-R1) bill for the reasoning tokens you don't see. A query that returns a 200-token answer might bill 5,000 reasoning tokens. Per-query cost on reasoning models is unpredictable in a way that breaks budgeting.
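A sketch of why this breaks budgeting, with an assumed output price (reasoning tokens are generally billed at the output-token rate):

```python
# You pay for tokens you never see, and their count varies per query.

OUTPUT_PRICE = 15.00  # $/1M output tokens, assumed

def billed_cost(visible_tokens: int, reasoning_tokens: int) -> float:
    return (visible_tokens + reasoning_tokens) * OUTPUT_PRICE / 1_000_000

print(f"Naive estimate (200 visible tokens): ${billed_cost(200, 0):.4f}")      # $0.0030
print(f"Actual bill (200 + 5,000 reasoning): ${billed_cost(200, 5_000):.4f}")  # $0.0780
# The same prompt might burn 500 reasoning tokens today and 20,000 tomorrow.
```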

What to expect through 2026

Three predictions, with confidence levels:

High confidence: per-token prices keep falling

Prices have dropped roughly 90% per year for two years running. Most of the structural drivers (MoE, throughput, competition) are still in play. Expect at least another 50% drop in flagship pricing through end-2026, and more on mid-tier.

Medium confidence: cached-input becomes default

The pricing complexity of cached vs. uncached input is a UX wart. Expect at least one major provider to make caching invisible to the user, billing the cached rate automatically whenever a prefix qualifies.

Low confidence: free tiers expand

Models cheap enough to give away are arriving. Some providers will expand free tiers to capture distribution; others will not. The split depends on whether a provider values distribution more than direct revenue, which is a strategic question, not a technical one.

How to budget

Three rules of thumb for AI infrastructure budgeting in 2026:

  1. Budget for last quarter's prices. You'll come in under, every time. Pricing falls faster than budgets adjust.
  2. Default to mid-tier. GPT-5 mini, Claude Sonnet 4.6, Gemini 3 Flash. They're 10-20% the cost of flagships and 80% of the quality. Reach for flagship only when an eval shows you need it.
  3. Pay attention to input/output ratio. A workload that's 95% input tokens (RAG, summarization) costs dramatically less than the headline price suggests. A workload that's 95% output tokens (long-form generation) costs dramatically more.
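For rule #3, a quick helper that turns a workload's token mix into an effective blended price; the prices here are placeholder assumptions.

```python
# Effective blended price for a workload, given its input/output token mix.

def blended_price(input_price: float, output_price: float, input_share: float) -> float:
    """$ per 1M total tokens, weighted by the workload's token mix."""
    return input_price * input_share + output_price * (1 - input_share)

# Hypothetical flagship at $3/1M input, $15/1M output:
print(f"95% input (RAG):        ${blended_price(3, 15, 0.95):.2f} per 1M tokens")  # $3.60
print(f"95% output (long-form): ${blended_price(3, 15, 0.05):.2f} per 1M tokens")  # $14.40
```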

The deeper takeaway

LLM pricing isn't a static thing you can pin down. It's a market in flux, with the underlying technology (MoE, distillation, throughput) moving faster than pricing pages do. The right strategy is to architect for portability: stay multi-provider, stay per-token-aware, set budget alarms that track actual spend rather than projections, and re-evaluate quarterly.

The teams that win are the ones treating LLM pricing as a variable to optimize, not a fixed cost. Prices are still falling. Plan accordingly.
