
How OpenAI Does Pricing: A Tour Through Five Years of Per-Token Economics

From $0.06 / 1K tokens on GPT-3 to $0.05 / 1M tokens on GPT-5 nano. The full pricing history, the architectural shifts behind the cuts, and what they tell us about 2026.

By LLMDex Editorial

In June 2020, OpenAI launched the GPT-3 API at $0.06 per 1,000 tokens for the largest model. In August 2025, GPT-5 nano launched at $0.05 per million input tokens. That's a 1,200-fold reduction in headline pricing across roughly five years, on output that is, by every meaningful benchmark, dramatically more capable. If you had tried to predict that trajectory in 2020, you'd have been called a permabull and probably wrong on the timing. It happened anyway.

This article walks the OpenAI pricing history end-to-end, identifies the architectural and competitive forces that drove each major cut, and uses the pattern to make some defensible predictions about where pricing goes through end of 2026.

The pricing history

OpenAI's API pricing has gone through four distinct regimes.

Phase 1: Premium positioning (Jun 2020 to Mar 2022). GPT-3 launched at four tiers: Ada ($0.0008 / 1K tokens), Babbage ($0.0012), Curie ($0.0060), and Davinci ($0.0600). Davinci was the production model most apps shipped with. Pricing was flat: input and output were billed at the same rate, and there was no caching. The competitive landscape was effectively empty; Cohere existed but was tiny, AI21 was research-only, and Google hadn't launched a competitive API. Pricing reflected monopoly economics.

Phase 2: ChatGPT-driven scaling (Nov 2022 to May 2023). ChatGPT launched in November 2022 and the GPT-3.5 family (text-davinci-003, then gpt-3.5-turbo) appeared shortly after. At $0.002 / 1K tokens, gpt-3.5-turbo was a 30x price cut over the original Davinci rate at equivalent quality, made possible by inference-stack improvements and a model that was far cheaper to serve. This was the cut that first made AI products economically viable for consumer apps.

Phase 3: GPT-4 era and the input/output split (Mar 2023 to Aug 2025). GPT-4 launched at $0.03 / 1K input and $0.06 / 1K output, the first time OpenAI split input and output pricing meaningfully. The reasoning was technical: output tokens are generated sequentially and require more compute per token than input tokens, which can be processed in parallel. GPT-4 Turbo (Nov 2023) cut to $0.01 / $0.03. GPT-4o (May 2024) cut further to $0.005 / $0.015. GPT-4o-mini (Jul 2024) hit $0.00015 / $0.00060, a price point that made aggressive routing-style use cases economically viable. By GPT-4o-mini's launch, 16 months after GPT-4's, OpenAI had cut headline GPT-4-class input pricing by ~99.5%.

Phase 4: Unified GPT-5 tier ladder (Aug 2025 to present). GPT-5 launched at three tiers: full GPT-5 ($1.25 / $10 per 1M tokens), mini ($0.25 / $2), and nano ($0.05 / $0.40). The structure consolidated five years of cuts into a clean ladder: nano replaces 2024's mini-class models as the cheap option, mini takes over GPT-4 Turbo's old role as the workhorse, and full GPT-5 undercuts GPT-4o's launch price ($5 / $15 per 1M) by 4x on input. GPT-5.5 in March 2026 ships at price parity with GPT-5, with quality improvements across the board.
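At these rates, per-request cost is simple arithmetic. A minimal sketch, using the GPT-5 tier prices quoted above (the 4,000-input / 500-output token split is an invented example, not a benchmark):

```python
# Per-million-token rates from the article's GPT-5 tier ladder (USD).
# These are the article's quoted figures, not a live price feed.
PRICES = {
    "gpt-5":      {"input": 1.25, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "gpt-5-nano": {"input": 0.05, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request at the listed per-1M-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical RAG-shaped request: 4,000 input tokens, 500 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 4_000, 500):.6f}")
```

The spread is the point: the same request costs 25x more on full GPT-5 ($0.01) than on nano ($0.0004).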

If you graphed effective price per quality-unit (proxied by, say, MMLU score per dollar of compute), the trajectory is a roughly 90% drop per year over the GPT-3 → GPT-5.5 era. That rate has held for five consecutive years.
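As a sanity check, the headline endpoints alone (before any quality adjustment) already imply a steep annualized decline; the ~90% figure above comes from also crediting the capability gains:

```python
# Implied annual decline from the article's headline endpoints:
# Davinci (Jun 2020) at $0.06/1K = $60 per 1M tokens;
# GPT-5 nano (2025) at $0.05 per 1M tokens; roughly five years apart.
start, end, years = 60.0, 0.05, 5.0
annual_factor = (end / start) ** (1 / years)  # fraction of prior year's price
print(f"headline price falls {1 - annual_factor:.0%} per year")
# prints "headline price falls 76% per year"
```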

What's actually driving the cuts

Three structural forces explain the trajectory.

Architectural improvements

Mixture-of-experts is the biggest. A frontier dense model with N parameters costs O(N) compute per inference token. A frontier MoE with N total parameters, of which only K are active per token (typically K is around 8% of N), costs O(K), roughly an order of magnitude less. GPT-4 was the inflection point: rumours had GPT-4 as MoE before the architecture became accepted wisdom, and every subsequent OpenAI frontier model is widely believed to be MoE. The shift moved frontier-class quality from "$30 per million tokens" economics to "$10 per million" economics in a single model generation.
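The dense-vs-MoE gap is easiest to see as per-token FLOPs. A toy comparison, where the parameter counts and the 8% active fraction are illustrative round numbers, not real model configurations:

```python
# Illustrative per-token compute for dense vs mixture-of-experts serving.
# Rule of thumb: a decoder forward pass costs ~2 FLOPs per ACTIVE parameter
# per generated token (matmul multiply + add).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_total = 1.8e12              # hypothetical 1.8T-parameter dense model
moe_total   = 1.8e12              # same total capacity, but sparse
moe_active  = 0.08 * moe_total    # ~8% of parameters active per token

ratio = flops_per_token(dense_total) / flops_per_token(moe_active)
print(f"dense: {flops_per_token(dense_total):.1e} FLOPs/token")
print(f"moe:   {flops_per_token(moe_active):.1e} FLOPs/token")
print(f"moe is {ratio:.1f}x cheaper per token at the same total capacity")
```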

Beyond MoE, the inference stack itself improved dramatically. Continuous batching (handling many concurrent requests on the same GPU efficiently), paged attention (reducing memory waste in the KV cache), speculative decoding (using a small draft model to predict tokens that the full model verifies), and FP8/INT8 quantization (running at cheaper precision with negligible quality loss) all compound. A 2026 inference stack on the same hardware delivers 4-6x the throughput of a 2023 stack.
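Those optimizations compound multiplicatively, which is how modest individual gains reach the 4-6x range. The per-technique multipliers below are hypothetical round numbers chosen for illustration, not measured figures:

```python
# Illustrative compounding of independent inference-stack speedups.
# Each factor is a made-up but plausible round number; the takeaway is
# that orthogonal optimizations multiply rather than add.
speedups = {
    "continuous batching":  2.0,
    "paged attention":      1.3,
    "speculative decoding": 1.6,
    "FP8 quantization":     1.4,
}
total = 1.0
for name, factor in speedups.items():
    total *= factor
print(f"combined throughput: {total:.1f}x")  # 2.0 * 1.3 * 1.6 * 1.4 ≈ 5.8x
```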

Distillation

OpenAI's small models, GPT-4o-mini, GPT-5 mini, GPT-5 nano, are not just smaller dense models. They're distilled from frontier teachers, meaning they learn to imitate the frontier model's outputs rather than starting from scratch. This is why mini-class models have quality that doesn't track linearly with their parameter count: they inherit the teacher's reasoning patterns at a fraction of the inference cost. Distillation is what makes the $0.05/1M tier possible without the model being useless.
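A minimal sketch of the idea behind distillation: the student is trained to match the teacher's full output distribution (via KL divergence) rather than one-hot labels, so it inherits how the teacher spreads probability mass. The toy vocabulary and probabilities below are invented for illustration:

```python
import math

# Distillation objective sketch: minimize KL(teacher || student), the
# penalty for the student's next-token distribution mismatching the
# teacher's. Toy 4-token vocabulary with invented probabilities.
def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """KL(teacher || student) in nats; zero when distributions match."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_probs = [0.70, 0.20, 0.08, 0.02]  # frontier model's distribution
student_probs = [0.55, 0.30, 0.10, 0.05]  # smaller model, pre-training

print(f"distillation loss: {kl_divergence(teacher_probs, student_probs):.4f}")
```

In a real training loop this loss is computed per token position over the whole vocabulary and backpropagated through the student only; the teacher is frozen.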

Competition

Anthropic, Google, and the open-weight ecosystem (DeepSeek, Llama, Qwen) all reached rough competitive parity through 2024-2025. With four credible providers selling comparable quality, prices compressed to comparable per-token rates. The fact that OpenAI, Anthropic, and Google all charge $1.25-$3.00 / 1M input tokens for their flagships is not coincidence; it's market equilibrium. DeepSeek-V3 sells comparable quality for $0.27 / 1M because it has lower costs and a different revenue model, and even DeepSeek's pricing pulled OpenAI's downward.

What gets billed beyond the per-token rate

Three line items most users miss.

Cached input tokens are billed at 50% of the normal rate when the prompt prefix matches a recent request. Most chatbots and RAG pipelines have stable system prompts, so cached pricing on those should be the default. Caching only applies when the stable content actually forms the prefix, though: you have to structure prompts with the fixed parts first and the variable parts last to take advantage. The savings can be 30-40% on the right workload.
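That savings figure follows from a little arithmetic. A hedged sketch using the 50% cached-rate discount quoted above; the token counts and hit rate are invented examples:

```python
# Effective per-request input cost with prompt caching, assuming cached
# prefix tokens bill at 50% of the normal input rate (the article's figure).
def effective_input_cost(rate_per_1m: float, prefix_tokens: int,
                         variable_tokens: int, cache_hit_rate: float) -> float:
    """Average USD per request; cache hits bill the prefix at half price."""
    cached   = prefix_tokens * cache_hit_rate * 0.5           # hits, half price
    uncached = prefix_tokens * (1 - cache_hit_rate) + variable_tokens
    return (cached + uncached) * rate_per_1m / 1_000_000

# Hypothetical RAG app: 3,000-token stable system prompt, 1,000 tokens of
# variable user content, 90% cache hit rate, GPT-5's $1.25/1M input rate.
cost  = effective_input_cost(1.25, 3_000, 1_000, 0.90)
naive = (3_000 + 1_000) * 1.25 / 1_000_000
print(f"cached ${cost:.6f} vs uncached ${naive:.6f} per request")
```

At these numbers the blended input bill is about 34% lower, squarely inside the 30-40% range.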

Reasoning tokens are an OpenAI-specific concept introduced with the o-series. The model produces internal reasoning text before the visible answer, and you pay for those tokens at the standard output rate. A query that returns a 300-token visible answer might bill 5,000 reasoning tokens. This is invisible until your bill comes in. If your application uses reasoning models, instrument your code to track reasoning-token spend separately.
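Separating that spend is a small amount of code. A sketch: the function takes token counts you would pull from the API response's usage payload (the exact field names vary by SDK and endpoint, so treat the wiring as an assumption to verify against your SDK):

```python
# Split output spend into visible-answer vs hidden-reasoning components.
# Both kinds of tokens bill at the standard output rate.
def output_spend(visible_tokens: int, reasoning_tokens: int,
                 output_rate_per_1m: float) -> dict:
    """Per-request output cost breakdown at a $/1M-token output rate."""
    rate = output_rate_per_1m / 1_000_000
    return {
        "visible_usd":   visible_tokens * rate,
        "reasoning_usd": reasoning_tokens * rate,
        "hidden_share":  reasoning_tokens / (visible_tokens + reasoning_tokens),
    }

# The article's example: 300 visible tokens, 5,000 reasoning tokens,
# at a $10/1M output rate.
spend = output_spend(300, 5_000, 10.0)
print(spend)  # reasoning is ~94% of this request's output spend
```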

Realtime audio tokens bill differently. The Realtime API tokenizes audio and charges audio tokens at their own, much higher per-token rates, which OpenAI also publishes as approximate per-minute equivalents. The math at typical voice-agent volumes is in the same ballpark as text-token economics, but if you're modelling costs for a voice product, use the audio pricing tables, not the text ones.

What this predicts for 2026-2027

Two predictions with high confidence and one with low.

High confidence: per-token prices keep falling. The structural drivers (MoE, distillation, competition) all still apply. Expect GPT-5-mini-class quality at GPT-5-nano prices by mid-2027. Expect the cheapest viable production tier to be sub-$0.02 / 1M output tokens by end of 2026 (GPT-5 nano is at $0.40 today, so this is a 20x cut, which is actually slightly slower than the historical pace).

High confidence: caching becomes default-on. The pricing complexity of cached vs uncached tokens is a UX wart. Expect OpenAI (and probably others) to auto-cache and bill the cached rate when applicable, hiding the complexity from users. This is a 1-2 year improvement, not a 5-year one.

Low confidence: free-tier inference becomes broadly available. Models cheap enough to give away exist. Whether the major providers expand free tiers to capture distribution depends on whether the strategic value of free distribution exceeds the direct revenue cost. OpenAI's history suggests they will (ChatGPT free tier was a strategic masterstroke). Anthropic's history suggests they won't (they've kept consumer free tiers limited). Google could go either way.

How to actually use this

If you're budgeting for AI inference in 2026, three rules:

Budget against last quarter's pricing. You'll be under, every time. Pricing falls faster than budgets adjust. If you're scoping a six-month project, assume per-token costs will be 30% lower at the end than at the start, and plan for the headroom to ship features rather than hoard the savings.

Default to mid-tier. GPT-5 mini is 80% of GPT-5's quality at 20% of the cost. For most workloads, that's the right starting point. Reach for full GPT-5 only when an eval shows you need the extra capability. Reach for GPT-5 nano only when cost is the binding constraint and quality is genuinely flexible.

Track input vs output ratio. Workloads that are 95% input (RAG, summarization) cost dramatically less than the headline rate suggests. Workloads that are 95% output (long-form generation, agent loops) cost dramatically more. Budget against your actual ratio, not against the worst case.
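The ratio effect is easy to quantify: the blended per-token rate for a workload is just a weighted average of the input and output rates. A sketch at the GPT-5 prices quoted earlier:

```python
# Blended $/1M-token rate as a function of the workload's input share,
# at the article's GPT-5 rates ($1.25 input, $10 output per 1M tokens).
def blended_rate(input_share: float, in_rate: float = 1.25,
                 out_rate: float = 10.0) -> float:
    """Average $/1M tokens when `input_share` of tokens are input tokens."""
    return input_share * in_rate + (1 - input_share) * out_rate

print(f"95% input (RAG, summarization): ${blended_rate(0.95):.2f}/1M")
print(f"95% output (agent loops):       ${blended_rate(0.05):.2f}/1M")
```

At these rates the two extremes differ by nearly 6x ($1.69 vs $9.56 per 1M tokens), which is why budgeting against the headline rate alone misleads in both directions.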

The deeper takeaway

OpenAI's pricing trajectory is the most legible example of a broader pattern: frontier inference economics are still improving by roughly 90% per year, and the market structure is consolidating into a small number of credible providers competing on per-token price. The teams that win are the ones treating LLM pricing as a variable to optimize, not a fixed cost: running multi-provider, swapping models per workload, and re-evaluating quarterly.

If you're paying GPT-4-era prices for GPT-4-era quality in 2026, you're paying several times what you should. The savings are immediate; the engineering work is small.
