
The Real Economics of Self-Hosting LLMs in 2026

When self-hosting beats commercial APIs on cost, when it doesn't, and the operational realities most teams underweight.

By LLMDex Editorial

Self-hosting an LLM looks dramatically cheaper than commercial APIs on the back of an envelope. Open-weight models are free; commodity GPUs are 30-50% the cost of equivalent cloud rental over 18 months; a well-tuned vLLM stack delivers thousands of tokens per second. A team that's spending $50K/month on OpenAI API calls reads those numbers and concludes that self-hosting will save 70% of that bill.

Sometimes it does. Often it doesn't, because the back-of-envelope math leaves out costs that turn out to dominate. This article lays out the actual cost model, including the parts everyone forgets to count.

The visible costs

Three line items that everyone includes in their model:

Hardware. A single H100 80GB runs ~$30K-40K in capex, depending on configuration. Eight-GPU servers (H100 SXM, NVL configurations) range from $250K-400K. Operational lifetime is 4-5 years, so amortized capex is roughly $7K-10K per H100-year.

Power and colocation. An 8-GPU H100 rack draws ~10kW. At commercial colo rates ($150-300/kW/month), that's $1,500-3,000/month per rack. Add 20-30% for networking and connectivity, and you land at roughly $25K/year all-in colo per rack.

Inference engineer time. Setting up vLLM, tuning throughput, monitoring, and on-call adds up. A senior infrastructure engineer who can keep a multi-GPU LLM serving stack healthy is roughly $200K-300K fully loaded. Even at 25% of one engineer's time on this work, that's $50K-75K/year.

Add these up for a single 8-GPU rack serving a 70B model: roughly $130K-180K/year all-in. That serves maybe 100-200 concurrent users at production-grade quality.
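
That arithmetic as a quick sanity check. A minimal sketch in Python, using the article's illustrative mid-range figures; the function name and defaults are ours, not measured prices:

    # Visible annual cost of one 8-GPU H100 rack serving a 70B model.
    # All inputs are illustrative figures from the text above.
    def rack_annual_cost(capex=300_000, lifetime_years=4,
                         colo_per_year=25_000, engineer_loaded=250_000,
                         engineer_fraction=0.25):
        amortized_capex = capex / lifetime_years            # ~$75K/year
        engineering = engineer_loaded * engineer_fraction   # ~$62.5K/year
        return amortized_capex + colo_per_year + engineering

    print(f"${rack_annual_cost():,.0f}/year")  # ~$162,500, inside the $130K-180K range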

The invisible costs

Three line items that most teams underweight or omit:

Failover capacity

A single rack is a single point of failure. A serious deployment needs at least N+1 redundancy: two racks, with the second taking traffic if the first fails. Failover capacity is 100% overhead on the hardware cost. Suddenly your $130K/year is $260K/year.

For mission-critical deployments, you actually want N+2 (two failover racks) so you can take one down for maintenance without losing redundancy. That's another 50% overhead.

This is the line item that surprises teams most. Single-machine inference is fine for prototypes and IP-sensitive workloads where downtime is acceptable. Production deployments need dramatically more capacity than the back-of-envelope math suggests.
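
The redundancy math is simple multiplication, but it compounds faster than teams expect. A sketch using the low end of the single-rack figure above:

    # All-in annual cost as failover racks are added (illustrative).
    base = 130_000  # single-rack all-in, low end of the range above
    for racks, label in [(1, "N, no failover"), (2, "N+1"), (3, "N+2")]:
        print(f"{label}: ${base * racks:,}/year")
    # N, no failover: $130,000/year
    # N+1: $260,000/year
    # N+2: $390,000/year (the third rack is 50% overhead on the N+1 figure)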

Model updates and re-evaluation

Every 6-12 months, a new generation of open-weight model ships. Llama 3.3 → Llama 4 → Llama 4.1, etc. Staying on a current generation requires:

  • Re-validating quality on your evals
  • Re-benchmarking throughput on your hardware
  • Quantizing the new weights
  • Updating the serving stack
  • Possibly tuning prompts that worked on the old model

This is roughly two engineer-weeks per generation. Over a 4-year hardware lifetime that's four to eight upgrade cycles, or 8-16 weeks of engineer time. If you fall behind, your self-hosted model is meaningfully worse than the commercial alternative, eroding the original justification for self-hosting.

Throughput inefficiency

Commercial APIs achieve high throughput by aggregating traffic from many customers. Single-customer self-hosted deployments don't get this: your traffic is bursty, so the GPU is idle most of the time at low usage and queue-bottlenecked at peak.

Realistic GPU utilization on dedicated single-customer deployments runs 30-50%. That means 50-70% of your capex+opex is paying for capacity you're not using. The commercial API's multi-tenant architecture is inherently more efficient.
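
Utilization feeds directly into effective per-token cost. A rough sketch, assuming the N+1 figure above and the ~5K tokens/sec rack throughput used in the worked model later; real numbers vary with model size and batch shape:

    # Effective $ per 1M tokens as utilization falls (illustrative).
    annual_cost = 260_000                    # N+1 rack pair, from above
    tokens_per_year = 5_000 * 86_400 * 365   # one serving rack at 5K tokens/sec

    for utilization in (1.0, 0.5, 0.3):
        cost_per_m = annual_cost / (tokens_per_year * utilization) * 1e6
        print(f"{utilization:.0%} utilization: ${cost_per_m:.2f} per 1M tokens")
    # 100%: ~$1.65/1M; 50%: ~$3.30/1M; 30%: ~$5.50/1M. At low utilization,
    # self-hosted per-token cost approaches commercial per-token pricing.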

Compliance and operational overhead

If you're self-hosting because of compliance requirements (EU data residency, healthcare PHI, defense-related work), you also inherit the compliance costs the commercial API provider would have absorbed. SOC 2 audits, ISO 27001, security patching cadence, vulnerability scanning: these add real engineer time and external audit fees. Call it $30K-100K/year for a serious compliance program.

Where self-hosting actually wins

Four workload profiles where the math genuinely favors self-hosting:

Very high volume. Above ~50M tokens/day sustained, the per-token math shifts in favor of self-hosting even with all the invisible costs counted; a break-even sketch follows this list. The fixed costs of self-hosting amortize across enough volume that the per-token cost drops below commercial rates.

Compliance-binding. When you literally cannot send data to a commercial API, the comparison isn't "self-hosted vs commercial." It's "self-hosted vs nothing." The cost is whatever it is, and you pay it.

IP-sensitive and code-shaped. Some teams (especially in defense, security research, and certain enterprises) won't send code to commercial APIs. Self-hosting is the only viable AI-coding-assistant path. The cost is justified by enabling the work, not by direct savings.

Predictable, sustained load. A workload that runs at a predictable level 24/7 (rather than spiky human-driven traffic) gets better GPU utilization on self-hosted, narrowing the throughput-inefficiency gap with commercial APIs.
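
The ~50M tokens/day threshold isn't magic; it falls out of dividing fixed annual cost by a blended commercial rate. A rough sketch under the article's figures (the 80/20 input/output split comes from the worked model below):

    # Rough break-even for a lean single-rack deployment vs commercial pricing.
    lean_annual = 130_000                      # single rack, all-in, no failover
    blended_rate_per_m = 0.8 * 3 + 0.2 * 15    # $5.40 per 1M tokens

    break_even = lean_annual / 365 / blended_rate_per_m * 1e6
    print(f"Break-even: ~{break_even / 1e6:.0f}M tokens/day")  # ~66M tokens/day
    # The exact threshold is sensitive to your all-in cost, redundancy level,
    # and the commercial rate you'd otherwise pay; ~50M/day is an optimistic floor.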

Where self-hosting loses

Five workload profiles where commercial APIs win on real cost:

Spiky low-medium volume. Most production workloads. The GPU sits idle 60%+ of the time; capex is paying for capacity you're not using.

Need for frontier-class quality. No open-weight model matches GPT-5.5 or Claude Opus 4.7 on the hardest workloads. If you genuinely need frontier quality, you're paying for the commercial API regardless.

Multimodal-heavy. Open-weight vision and voice models exist but lag the closed frontier. For multimodal-shaped products, the quality gap is large enough that the cost savings don't compensate.

Small teams. Below ~5 engineers, the operational tax of running a self-hosted serving stack costs more than the API savings. Stay on commercial.

Rapid iteration. Early-stage products where the model choice changes every few weeks. The setup cost of self-hosting doesn't amortize across enough usage to pay back before you switch models again.

The hybrid pattern

Most teams that self-host successfully don't self-host everything. The pattern that works:

Self-host the high-volume base case. A 70B-class model running on dedicated infrastructure for 80% of queries. Cheap per token, predictable.

Route to commercial API for hard cases. When the self-hosted model's confidence is low or the query is in a category where the open-weight model is weak, fall back to a commercial API.

Route to commercial flagship for premium tier. Free-tier users get self-hosted; paid users get GPT-5.5 or Claude Opus.

This hybrid pattern captures most of the cost savings of self-hosting while preserving access to frontier capability when it matters. Implementation is straightforward: a routing layer in front of both endpoints, sketched below.
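
A minimal sketch of that routing layer in Python. The endpoint URLs, category list, and tier logic are hypothetical placeholders; a real implementation would also translate between the two APIs' request schemas and use the self-hosted model's logprobs as a confidence signal:

    import httpx

    SELF_HOSTED = "http://llm.internal:8000/v1/chat/completions"  # vLLM, hypothetical host
    COMMERCIAL = "https://api.example.com/v1/chat"                # frontier API, placeholder

    WEAK_CATEGORIES = {"legal", "multistep-code"}  # where the open model is weak (example)

    def route(category: str, premium: bool) -> str:
        # Premium users and known-hard categories go to the frontier model.
        if premium or category in WEAK_CATEGORIES:
            return COMMERCIAL
        return SELF_HOSTED

    def complete(query: str, category: str = "general", premium: bool = False) -> dict:
        url = route(category, premium)
        resp = httpx.post(url, json={"messages": [{"role": "user", "content": query}]})
        resp.raise_for_status()
        return resp.json()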

A worked cost model

Concrete example. A SaaS product with 200K daily active users, average 50 LLM calls per active user per day, average 2K tokens per call. Total: 20B tokens/day.

Pure commercial (Claude Sonnet 4.6 at ~$3/$15 per 1M):

  • Input (~80% of tokens): 16B × $3 per 1M = $48,000/day ≈ $1.4M/month
  • Output (~20%): 4B × $15 per 1M = $60,000/day ≈ $1.8M/month
  • Total: ~$3.2M/month.

Pure self-hosted (Llama 4 70B AWQ on 8-GPU H100 racks):

  • Throughput per rack: ~5K tokens/sec sustained = 432M tokens/day
  • Need ~50 racks for 20B tokens/day with N+1 failover
  • Capex amortized: 50 × $250K / 4 years = $3.1M/year
  • Colo + power: 50 × $25K = $1.25M/year
  • Engineering: 2 senior engineers full-time = $600K/year
  • Total: ~$5M/year = $420K/month.

So self-hosting at this scale is ~$420K/month vs ~$3.2M/month commercial. Real savings of ~85%. But quality is lower (Llama 4 70B is good but not Sonnet 4.6), and operational complexity is significantly higher.

Hybrid (self-hosted for 80% of queries, Sonnet 4.6 for the hard 20%):

  • Self-hosted infrastructure for 80%: ~$340K/month
  • Commercial API for 20%: ~$640K/month
  • Total: ~$980K/month, with frontier quality on hard queries.

The hybrid pattern is the right answer at this scale. 70% cost savings vs pure commercial; frontier quality preserved on the queries that matter.
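
The whole worked model fits in a few lines, which makes it easy to swap in your own volumes and prices. A sketch reproducing the arithmetic above; the +4 rack margin stands in for the failover headroom that takes ~47 serving racks to ~50:

    # Reproduces the worked cost model. All inputs are the article's figures.
    TOKENS_PER_DAY = 20e9
    IN_RATE, OUT_RATE = 3.0, 15.0          # $ per 1M tokens
    RACK_TOKENS_PER_DAY = 5_000 * 86_400   # ~432M tokens/day sustained

    def commercial_monthly(tokens_per_day):
        daily = (0.8 * tokens_per_day * IN_RATE + 0.2 * tokens_per_day * OUT_RATE) / 1e6
        return daily * 30

    def self_hosted_monthly(tokens_per_day):
        racks = round(tokens_per_day / RACK_TOKENS_PER_DAY) + 4   # failover/headroom margin
        annual = racks * 250_000 / 4 + racks * 25_000 + 600_000   # capex + colo + engineers
        return annual / 12

    print(f"Commercial:  ${commercial_monthly(TOKENS_PER_DAY)/1e6:.1f}M/mo")   # $3.2M
    print(f"Self-hosted: ${self_hosted_monthly(TOKENS_PER_DAY)/1e3:.0f}K/mo")  # ~$415K (the text rounds to ~$420K)
    hybrid = self_hosted_monthly(0.8 * TOKENS_PER_DAY) + commercial_monthly(0.2 * TOKENS_PER_DAY)
    print(f"Hybrid:      ${hybrid/1e3:.0f}K/mo")  # ~$997K, close to the ~$980K above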

Concrete recommendation

If you're considering self-hosting, three rules (codified in a sketch after the list):

  1. Don't self-host below 5M tokens/day. The fixed costs don't amortize. Stay commercial.
  2. For 5-50M tokens/day with compliance binding, self-host with the hybrid pattern. Self-hosted for the base case, commercial for hard queries.
  3. Above 50M tokens/day, the math reliably favors self-hosting, but treat it as a real infrastructure operation, not a side project for your platform team.
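
The three rules collapse to a small decision function. A sketch using the article's thresholds; the return labels and the compliance flag are ours:

    # Deployment recommendation by daily token volume (article's thresholds).
    def deployment(tokens_per_day: float, compliance_bound: bool = False) -> str:
        if tokens_per_day < 5e6:
            return "self-hosted (no commercial option)" if compliance_bound else "commercial"
        if tokens_per_day < 50e6:
            return "hybrid: self-hosted base, commercial for hard queries" \
                if compliance_bound else "commercial"
        return "self-hosted, run as a real infrastructure operation"

    print(deployment(2e6))          # commercial
    print(deployment(20e6, True))   # hybrid: self-hosted base, commercial for hard queries
    print(deployment(100e6))        # self-hosted, run as a real infrastructure operation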

The math at scale is real. The math at small scale is misleading. Match the architecture to the actual volume.
