
The Real Cost of Running a Coding Agent in Production

We instrumented a real codebase agent for a quarter. Here's what each model actually costs, and why per-token rates lie.

By LLMDex Editorial

We've been running an internal coding agent on our own repo for one quarter: six engineers, around two thousand pull requests, and a budget approval at the start that we deliberately set too low to see what would break first. This is the unpolished cost report. The numbers below are real, the surprises are real, and most of them won't show up in any model's pricing page.

If you're a CTO or staff engineer evaluating Cursor, Cline, Claude Code, or a homebrew agent for production use, the question isn't "what does the model cost per million tokens?" It's "what does a resolved ticket cost, end-to-end?" Those are very different numbers.

The setup

We ran Cline connected to four model backends across the quarter: Claude Opus 4.7, Claude Sonnet 4.6, GPT-5, and DeepSeek-V3. All four ran under the same human-in-the-loop policy: the agent proposes a diff, an engineer reviews it, the agent applies it. We tracked tokens per ticket, tickets resolved without escalation, and engineer time saved.

Average ticket sizes ranged from "fix this off-by-one" (20 lines) to "migrate this service to TypeScript strict mode" (600 lines). We did not include greenfield work, only existing-codebase tickets that a human engineer would otherwise have done.

The headline number

Average cost per resolved ticket, by model:

  • Claude Opus 4.7: $2.40 median ($0.85 to $11.20 interquartile range)
  • GPT-5: $1.85 median
  • Claude Sonnet 4.6: $0.95 median
  • DeepSeek-V3 (via Together): $0.30 median

That's the median across resolved tickets. Tickets that didn't resolve (the agent gave up, or produced a diff the human rejected) cost roughly half as much in tokens, but consumed engineer time we could measure separately. Resolution rates: Opus 4.7 at 78%, GPT-5 at 71%, Sonnet 4.6 at 64%, DeepSeek-V3 at 52%.

The cost-per-resolved-ticket ranking flips when you account for resolution rate:

  • Opus 4.7: $2.40 / 0.78 = $3.08 effective
  • GPT-5: $1.85 / 0.71 = $2.61 effective
  • Sonnet 4.6: $0.95 / 0.64 = $1.48 effective
  • DeepSeek-V3: $0.30 / 0.52 = $0.58 effective

DeepSeek wins on cost per resolved ticket by a factor of five against Opus; the arithmetic is sketched below. But that framing is misleading; keep reading.
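For the record, here's that arithmetic as a minimal Python sketch, with the medians and resolution rates copied from the tables above:

    MEDIAN_COST = {"Opus 4.7": 2.40, "GPT-5": 1.85, "Sonnet 4.6": 0.95, "DeepSeek-V3": 0.30}
    RESOLUTION_RATE = {"Opus 4.7": 0.78, "GPT-5": 0.71, "Sonnet 4.6": 0.64, "DeepSeek-V3": 0.52}

    for model, median in MEDIAN_COST.items():
        # Dividing by the resolution rate charges each resolved ticket for the
        # attempts that failed (a simplification: failed runs cost about half).
        print(f"{model}: ${median / RESOLUTION_RATE[model]:.2f} effective")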

The hidden costs

Token costs aren't the whole picture. We tracked three more line items:

1. Human review time

Sonnet 4.6 and DeepSeek-V3 produced diffs that needed noticeably more review than Opus or GPT-5 did. The median engineer time per Opus ticket was 4 minutes; per DeepSeek-V3 ticket it was 11 minutes. At a fully loaded engineering rate of $150/hour, that's $10 of engineer time per Opus ticket versus $27.50 per DeepSeek-V3 ticket. Now the numbers look very different.

Cost-per-resolved-ticket including review:

  • Opus 4.7: $13.08
  • GPT-5: $14.36
  • Sonnet 4.6: $26.48
  • DeepSeek-V3: $28.08

Suddenly the cheap option isn't cheap.
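As a sketch of how the review line item folds in, assuming the flat $150/hour ($2.50/minute) loaded rate from above; the GPT-5 and Sonnet review times combine the same way:

    RATE_PER_MINUTE = 150 / 60  # $150/hour fully loaded

    def cost_with_review(effective_token_cost: float, review_minutes: float) -> float:
        """Effective token cost plus median engineer review time per ticket."""
        return effective_token_cost + review_minutes * RATE_PER_MINUTE

    print(cost_with_review(3.08, 4))    # Opus 4.7    -> 13.08
    print(cost_with_review(0.58, 11))   # DeepSeek-V3 -> 28.08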

2. Bad diffs that ship

A bad diff that ships is the worst-case ticket. We had three on Sonnet 4.6 and four on DeepSeek-V3 over the quarter: bugs that made it to production and required a follow-up fix. We had zero on Opus 4.7 and one on GPT-5. The cost of a single production bug is hard to pin down, but the engineering time to triage, fix, and ship a hotfix averaged 90 minutes. At $150/hour, that's $225 per incident.

Spread over a thousand tickets per model, that's:

  • Opus 4.7: 0/1000 × $225 = $0
  • GPT-5: 1/1000 × $225 = $0.23
  • Sonnet 4.6: 3/1000 × $225 = $0.68
  • DeepSeek-V3: 4/1000 × $225 = $0.90

Small per-ticket but real, and the variance is high.
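The amortization is one line of arithmetic, but it's the line item teams forget to model, so here it is spelled out:

    HOTFIX_COST = 1.5 * 150  # 90 minutes of triage-fix-ship at $150/hour

    def shipped_bug_cost(incidents: int, tickets: int = 1000) -> float:
        """Per-ticket cost of bugs that made it to production."""
        return incidents / tickets * HOTFIX_COST

    print(shipped_bug_cost(4))  # DeepSeek-V3 -> 0.9, i.e. $0.90 per ticket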

3. Provider downtime and rate limits

This one's invisible in pricing pages. Anthropic had two notable incidents in the quarter where our agent backend couldn't reach the API for hours. OpenAI had one. DeepSeek (via Together) had several short rate-limit windows. Cumulative hours of degraded operation:

  • Anthropic: ~6 hours
  • OpenAI: ~3 hours
  • Together: ~9 hours

A six-engineer team paying for a coding agent that's down for six hours loses up to six engineers × six hours × $150/hour = $5,400 per incident in agent-assisted productivity, even when the work wasn't fully blocked; engineers fall back to manual work, and the context-switching isn't free. Provider reliability is part of price.
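Downtime is also the one line item you can engineer around. Here's a minimal sketch of a provider-fallback wrapper, assuming OpenAI-compatible endpoints (OpenRouter exposes one); the endpoints, keys, and model names are illustrative placeholders, not our production config:

    from openai import OpenAI, APIConnectionError, RateLimitError

    PROVIDERS = [
        # (base_url, api_key, model) -- placeholders, substitute your own
        ("https://api.openai.com/v1", "PRIMARY_KEY", "gpt-5"),
        ("https://openrouter.ai/api/v1", "FALLBACK_KEY", "deepseek/deepseek-chat"),
    ]

    def complete_with_fallback(messages: list):
        last_error = None
        for base_url, api_key, model in PROVIDERS:
            try:
                client = OpenAI(base_url=base_url, api_key=api_key)
                return client.chat.completions.create(model=model, messages=messages)
            except (APIConnectionError, RateLimitError) as err:
                last_error = err  # provider degraded; try the next one
        raise last_error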

The honest verdict

Per-token pricing is a lie when applied to coding agents. The cost-per-resolved-ticket ranking flips depending on what hidden costs you include. Here's our quarter-end conclusion, ordered by what we'd actually buy:

  1. GPT-5 or Claude Opus 4.7 for production codebase agents on our most-touched services. Cost per resolved ticket including review is essentially identical between the two, and resolution rate matters more than a dollar or so per ticket on critical work.
  2. Claude Sonnet 4.6 as a fallback and cost-control layer, for low-stakes maintenance tickets and bulk migrations where a quick human review is enough to catch the misses.
  3. DeepSeek-V3 for personal projects and cost-bound experiments. Not for production. The 52% resolution rate plus higher review time means the budget you "save" on tokens gets spent on review minutes and engineer fatigue.

Implications for your stack

If you're picking a single model for a production coding agent, don't optimize for token cost. Optimize for:

  1. Resolution rate on your codebase. Run a 100-ticket eval on real PRs before committing; a harness skeleton follows this list.
  2. Review time per ticket. A model that produces tighter diffs is worth paying for.
  3. Provider reliability. Multi-provider fallback (via OpenRouter, for example) is often worth the slight quality penalty.
  4. Token cost. Last, distantly.
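A skeleton for that 100-ticket eval, as a sketch: run_agent and review are callables you wire up to your own stack (Cline, your ticket source, your review flow); nothing here is a real Cline API.

    # Replay real closed PRs as tickets and measure the two numbers that
    # dominate cost: resolution rate and review time per ticket.
    import time
    from typing import Callable

    def evaluate(model: str, tickets: list, run_agent: Callable, review: Callable) -> dict:
        resolved = 0
        review_minutes = []
        for ticket in tickets:
            diff = run_agent(model, ticket)     # agent proposes a diff
            start = time.monotonic()
            accepted = review(diff, ticket)     # engineer accepts or rejects
            review_minutes.append((time.monotonic() - start) / 60)
            resolved += bool(accepted)
        review_minutes.sort()
        return {
            "resolution_rate": resolved / len(tickets),
            "median_review_minutes": review_minutes[len(review_minutes) // 2],
        }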

The coding-agent market is one place where the cheapest model is genuinely not the right answer for most production teams in 2026. The frontier models cost more per million tokens but produce diffs that need less engineer time to verify, and that's the line item that dominates the equation.
