
Should You Fine-Tune or Just Prompt?

A decision framework, with cost numbers, for when fine-tuning beats clever prompting in 2026.

By LLMDex Editorial

The "should I fine-tune" question has changed three times in the LLM era. In 2022, fine-tuning was the only way to get domain-specific behavior; prompting was too unreliable. By 2024, prompting and RAG had improved enough that fine-tuning was widely deemed obsolete for most applications. In 2026, with cheap LoRA on small frontier-quality base models, fine-tuning is making a quiet comeback. This article is a decision framework for which option to pick today.

The three options

You have three ways to specialize a model's behavior:

  1. Better prompting. System prompts, few-shot examples, chain-of-thought guidance. Free, instant, the obvious first step.
  2. RAG (retrieval-augmented generation). External knowledge stored in a vector database, retrieved at inference time. Cheap, fast to update, doesn't change the model.
  3. Fine-tuning. Modifying the model's weights via LoRA, QLoRA, or full fine-tune. Slower to iterate, harder to update, but changes the model's "default" behavior.

Most teams start with #1, escalate to #2 when knowledge is the bottleneck, and reach for #3 only when behavior, not knowledge, needs to change.

When prompting alone is enough

If your problem is one of these, don't fine-tune. Don't even RAG. Prompt engineering will get you 95% of the way:

  • Format / structure. Output JSON, XML, a specific shape. Prompt + structured-output mode handles this.
  • Tone. Brand voice, formal vs casual, terse vs verbose. Few-shot examples + clear system prompt.
  • Standard reasoning. Code review, math, writing tasks where the model already knows the domain.
  • Single-turn classification. Routing user queries to one of N categories.

The cost is zero, the iteration cycle is seconds, and modern models are remarkably steerable. We'd default to prompting alone for any new project that doesn't fail one of the criteria below.
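In practice, a few-shot tone prompt is just a structured message list. A minimal sketch in the chat-message format most provider APIs accept (the brand, system prompt, and example exchanges are all invented for illustration):

```python
# Build a few-shot chat prompt for tone control. The list alternates
# user/assistant example pairs after the system message, ending with
# the real query.
def build_messages(user_query):
    system = (
        "You are the support assistant for Acme Co. "  # hypothetical brand
        "Answer in two sentences, friendly but never apologetic."
    )
    few_shot = [
        ("Where is my order?",
         "Your order is on its way! Track it anytime from your account page."),
        ("Can I get a refund?",
         "Absolutely, refunds take 3-5 business days. Start one from your order history."),
    ]
    messages = [{"role": "system", "content": system}]
    for question, answer in few_shot:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages

msgs = build_messages("Do you ship to Canada?")
```

Iterating on this list is the seconds-long feedback loop the section describes: edit the examples, rerun, measure.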

When you should reach for RAG

Three signals that prompting isn't enough but fine-tuning is overkill:

1. Your knowledge is updating constantly

Customer support docs change weekly. Product data changes daily. A model fine-tuned on this kind of data is out of date the moment training finishes. RAG with a vector database lets you update the source corpus instantly.

2. You need citations

If your application requires the model to cite a specific document (as legal, medical, and financial workflows often do), RAG gives you a clean source-of-truth chain. Fine-tuning doesn't.

3. The corpus is large

If you have 10K+ documents, you can't fit them all in even the largest context window. RAG is the only architecture that scales.

For most "AI assistant on top of our internal docs" products in 2026, RAG is the right answer. The pipeline is well-understood, the tooling is mature (Pinecone, Weaviate, Chroma), and modern long-context models like Gemini 3 Pro make the synthesis step easy.
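The retrieval step can be sketched without any vector database at all. Here, bag-of-words cosine similarity stands in for a real embedding model; a production pipeline would swap in an embedding API and one of the vector stores named above. The documents and query are toy examples:

```python
import math
from collections import Counter

# Toy retrieval step of a RAG pipeline. Bag-of-words counts stand in
# for real embeddings; the ranking logic is the same either way.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Refunds are processed within 5 business days.",
    "We ship to the US, Canada, and the EU.",
    "Support hours are 9am to 5pm Pacific.",
]
top = retrieve("which countries do you ship to", docs)
```

The retrieved passages are then pasted into the prompt for the synthesis step; that step is where the long-context models earn their keep.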

When fine-tuning genuinely wins

Fine-tuning beats prompt-and-RAG on three workloads:

1. Style and tone you can't prompt

A specific writer's voice, a domain-specific register (legal English, medical English), or a structured output format that the base model resists. With 1,000 high-quality examples, a LoRA fine-tune captures the style in a way few-shot examples can't.
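Those 1,000 examples are typically serialized as a JSONL file of chat transcripts. A minimal sketch using the widely used `{"messages": [...]}` convention (exact field names vary by provider, and the example content here is invented):

```python
import json

# One chat transcript per JSONL line: the standard shape for a
# style fine-tune dataset. Field names follow the common
# {"messages": [...]} convention; check your provider's docs.
examples = [
    {"messages": [
        {"role": "system", "content": "Write in the house style."},
        {"role": "user", "content": "Summarize: Q3 revenue rose 12%."},
        {"role": "assistant",
         "content": "Revenue climbed twelve percent in the third quarter, a solid beat."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The assistant turns are where the style lives: they should all be written (or curated) in the target voice, since that is exactly what the fine-tune will imitate.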

2. Reasoning patterns

Specialized reasoning chains (say, the specific way your senior analyst approaches a financial diligence question) can be taught via fine-tuning more reliably than via prompting. The model internalizes the pattern and applies it consistently.

3. Cost reduction at scale

A fine-tuned 8B model that performs 90% as well as GPT-5 on your specific task may be 20× cheaper to serve at high volume. For high-throughput, narrow-purpose workloads (classification, extraction, simple generation), fine-tuning a small model is genuinely the right call.
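The 20× figure is back-of-envelope arithmetic. A sketch with illustrative per-million-token prices (assumed for the example, not published rates) at a high-volume workload:

```python
# Serving-cost comparison at volume. Both prices below are
# illustrative assumptions, not published rates.
FRONTIER_PRICE = 10.00   # $ per million tokens, frontier model (assumed)
SMALL_FT_PRICE = 0.50    # $ per million tokens, fine-tuned 8B (assumed)
TOKENS_PER_MONTH = 5_000_000_000  # 5B tokens/month

frontier_cost = TOKENS_PER_MONTH / 1e6 * FRONTIER_PRICE  # $50,000/month
small_cost = TOKENS_PER_MONTH / 1e6 * SMALL_FT_PRICE     # $2,500/month
ratio = frontier_cost / small_cost                        # 20x
```

At that spread, even a multi-thousand-dollar fine-tuning project pays for itself in the first month; at low volume, the same arithmetic says don't bother.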

What it costs in 2026

Three real-world fine-tuning price points:

  • OpenAI hosted fine-tuning on GPT-5 mini: roughly $25 per million training tokens, plus a 1.5× multiplier on inference. For a 50K-example dataset (~10M tokens), that's ~$250 to train.
  • Self-hosted LoRA on Llama-4-70B with Together's fine-tuning API: roughly $4-8 per million training tokens. Same dataset, ~$40-80 to train.
  • Self-hosted on your own GPUs (Axolotl, Unsloth): hardware cost varies. For a serious team with existing GPU capacity, dollar-cost is negligible; engineer-time is the real cost.

The cost of iterating matters more than the cost of one training run. Plan for 5-10 training runs over a few weeks before you have a model worth deploying.
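Putting the two together, the budget worth planning for is per-run cost times expected runs. A sketch using the hosted pricing and dataset size from the list above:

```python
# Training-budget estimator that accounts for the iteration loop,
# not just a single run. Prices follow the figures quoted above.
def training_budget(dataset_tokens, price_per_m_tokens, runs):
    return dataset_tokens / 1e6 * price_per_m_tokens * runs

# One run on a ~10M-token dataset at $25/M (hosted GPT-5 mini rate):
one_run = training_budget(10_000_000, 25.0, runs=1)       # $250
# Plan for 5-10 runs before deploying; 8 is a reasonable midpoint:
full_project = training_budget(10_000_000, 25.0, runs=8)  # $2,000
```

The same function with the $4-8/M self-hosted LoRA rate lands the full project in the low hundreds of dollars, which is why engineer-time, not compute, usually dominates.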

The decision framework

Here's the flowchart we actually use:

  1. Try better prompting first. Spend a day. Measure on a held-out set.
  2. If quality is still bad, ask: is this a knowledge problem? If yes, RAG. If no, continue.
  3. Is the issue style/tone/format the model resists? If yes, fine-tune. If no, continue.
  4. Is per-token cost a binding constraint at expected volume? If yes, fine-tune a smaller model. If no, stop.
  5. You're done. Use prompting + maybe RAG.
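The flowchart reduces to a short function. The booleans correspond to the questions in steps 2-4; step 1 (try prompting and measure on a held-out set) is what produces the first input:

```python
# The five-step decision flowchart as code. Each input answers one
# question from the framework above, in order.
def recommend(prompting_good_enough, knowledge_problem,
              resists_style, cost_constrained):
    if prompting_good_enough:
        return "prompt-only"
    if knowledge_problem:
        return "prompt + RAG"
    if resists_style:
        return "fine-tune"
    if cost_constrained:
        return "fine-tune a smaller model"
    return "prompt-only (maybe + RAG)"
```

Note the ordering does real work: a knowledge problem short-circuits to RAG before fine-tuning is ever considered, which is how most projects stay out of the fine-tuning bucket.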

In our experience over multiple production deployments, this framework keeps roughly 80% of projects in the "prompt-only" or "prompt + RAG" buckets. Fine-tuning is real and useful, but it's a heavier tool than its 2022 reputation suggests.

What to fine-tune on

If you've decided to fine-tune, the base model matters:

  • For style and tone, frontier mid-tier (GPT-5 mini, Claude Sonnet) via hosted APIs. Faster iteration, less ops.
  • For cost reduction, open-weight 7-13B (Llama-4-8B, Qwen-2.5-7B, Phi-4). Self-host on commodity hardware.
  • For reasoning, DeepSeek-R1 or one of the open reasoning bases. The reasoning post-training is the value.
  • For code, Qwen-2.5-Coder-32B or Codestral-2 if license allows. They're already code-specialized.

Browse the Best LLMs for fine-tuning ranking for the full list.

Common mistakes

Three pitfalls we've seen burn teams:

  1. Tiny datasets. Fine-tuning on 50 examples does almost nothing. You need 500-5,000 high-quality examples for most tasks.
  2. Bad data quality. Garbage in, garbage out. A small high-quality dataset beats a large low-quality one every time.
  3. Skipping the eval. Fine-tuning without a held-out eval set is gambling. Build the eval first, fine-tune second.
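"Build the eval first" is mostly discipline: freeze a held-out slice of your labeled examples before any training run, and score every candidate model against the same slice. A minimal sketch with toy data and a placeholder model function:

```python
import random

# Hold out a fixed evaluation slice before fine-tuning. The fixed
# seed guarantees the same split on every run, so scores across
# training runs are comparable.
def split_holdout(examples, holdout_frac=0.2, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # (train, eval)

def accuracy(model_fn, eval_set):
    correct = sum(model_fn(x) == y for x, y in eval_set)
    return correct / len(eval_set)

# Toy labeled data and a placeholder "model" for illustration.
data = [(f"query {i}", i % 3) for i in range(100)]
train, eval_set = split_holdout(data)
baseline = accuracy(lambda x: 0, eval_set)  # score the trivial baseline first
```

Scoring a trivial baseline before the first fine-tune gives you the number every training run has to beat.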

The honest answer for most readers

If you're a small team or solo developer asking this question for the first time: start with prompting. Add RAG when you have a real knowledge problem. Reach for fine-tuning only when you've exhausted both and you have enough data + budget + patience to iterate.

If you're a larger team with a high-volume narrow-purpose workload: fine-tuning a small model is genuinely the right call in 2026. The economics work and the tooling is mature.

The middle case, medium-volume general-purpose work, almost always lands at prompting plus RAG. That's where the bulk of production AI lives.
