When to Fine-Tune vs. Prompt vs. Tools
Technology & Software • ~8–9 min read • Updated Apr 25, 2025
Context
Teams often default to the shiniest lever—fine-tuning—when prompting or tool use would deliver faster, cheaper, and safer results. This essay offers a simple, testable framework you can run in a week to decide whether your next capability should lean on prompting, tools/RAG, or fine-tuning.
Core Framework
Base the decision on five product realities. If you answer “Yes” to a question, prefer the option after the arrow.
- Does required knowledge change weekly? → Tools/RAG (externalize facts; update sources, not weights).
- Do you need deterministic steps, integrations, or calculators? → Tools (functions, APIs, workflow engines).
- Is behavior mostly formatting or tone? → Prompting (system/role prompts, exemplars, output schemas).
- Do you need style/voice or domain behaviors that are stable? → Light fine-tune (adapters/LoRA on instruction-following data).
- Do you have 1k–20k high-quality examples for a narrow task? → Task fine-tune (supervised signals beat prompt hacks).
Decision Tree (90 seconds)
- Step 1: If required knowledge changes more often than monthly or your sources are private → start with RAG + Prompting.
- Step 2: Add Tools when you need actions: lookups, calculations, CRUD, or compliance checks.
- Step 3: Consider Fine-tuning only when behavior is stable, labeled data is available, and you need lower latency/cost or stronger guardrail adherence than prompting delivers.
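As a rough illustration, the three steps (together with the five questions above) can be encoded as a small triage helper. The field names and recommendation labels below are hypothetical rather than a standard API; treat this as a sketch of the logic under those assumptions, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class CapabilityProfile:
    """Answers to the triage questions for one candidate capability (hypothetical fields)."""
    knowledge_changes_monthly_or_faster: bool
    uses_private_sources: bool
    needs_actions: bool              # lookups, calculations, CRUD, compliance checks
    behavior_is_stable: bool
    labeled_examples: int            # count of high-quality task examples
    needs_lower_latency_or_cost: bool

def recommend_stack(p: CapabilityProfile) -> list[str]:
    """Walk the 90-second decision tree and return the recommended layers, cheapest first."""
    stack = ["prompting"]  # prompting is always the baseline

    # Step 1: externalize fast-changing or private knowledge instead of baking it into weights.
    if p.knowledge_changes_monthly_or_faster or p.uses_private_sources:
        stack.append("rag")

    # Step 2: add tools when the model must act, not just answer.
    if p.needs_actions:
        stack.append("tools")

    # Step 3: fine-tune only when behavior is stable, labeled data exists, and there is a
    # concrete latency/cost or guardrail reason that prompting alone does not meet.
    if p.behavior_is_stable and p.labeled_examples >= 1000 and p.needs_lower_latency_or_cost:
        stack.append("fine-tune (adapters)")

    return stack

print(recommend_stack(CapabilityProfile(
    knowledge_changes_monthly_or_faster=True,
    uses_private_sources=False,
    needs_actions=True,
    behavior_is_stable=False,
    labeled_examples=0,
    needs_lower_latency_or_cost=False,
)))  # -> ['prompting', 'rag', 'tools']
```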
Design Notes
- Prompting: Use structured outputs (JSON schemas), few-shot exemplars, and content policies. Version prompts alongside evaluation sets.
- Tools/RAG: Ground claims with citations & source IDs; log retrieval stats (recall@k, coverage, answerability; a sketch of these metrics follows this list). Keep your tool layer idempotent and auditable.
- Fine-tuning: Start with adapters (LoRA/IA3) and small targets. Label a golden set first; tune after you plateau on prompting and tools.
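A minimal sketch of the retrieval stats named above, assuming you log retrieved document IDs per query and keep a small set of judged-relevant IDs per golden query. Exact definitions of coverage vary by team, and answerability (whether the retrieved context is enough to answer) usually needs a human or model judgment, so it is omitted here.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of judged-relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def coverage(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Did retrieval surface at least one relevant source for this query? (1.0 or 0.0)"""
    return 1.0 if set(retrieved_ids) & relevant_ids else 0.0

# Example log record for one golden-set query (hypothetical data).
query_log = {
    "query": "What is our refund window for annual plans?",
    "retrieved": ["doc_91", "doc_12", "doc_44", "doc_07"],
    "relevant": {"doc_12", "doc_55"},
}

print(recall_at_k(query_log["retrieved"], query_log["relevant"], k=3))  # 0.5
print(coverage(query_log["retrieved"], query_log["relevant"]))          # 1.0
```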
Recommended Actions
- Instrument a Benchmark: Build a 50–100 item golden set with rubric scoring (accuracy, policy adherence, latency); a scoring sketch follows this list.
- Baseline with Prompting: Try 2–3 prompt patterns (schema-first, exemplar-first, policy-first). Keep the best.
- Add RAG/Tools: Introduce retrieval and 1–2 critical functions; re-run the benchmark. Compare gains vs. added latency.
- Trial a Small Fine-Tune: Use adapters on 1–2 narrow intents with 1k+ labeled examples. Measure incremental gain per $.
- Make the Call: Pick the cheapest architecture that passes the bar. Ship, observe, and revisit quarterly.
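One way to instrument the benchmark from the first action item. The rubric checks, candidate interface, and item fields below are assumptions for illustration; a real rubric would use graded criteria and your own policy definitions, but the shape stays the same: every architecture sits behind the same callable and is scored on the same golden set.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenItem:
    prompt: str
    expected: str              # reference answer or key fact the output must contain
    banned_terms: list[str]    # simplified policy check: terms that must not appear

def run_benchmark(candidate: Callable[[str], str], golden: list[GoldenItem]) -> dict:
    """Score one candidate architecture (prompt-only, +RAG/tools, +fine-tune) on the golden set."""
    correct, compliant, latencies = 0, 0, []
    for item in golden:
        start = time.perf_counter()
        output = candidate(item.prompt)
        latencies.append(time.perf_counter() - start)

        # Accuracy: naive containment check stands in for rubric-based grading.
        if item.expected.lower() in output.lower():
            correct += 1
        # Policy adherence: none of the banned terms appear in the output.
        if not any(term.lower() in output.lower() for term in item.banned_terms):
            compliant += 1

    n = len(golden)
    return {
        "accuracy": correct / n,
        "policy_adherence": compliant / n,
        "p50_latency_s": sorted(latencies)[n // 2],
    }

# Usage: plug each architecture in behind the same callable and compare the result dicts.
golden = [GoldenItem("What plan includes SSO?", "Enterprise", ["guaranteed refund"])]
print(run_benchmark(lambda prompt: "SSO is included in the Enterprise plan.", golden))
```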
Common Pitfalls
- Premature fine-tuning: Tuning before you’ve stabilized prompts and tool boundaries.
- Unlabeled evals: Debating “looks better” without a rubric and golden set.
- Tool sprawl: Many functions, no observability; latency balloons and failures are opaque.
- Static behavior: Hard-coding policies in prompts that should live in tools or governance.
Quick Win Checklist
- Publish a one-pager decision tree and share it with product & risk.
- Run a one-week bakeoff: Prompt-only → +RAG/Tools → +Fine-tune; choose on metrics.
- Version prompts, tools, and models as a set with a single release tag.
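A sketch of what versioning prompts, tools, and models as one unit can look like. The manifest fields and file layout are assumptions; the point is simply that a single release tag pins all three together, so rollbacks and bakeoff comparisons are unambiguous.

```python
import hashlib
import json
from pathlib import Path

def release_manifest(tag: str, prompt_files: list[str], tool_versions: dict, model_id: str) -> dict:
    """Bundle prompt content hashes, tool versions, and the model identifier under one release tag."""
    return {
        "release_tag": tag,
        "model": model_id,
        "tools": tool_versions,
        "prompts": {
            p: hashlib.sha256(Path(p).read_bytes()).hexdigest()[:12]  # short content hash
            for p in prompt_files
        },
    }

# Hypothetical example: write the manifest next to the eval results for this bakeoff.
manifest = release_manifest(
    tag="support-bot-2025.04-rc1",
    prompt_files=[],  # e.g. ["prompts/triage.md", "prompts/answer.md"]
    tool_versions={"order_lookup": "1.3.0", "refund_calculator": "0.9.2"},
    model_id="base-model+lora-refunds-v2",
)
print(json.dumps(manifest, indent=2))
```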
Closing
The right lever depends on how fast your facts change, what you must control, and what you can measure. Start with prompts, add tools for action, and fine-tune to lock in stable behaviors at scale.