Choosing a Local Model#
The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.
Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.
Model Size Tiers#
Tier 1: Small (2-7B Parameters)#
- Memory: 2-5 GB (Q4 quantized)
- Speed: 30-100 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.001-0.01 per call (cloud)
What they do well:
- Structured extraction (parse text into JSON fields)
- Classification and routing (categorize inputs into predefined labels)
- Function calling (select a tool and fill parameters from a small schema)
- Summarization (compress focused inputs into shorter text)
- Format conversion (Markdown to JSON, log to structured event)
- Validation and gatekeeping (check schema compliance, input safety)
What they do poorly:
- Multi-step reasoning (chaining logical deductions)
- Cross-file analysis (understanding relationships across many files)
- Nuanced code review (catching subtle bugs that require deep understanding)
- Open-ended generation (creative writing, complex explanations)
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Qwen3-4B | 4B | Best all-rounder at this size; matches 120B teacher on 7/8 tasks when fine-tuned |
| Ministral-3B | 3B | Purpose-built for function calling and JSON output |
| Phi-3-mini | 3.8B | Strong reasoning for its size |
| Llama 3.2-3B | 3B | Solid baseline, widely supported |
| Gemma-2-2B | 2B | Google’s smallest, good for classification |
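As a concrete example of the tasks these models handle well, here is a minimal structured-extraction sketch using the Ollama Python client. It assumes a local Ollama server with a small model already pulled; the `qwen3:4b` tag and the ticket fields are illustrative, and Ollama's JSON mode is used to constrain the output.

```python
# Minimal structured-extraction sketch.
# Assumes: `pip install ollama`, a running Ollama server, and a small model pulled
# (e.g. `ollama pull qwen3:4b`). Model tag and field names are illustrative.
import json
import ollama

SCHEMA_HINT = (
    "Extract the following fields from the support ticket and reply with JSON only: "
    '{"customer": string, "product": string, "severity": "low" | "medium" | "high"}'
)

ticket = "Our checkout page has been timing out for Acme Corp users since the 2.3 release."

response = ollama.chat(
    model="qwen3:4b",             # illustrative tag; any 2-7B instruct model works here
    messages=[
        {"role": "system", "content": SCHEMA_HINT},
        {"role": "user", "content": ticket},
    ],
    format="json",                # ask Ollama to constrain the reply to valid JSON
    options={"temperature": 0},   # deterministic output for extraction tasks
)

record = json.loads(response["message"]["content"])
print(record)  # e.g. {"customer": "Acme Corp", "product": "checkout", "severity": "high"}
```

The schema lives entirely in the prompt and the JSON constraint; that narrowness is exactly why a 3-4B model is enough.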
Tier 2: Medium (13-32B Parameters)#
- Memory: 10-22 GB (Q4 quantized)
- Speed: 15-40 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.003-0.03 per call (cloud)
What they add over small models:
- Multi-file reasoning (understanding how components relate)
- Code review with context (catching bugs, suggesting improvements)
- Complex summarization (preserving nuance across long inputs)
- Architecture analysis (identifying patterns and anti-patterns)
- Refactoring suggestions (proposing structural changes with rationale)
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Best local model for code: compilation correctness, refactoring, review |
| DeepSeek Coder V2 | 16B/236B MoE | Strong code generation, efficient MoE architecture |
| CodeLlama 34B | 34B | Meta’s code-focused model |
The daily driver. A 32B model is the practical ceiling for “always loaded” on 48-64GB machines. It handles 80% of coding tasks with quality approaching cloud models.
Tier 3: Large (70B+ Parameters)#
- Memory: 40-52 GB (Q4 quantized)
- Speed: 5-15 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.01-0.06 per call (cloud)
What they add over medium models:
- Complex multi-step reasoning
- Deep architectural analysis across large codebases
- Subtle bug detection requiring broad context
- Natural language quality approaching cloud models
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Llama 3.3 70B | 70B | Best reasoning at this size, strong code understanding |
| Qwen 2.5 72B | 72B | Competitive with Llama, good for multilingual |
| DeepSeek R1 70B | 70B | Reasoning-focused, good for complex analysis |
Load on demand. 70B models consume most of your memory. Stop your 32B daily driver first, load the 70B for the specific complex task, then switch back.
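The swap is scriptable. Per Ollama's API documentation, a request with an empty prompt and `keep_alive=0` unloads a model from memory, and the next request for another model loads it on demand. A rough sketch, assuming the Ollama Python client; the model tags are illustrative:

```python
# Swap the always-on 32B model out for a 70B model for one heavy task, then release it.
# Assumes a running Ollama server and `pip install ollama`; tags are illustrative.
import ollama

DAILY_DRIVER = "qwen2.5-coder:32b"
HEAVY_MODEL = "llama3.3:70b"

def unload(model: str) -> None:
    # Per Ollama's API docs, an empty prompt with keep_alive=0 unloads the model.
    ollama.generate(model=model, prompt="", keep_alive=0)

def run_heavy_task(prompt: str) -> str:
    unload(DAILY_DRIVER)                       # free the memory held by the 32B model
    response = ollama.generate(model=HEAVY_MODEL, prompt=prompt)  # loads the 70B on demand
    unload(HEAVY_MODEL)                        # release it when the task is done
    return response["response"]                # the 32B reloads lazily on its next call

print(run_heavy_task("Review the architecture notes below and list coupling risks:\n..."))
```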
Task-Model Matching#
The decision flowchart:
```
Is the output structured (JSON, classification, tool call)?
└── YES → Can you define the exact output schema?
    ├── YES → Use 3-4B model (Qwen3-4B, Ministral-3B)
    └── NO  → Use 7B model (Qwen 2.5 Coder 7B)

Is the task single-file analysis?
└── YES → Is it extraction or summarization?
    ├── YES → Use 7B model
    └── NO (review, refactoring) → Use 32B model

Is the task multi-file analysis?
└── YES → Can you summarize files first, then correlate?
    ├── YES → Use 7B for summaries + 32B for correlation
    └── NO (need full context) → Use 32B or 70B

Is the task complex reasoning or architecture-level?
└── YES → Use 70B locally or escalate to cloud API
```

Empirical Results#
From benchmarking across structured extraction, classification, function calling, and summarization:
| Task | 3-4B Quality | 7B Quality | 32B Quality | Cloud (GPT-4/Claude) |
|---|---|---|---|---|
| JSON extraction | 85-92% | 90-95% | 95-98% | 97-99% |
| Classification | 80-90% | 88-95% | 93-97% | 96-99% |
| Function calling | 75-88% | 85-93% | 92-97% | 95-99% |
| Summarization | 70-80% | 80-88% | 88-93% | 93-97% |
| Code review | 40-55% | 55-70% | 75-85% | 88-95% |
| Multi-file reasoning | 20-35% | 40-55% | 65-80% | 85-95% |
These ranges reflect variation across models within each tier and across task difficulty. The key insight: small models match or approach cloud quality on constrained tasks, but fall off sharply on open-ended reasoning.
Cost Comparison#
Per-Call Cost#
| Provider | Model | Input Cost (1K tokens) | Output Cost (1K tokens) | Total (typical call) |
|---|---|---|---|---|
| Local (Ollama) | Qwen3-4B | $0 | $0 | $0 |
| Local (Ollama) | Qwen 2.5 Coder 32B | $0 | $0 | $0 |
| Local (Ollama) | Llama 3.3 70B | $0 | $0 | $0 |
| Anthropic | Claude Sonnet 4.5 | $0.003 | $0.015 | ~$0.0045 |
| Anthropic | Claude Opus 4.6 | $0.015 | $0.075 | ~$0.0225 |
| OpenAI | GPT-4o | $0.005 | $0.015 | ~$0.0055 |
A typical extraction call processes ~500 input tokens and generates ~200 output tokens. At 1000 calls/day:
- Local 4B: $0/day, $0/month
- Claude Sonnet: ~$4.50/day, ~$135/month
- Claude Opus: ~$22.50/day, ~$675/month
Hardware Amortization#
The hardware cost is real but amortized:
| Hardware | Cost | Monthly Amortized (3yr) | Models Supported |
|---|---|---|---|
| Mac Mini M4 Pro 48GB | ~$1,800 | ~$50/mo | Up to 32B daily driver |
| Mac Mini M4 Pro 64GB | ~$2,200 | ~$61/mo | Up to 70B on demand |
| Linux + RTX 4090 (24GB) | ~$2,500 | ~$69/mo | Up to 32B |
| Linux + 2x RTX 4090 | ~$4,500 | ~$125/mo | Up to 70B |
At 1000+ calls/day, local inference typically pays for itself within a few months to about a year, depending on which cloud model it replaces. At lower volumes, the convenience and quality of cloud APIs may justify the cost.
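The break-even point is simple arithmetic. A back-of-the-envelope sketch using the prices and call shape from the tables above; the call volume and hardware cost are illustrative, so plug in your own numbers:

```python
# Back-of-the-envelope break-even: hardware cost vs. cloud API spend at a given volume.
# Prices and call shape come from the tables above; adjust to your workload.

CALLS_PER_DAY = 1_000
INPUT_TOKENS, OUTPUT_TOKENS = 500, 200           # "typical call" from the cost section

CLOUD_PRICES = {                                  # (input $, output $) per 1K tokens
    "Claude Sonnet 4.5": (0.003, 0.015),
    "Claude Opus 4.6":   (0.015, 0.075),
    "GPT-4o":            (0.005, 0.015),
}

HARDWARE_COST = 1_800                             # e.g. Mac Mini M4 Pro 48GB

for name, (inp, out) in CLOUD_PRICES.items():
    per_call = (INPUT_TOKENS / 1_000) * inp + (OUTPUT_TOKENS / 1_000) * out
    monthly = per_call * CALLS_PER_DAY * 30
    breakeven_months = HARDWARE_COST / monthly
    print(f"{name}: ${per_call:.4f}/call, ${monthly:,.0f}/month, "
          f"hardware pays off in ~{breakeven_months:.1f} months")
```

At these rates, Sonnet-class pricing breaks even in roughly a year and Opus-class pricing in under three months.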
When to Use Cloud Instead#
Local models are not always the right choice:
- Task requires frontier-model reasoning. Complex multi-step analysis where 70B local is not good enough.
- Latency budget is tight. Cloud APIs can have lower time-to-first-token due to optimized serving infrastructure.
- Volume is low. Under ~100 calls/day, the hardware cost is not justified.
- You need the latest capabilities. Cloud models are updated frequently. Local models lag by weeks to months.
- Compliance requires specific providers. Some regulated environments mandate specific cloud providers with BAAs and certifications.
The Hybrid Strategy#
The most practical approach is not “local only” or “cloud only” — it is routing by task:
```
Incoming task
│
├── Structured extraction → Local 3-4B (instant, free)
├── Classification/routing → Local 3-4B (instant, free)
├── File summarization → Local 7B (fast, free)
├── Code review → Local 32B (good, free)
├── Multi-file correlation → Local 32B (good, free)
├── Complex architecture → Local 70B (slower, free)
└── Frontier reasoning → Cloud API (best quality, paid)
```

Route the 80% of tasks that are structured and constrained to small local models. Reserve cloud APIs for the 20% that genuinely need frontier intelligence. This is 10-30x cheaper than sending everything to the cloud while maintaining quality where it matters.
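In practice the router can be a simple lookup from task type to model, with cloud reserved for the frontier bucket. A minimal sketch, assuming the Ollama Python client for local models; the task categories mirror the flowchart above, the model tags are illustrative, and the cloud call is a stub because it depends on your provider SDK:

```python
# Route each task to the cheapest model tier that handles it; escalate to cloud only
# for frontier reasoning. Model tags are illustrative; the cloud path is a stub.
import ollama

ROUTES = {
    "extraction":     "qwen3:4b",
    "classification": "qwen3:4b",
    "summarization":  "qwen2.5-coder:7b",
    "code_review":    "qwen2.5-coder:32b",
    "correlation":    "qwen2.5-coder:32b",
    "architecture":   "llama3.3:70b",
    # "frontier" deliberately has no local route
}

def call_cloud_api(prompt: str) -> str:
    # Stub: wire this to your provider's SDK (Anthropic, OpenAI, ...).
    raise NotImplementedError("cloud escalation not configured")

def run(task_type: str, prompt: str) -> str:
    model = ROUTES.get(task_type)
    if model is None:
        return call_cloud_api(prompt)          # the ~20% that needs frontier reasoning
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

if __name__ == "__main__":
    print(run("extraction", "Extract the invoice number and total from: ..."))
```

The lookup table is also where the routing policy lives: moving a task between tiers is a one-line change, which makes it cheap to re-test as models improve.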
Common Mistakes#
- Using 32B for everything. A 32B model doing JSON extraction is like using a forklift to carry a grocery bag. The 4B model is faster, uses less memory, and produces equivalent output for constrained tasks.
- Dismissing small models based on general benchmarks. General benchmarks (MMLU, HumanEval) test broad reasoning. Your extraction task is a narrow, constrained problem where small models excel. Test on your actual task, not on benchmarks.
- Not testing quantization levels. The default Q4_K_M quantization is good but not always optimal. For tasks where quality is borderline, trying Q5_K_M can push a smaller model over the threshold, avoiding the need to step up a tier.
- Ignoring cold start time. The first call after loading a model is slower because the weights load from disk into memory. For latency-sensitive applications, keep the model warm with periodic pings (see the sketch after this list).
- Comparing local model quality on creative tasks. Local models lag behind cloud models on open-ended generation. But most agent workflows are not creative — they are structured operations where local models are competitive.
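For the cold-start point above, Ollama's `keep_alive` parameter controls how long a model stays resident after a request, and a periodic no-op request keeps it warm. A minimal sketch; the model tag, interval, and keep-alive window are illustrative:

```python
# Keep a local model warm so latency-sensitive calls never pay the load-from-disk cost.
# Assumes a running Ollama server; run this in a background thread or a small service.
import time
import ollama

MODEL = "qwen3:4b"        # illustrative tag: whichever model must stay responsive
PING_INTERVAL_S = 240     # ping more often than the keep_alive window expires

def keep_warm() -> None:
    while True:
        # An empty prompt loads (or keeps) the model in memory without generating output;
        # keep_alive extends how long Ollama keeps it resident after this request.
        ollama.generate(model=MODEL, prompt="", keep_alive="10m")
        time.sleep(PING_INTERVAL_S)

if __name__ == "__main__":
    keep_warm()
```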