Choosing a Local Model#
The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.
Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.
Model Size Tiers#
Tier 1: Small (2-7B Parameters)#
- Memory: 2-5 GB (Q4 quantized)
- Speed: 30-100 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.001-0.01 per call (cloud)
What they do well:
- Structured extraction (parse text into JSON fields)
- Classification and routing (categorize inputs into predefined labels)
- Function calling (select a tool and fill parameters from a small schema)
- Summarization (compress focused inputs into shorter text)
- Format conversion (Markdown to JSON, log to structured event)
- Validation and gatekeeping (check schema compliance, input safety)
What they do poorly:
- Multi-step reasoning (chaining logical deductions)
- Cross-file analysis (understanding relationships across many files)
- Nuanced code review (catching subtle bugs that require deep understanding)
- Open-ended generation (creative writing, complex explanations)
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Qwen3-4B | 4B | Best all-rounder at this size; matches 120B teacher on 7/8 tasks when fine-tuned |
| Ministral-3B | 3B | Purpose-built for function calling and JSON output |
| Phi-3-mini | 3.8B | Strong reasoning for its size |
| Llama 3.2-3B | 3B | Solid baseline, widely supported |
| Gemma-2-2B | 2B | Google’s smallest, good for classification |
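As a concrete example of the tasks these models handle well, here is a minimal structured-extraction sketch using the Ollama Python client. It assumes a local Ollama server with a small model already pulled; the `qwen3:4b` tag and the ticket fields are illustrative, and Ollama's JSON mode is used to constrain the output.

```python
# Minimal structured-extraction sketch.
# Assumes: `pip install ollama`, a running Ollama server, and a small model pulled
# (e.g. `ollama pull qwen3:4b`). Model tag and field names are illustrative.
import json
import ollama

SCHEMA_HINT = (
    "Extract the following fields from the support ticket and reply with JSON only: "
    '{"customer": string, "product": string, "severity": "low" | "medium" | "high"}'
)

ticket = "Our checkout page has been timing out for Acme Corp users since the 2.3 release."

response = ollama.chat(
    model="qwen3:4b",             # illustrative tag; any 2-7B instruct model works here
    messages=[
        {"role": "system", "content": SCHEMA_HINT},
        {"role": "user", "content": ticket},
    ],
    format="json",                # ask Ollama to constrain the reply to valid JSON
    options={"temperature": 0},   # deterministic output for extraction tasks
)

record = json.loads(response["message"]["content"])
print(record)  # e.g. {"customer": "Acme Corp", "product": "checkout", "severity": "high"}
```

The schema lives entirely in the prompt and the JSON constraint; that narrowness is exactly why a 3-4B model is enough.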
Tier 2: Medium (13-32B Parameters)#
- Memory: 10-22 GB (Q4 quantized)
- Speed: 15-40 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.003-0.03 per call (cloud)
What they add over small models:
- Multi-file reasoning (understanding how components relate)
- Code review with context (catching bugs, suggesting improvements)
- Complex summarization (preserving nuance across long inputs)
- Architecture analysis (identifying patterns and anti-patterns)
- Refactoring suggestions (proposing structural changes with rationale)
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Best local model for code: compilation correctness, refactoring, review |
| DeepSeek Coder V2 | 16B/236B MoE | Strong code generation, efficient MoE architecture |
| CodeLlama 34B | 34B | Meta’s code-focused model |
The daily driver. A 32B model is the practical ceiling for “always loaded” on 48-64GB machines. It handles 80% of coding tasks with quality approaching cloud models.
Tier 3: Large (70B+ Parameters)#
- Memory: 40-52 GB (Q4 quantized)
- Speed: 5-15 tokens/second on Apple Silicon
- Cost: $0 (local) vs $0.01-0.06 per call (cloud)
What they add over medium models:
- Complex multi-step reasoning
- Deep architectural analysis across large codebases
- Subtle bug detection requiring broad context
- Natural language quality approaching cloud models
Recommended models:
| Model | Parameters | Strength |
|---|---|---|
| Llama 3.3 70B | 70B | Best reasoning at this size, strong code understanding |
| Qwen 2.5 72B | 72B | Competitive with Llama, good for multilingual |
| DeepSeek R1 70B | 70B | Reasoning-focused, good for complex analysis |
Load on demand. 70B models consume most of your memory. Stop your 32B daily driver first, load the 70B for the specific complex task, then switch back.
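The swap is scriptable. Per Ollama's API documentation, a request with an empty prompt and `keep_alive=0` unloads a model from memory, and the next request for another model loads it on demand. A rough sketch, assuming the Ollama Python client; the model tags are illustrative:

```python
# Swap the always-on 32B model out for a 70B model for one heavy task, then release it.
# Assumes a running Ollama server and `pip install ollama`; tags are illustrative.
import ollama

DAILY_DRIVER = "qwen2.5-coder:32b"
HEAVY_MODEL = "llama3.3:70b"

def unload(model: str) -> None:
    # Per Ollama's API docs, an empty prompt with keep_alive=0 unloads the model.
    ollama.generate(model=model, prompt="", keep_alive=0)

def run_heavy_task(prompt: str) -> str:
    unload(DAILY_DRIVER)                       # free the memory held by the 32B model
    response = ollama.generate(model=HEAVY_MODEL, prompt=prompt)  # loads the 70B on demand
    unload(HEAVY_MODEL)                        # release it when the task is done
    return response["response"]                # the 32B reloads lazily on its next call

print(run_heavy_task("Review the architecture notes below and list coupling risks:\n..."))
```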
Task-Model Matching#
The decision flowchart:
```
Is the output structured (JSON, classification, tool call)?
└── YES → Can you define the exact output schema?
    ├── YES → Use 3-4B model (Qwen3-4B, Ministral-3B)
    └── NO  → Use 7B model (Qwen 2.5 Coder 7B)

Is the task single-file analysis?
└── YES → Is it extraction or summarization?
    ├── YES → Use 7B model
    └── NO (review, refactoring) → Use 32B model

Is the task multi-file analysis?
└── YES → Can you summarize files first, then correlate?
    ├── YES → Use 7B for summaries + 32B for correlation
    └── NO (need full context) → Use 32B or 70B

Is the task complex reasoning or architecture-level?
└── YES → Use 70B locally or escalate to cloud API
```

Empirical Results#
From benchmarking across structured extraction, classification, function calling, and summarization:
| Task | 3-4B Quality | 7B Quality | 32B Quality | Cloud (GPT-4/Claude) |
|---|---|---|---|---|
| JSON extraction | 85-92% | 90-95% | 95-98% | 97-99% |
| Classification | 80-90% | 88-95% | 93-97% | 96-99% |
| Function calling | 75-88% | 85-93% | 92-97% | 95-99% |
| Summarization | 70-80% | 80-88% | 88-93% | 93-97% |
| Code review | 40-55% | 55-70% | 75-85% | 88-95% |
| Multi-file reasoning | 20-35% | 40-55% | 65-80% | 85-95% |
These ranges reflect variation across models within each tier and across task difficulty. The key insight: small models match or approach cloud quality on constrained tasks, but fall off sharply on open-ended reasoning.
Cost Comparison#
Per-Call Cost#
| Provider | Model | Input Cost (1K tokens) | Output Cost (1K tokens) | Total (typical call) |
|---|---|---|---|---|
| Local (Ollama) | Qwen3-4B | $0 | $0 | $0 |
| Local (Ollama) | Qwen 2.5 Coder 32B | $0 | $0 | $0 |
| Local (Ollama) | Llama 3.3 70B | $0 | $0 | $0 |
| Anthropic | Claude Sonnet 4.5 | $0.003 | $0.015 | ~$0.0045 |
| Anthropic | Claude Opus 4.6 | $0.015 | $0.075 | ~$0.0225 |
| OpenAI | GPT-4o | $0.005 | $0.015 | ~$0.0055 |
A typical extraction call processes ~500 input tokens and generates ~200 output tokens. At 1000 calls/day:
- Local 4B: $0/day, $0/month
- Claude Sonnet: ~$4.50/day, ~$135/month
- Claude Opus: ~$22.50/day, ~$675/month
Hardware Amortization#
The hardware cost is real but amortized:
| Hardware | Cost | Monthly Amortized (3yr) | Models Supported |
|---|---|---|---|
| Mac Mini M4 Pro 48GB | ~$1,800 | ~$50/mo | Up to 32B daily driver |
| Mac Mini M4 Pro 64GB | ~$2,200 | ~$61/mo | Up to 70B on demand |
| Linux + RTX 4090 (24GB) | ~$2,500 | ~$69/mo | Up to 32B |
| Linux + 2x RTX 4090 | ~$4,500 | ~$125/mo | Up to 70B |
At 1000+ calls/day, local inference typically pays for itself within a few months to about a year, depending on which cloud model it replaces. At lower volumes, the convenience and quality of cloud APIs may justify the cost.
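The break-even point is simple arithmetic. A back-of-the-envelope sketch using the prices and call shape from the tables above; the call volume and hardware cost are illustrative, so plug in your own numbers:

```python
# Back-of-the-envelope break-even: hardware cost vs. cloud API spend at a given volume.
# Prices and call shape come from the tables above; adjust to your workload.

CALLS_PER_DAY = 1_000
INPUT_TOKENS, OUTPUT_TOKENS = 500, 200           # "typical call" from the cost section

CLOUD_PRICES = {                                  # (input $, output $) per 1K tokens
    "Claude Sonnet 4.5": (0.003, 0.015),
    "Claude Opus 4.6":   (0.015, 0.075),
    "GPT-4o":            (0.005, 0.015),
}

HARDWARE_COST = 1_800                             # e.g. Mac Mini M4 Pro 48GB

for name, (inp, out) in CLOUD_PRICES.items():
    per_call = (INPUT_TOKENS / 1_000) * inp + (OUTPUT_TOKENS / 1_000) * out
    monthly = per_call * CALLS_PER_DAY * 30
    breakeven_months = HARDWARE_COST / monthly
    print(f"{name}: ${per_call:.4f}/call, ${monthly:,.0f}/month, "
          f"hardware pays off in ~{breakeven_months:.1f} months")
```

At these rates, Sonnet-class pricing breaks even in roughly a year and Opus-class pricing in under three months.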
When to Use Cloud Instead#
Local models are not always the right choice:
- Task requires frontier-model reasoning. Complex multi-step analysis where 70B local is not good enough.
- Latency budget is tight. Cloud APIs can have lower time-to-first-token due to optimized serving infrastructure.
- Volume is low. Under ~100 calls/day, the hardware cost is not justified.
- You need the latest capabilities. Cloud models are updated frequently. Local models lag by weeks to months.
- Compliance requires specific providers. Some regulated environments mandate specific cloud providers with BAAs and certifications.
The Hybrid Strategy#
The most practical approach is not “local only” or “cloud only” — it is routing by task:
```
Incoming task
│
├── Structured extraction → Local 3-4B (instant, free)
├── Classification/routing → Local 3-4B (instant, free)
├── File summarization → Local 7B (fast, free)
├── Code review → Local 32B (good, free)
├── Multi-file correlation → Local 32B (good, free)
├── Complex architecture → Local 70B (slower, free)
└── Frontier reasoning → Cloud API (best quality, paid)
```

Route the 80% of tasks that are structured and constrained to small local models. Reserve cloud APIs for the 20% that genuinely need frontier intelligence. This is 10-30x cheaper than sending everything to the cloud while maintaining quality where it matters.
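In practice the router can be a simple lookup from task type to model, with cloud reserved for the frontier bucket. A minimal sketch, assuming the Ollama Python client for local models; the task categories mirror the flowchart above, the model tags are illustrative, and the cloud call is a stub because it depends on your provider SDK:

```python
# Route each task to the cheapest model tier that handles it; escalate to cloud only
# for frontier reasoning. Model tags are illustrative; the cloud path is a stub.
import ollama

ROUTES = {
    "extraction":     "qwen3:4b",
    "classification": "qwen3:4b",
    "summarization":  "qwen2.5-coder:7b",
    "code_review":    "qwen2.5-coder:32b",
    "correlation":    "qwen2.5-coder:32b",
    "architecture":   "llama3.3:70b",
    # "frontier" deliberately has no local route
}

def call_cloud_api(prompt: str) -> str:
    # Stub: wire this to your provider's SDK (Anthropic, OpenAI, ...).
    raise NotImplementedError("cloud escalation not configured")

def run(task_type: str, prompt: str) -> str:
    model = ROUTES.get(task_type)
    if model is None:
        return call_cloud_api(prompt)          # the ~20% that needs frontier reasoning
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

if __name__ == "__main__":
    print(run("extraction", "Extract the invoice number and total from: ..."))
```

The lookup table is also where the routing policy lives: moving a task between tiers is a one-line change, which makes it cheap to re-test as models improve.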
Common Mistakes#
- Using 32B for everything. A 32B model doing JSON extraction is like using a forklift to carry a grocery bag. The 4B model is faster, uses less memory, and produces equivalent output for constrained tasks.
- Dismissing small models based on general benchmarks. General benchmarks (MMLU, HumanEval) test broad reasoning. Your extraction task is a narrow, constrained problem where small models excel. Test on your actual task, not on benchmarks.
- Not testing quantization levels. The default Q4_K_M quantization is good but not always optimal. For tasks where quality is borderline, trying Q5_K_M can push a smaller model over the threshold, avoiding the need to step up a tier.
- Ignoring cold start time. The first call after loading a model is slower because the weights load from disk into memory. For latency-sensitive applications, keep the model warm with periodic pings (see the sketch after this list).
- Comparing local model quality on creative tasks. Local models lag behind cloud models on open-ended generation. But most agent workflows are not creative — they are structured operations where local models are competitive.
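For the cold-start point above, Ollama's `keep_alive` parameter controls how long a model stays resident after a request, and a periodic no-op request keeps it warm. A minimal sketch; the model tag, interval, and keep-alive window are illustrative:

```python
# Keep a local model warm so latency-sensitive calls never pay the load-from-disk cost.
# Assumes a running Ollama server; run this in a background thread or a small service.
import time
import ollama

MODEL = "qwen3:4b"        # illustrative tag: whichever model must stay responsive
PING_INTERVAL_S = 240     # ping more often than the keep_alive window expires

def keep_warm() -> None:
    while True:
        # An empty prompt loads (or keeps) the model in memory without generating output;
        # keep_alive extends how long Ollama keeps it resident after this request.
        ollama.generate(model=MODEL, prompt="", keep_alive="10m")
        time.sleep(PING_INTERVAL_S)

if __name__ == "__main__":
    keep_warm()
```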