# Choosing a Local Model

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.

## Model Size Tiers

### Tier 1: Small (2-7B Parameters)

  • Memory: 2-5 GB (Q4 quantized)
  • Speed: 30-100 tokens/second on Apple Silicon
  • Cost: $0 (local) vs $0.001-0.01 per call (cloud)

What they do well:

  • Structured extraction (parse text into JSON fields)
  • Classification and routing (categorize inputs into predefined labels)
  • Function calling (select a tool and fill parameters from a small schema)
  • Summarization (compress focused inputs into shorter text)
  • Format conversion (Markdown to JSON, log to structured event)
  • Validation and gatekeeping (check schema compliance, input safety)

What they do poorly:

  • Multi-step reasoning (chaining logical deductions)
  • Cross-file analysis (understanding relationships across many files)
  • Nuanced code review (catching subtle bugs that require deep understanding)
  • Open-ended generation (creative writing, complex explanations)

Recommended models:

| Model | Parameters | Strength |
|---|---|---|
| Qwen3-4B | 4B | Best all-rounder at this size; matches 120B teacher on 7/8 tasks when fine-tuned |
| Ministral-3B | 3B | Purpose-built for function calling and JSON output |
| Phi-3-mini | 3.8B | Strong reasoning for its size |
| Llama 3.2-3B | 3B | Solid baseline, widely supported |
| Gemma-2-2B | 2B | Google’s smallest, good for classification |
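
To make the structured-extraction strength concrete, here is a minimal sketch against Ollama's local HTTP API with JSON-constrained output. It assumes Ollama is running on its default port and that a small model has been pulled; the qwen3:4b tag, the prompt, and the field names are illustrative, not prescriptive.

```python
# Sketch: JSON extraction with a small local model via Ollama's HTTP API.
# Assumes Ollama is running locally and a Tier 1 model (here the illustrative
# tag "qwen3:4b") has already been pulled.
import json

import requests

PROMPT = """Extract these fields from the ticket below and return ONLY JSON with
keys: customer, product, severity (low|medium|high), summary.

Ticket:
Checkout has been timing out for Acme Corp users since the 2.3 release.
"""

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:4b",   # any small local model tag works here
        "prompt": PROMPT,
        "format": "json",      # constrain decoding to valid JSON
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()

# The non-streaming response carries the generated text in the "response" field.
fields = json.loads(resp.json()["response"])
print(fields)
```

Constraining the output format is what keeps a 3-4B model in its comfort zone: it only has to fill in values, not decide the structure.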

### Tier 2: Medium (13-32B Parameters)

  • Memory: 10-22 GB (Q4 quantized)
  • Speed: 15-40 tokens/second on Apple Silicon
  • Cost: $0 (local) vs $0.003-0.03 per call (cloud)

What they add over small models:

  • Multi-file reasoning (understanding how components relate)
  • Code review with context (catching bugs, suggesting improvements)
  • Complex summarization (preserving nuance across long inputs)
  • Architecture analysis (identifying patterns and anti-patterns)
  • Refactoring suggestions (proposing structural changes with rationale)

Recommended models:

| Model | Parameters | Strength |
|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Best local model for code: compilation correctness, refactoring, review |
| DeepSeek Coder V2 | 16B/236B MoE | Strong code generation, efficient MoE architecture |
| CodeLlama 34B | 34B | Meta’s code-focused model |

The daily driver. A 32B model is the practical ceiling for “always loaded” on 48-64GB machines. It handles 80% of coding tasks with quality approaching cloud models.

### Tier 3: Large (70B+ Parameters)

  • Memory: 40-52 GB (Q4 quantized)
  • Speed: 5-15 tokens/second on Apple Silicon
  • Cost: $0 (local) vs $0.01-0.06 per call (cloud)

What they add over medium models:

  • Complex multi-step reasoning
  • Deep architectural analysis across large codebases
  • Subtle bug detection requiring broad context
  • Natural language quality approaching cloud models

Recommended models:

| Model | Parameters | Strength |
|---|---|---|
| Llama 3.3 70B | 70B | Best reasoning at this size, strong code understanding |
| Qwen 2.5 72B | 72B | Competitive with Llama, good for multilingual tasks |
| DeepSeek R1 70B | 70B | Reasoning-focused, good for complex analysis |

Load on demand. 70B models consume most of your memory. Stop your 32B daily driver first, load the 70B for the specific complex task, then switch back.
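
A minimal sketch of that swap against Ollama's HTTP API: a request with keep_alive set to 0 asks Ollama to unload a model, and a prompt-less request preloads the next one. The model tags are examples; substitute whatever you actually run.

```python
# Sketch: swap the always-loaded 32B out and pull the 70B in on demand.
# Uses Ollama's HTTP API; keep_alive=0 requests an unload, and a prompt-less
# request preloads a model. Model tags are examples, not requirements.
import requests

OLLAMA_GENERATE = "http://localhost:11434/api/generate"

def unload(model: str) -> None:
    # keep_alive=0 tells Ollama to release the model's memory.
    requests.post(OLLAMA_GENERATE, json={"model": model, "keep_alive": 0}, timeout=30)

def preload(model: str, keep_alive: str = "30m") -> None:
    # A request with no prompt loads the model without generating anything.
    requests.post(OLLAMA_GENERATE, json={"model": model, "keep_alive": keep_alive}, timeout=600)

unload("qwen2.5-coder:32b")   # free the daily driver's memory first
preload("llama3.3:70b")       # bring up the large model for the hard task
# ... run the complex task ...
unload("llama3.3:70b")
preload("qwen2.5-coder:32b")  # switch back when done
```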

## Task-Model Matching

The decision flowchart:

```
Is the output structured (JSON, classification, tool call)?
  └── YES → Can you define the exact output schema?
        ├── YES → Use 3-4B model (Qwen3-4B, Ministral-3B)
        └── NO  → Use 7B model (Qwen 2.5 Coder 7B)

Is the task single-file analysis?
  └── YES → Is it extraction or summarization?
        ├── YES → Use 7B model
        └── NO (review, refactoring) → Use 32B model

Is the task multi-file analysis?
  └── YES → Can you summarize files first, then correlate?
        ├── YES → Use 7B for summaries + 32B for correlation
        └── NO (need full context) → Use 32B or 70B

Is the task complex reasoning or architecture-level?
  └── YES → Use 70B locally or escalate to cloud API
```
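
The flowchart translates directly into a small routing function. The sketch below is illustrative: the Task fields and the returned model tags are assumptions to adapt to your own pipeline, not a fixed API.

```python
# Sketch: the decision flowchart as a routing function. The Task fields and
# the returned model tags are illustrative assumptions, not a fixed API.
from dataclasses import dataclass

@dataclass
class Task:
    structured_output: bool = False    # JSON, classification, tool call
    exact_schema: bool = False         # output schema can be fully specified
    single_file: bool = False
    extraction_or_summary: bool = False
    multi_file: bool = False
    summarizable_first: bool = False   # files can be summarized, then correlated

def choose_model(task: Task) -> str:
    if task.structured_output:
        return "qwen3:4b" if task.exact_schema else "qwen2.5-coder:7b"
    if task.single_file:
        return "qwen2.5-coder:7b" if task.extraction_or_summary else "qwen2.5-coder:32b"
    if task.multi_file:
        if task.summarizable_first:
            return "qwen2.5-coder:7b for summaries, qwen2.5-coder:32b for correlation"
        return "qwen2.5-coder:32b or llama3.3:70b"
    # Complex reasoning or architecture-level work.
    return "llama3.3:70b locally, or escalate to a cloud API"

# A schema-bound extraction routes straight to a small model:
print(choose_model(Task(structured_output=True, exact_schema=True)))  # qwen3:4b
```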

## Empirical Results

From benchmarking across structured extraction, classification, function calling, and summarization:

| Task | 3-4B Quality | 7B Quality | 32B Quality | Cloud (GPT-4/Claude) |
|---|---|---|---|---|
| JSON extraction | 85-92% | 90-95% | 95-98% | 97-99% |
| Classification | 80-90% | 88-95% | 93-97% | 96-99% |
| Function calling | 75-88% | 85-93% | 92-97% | 95-99% |
| Summarization | 70-80% | 80-88% | 88-93% | 93-97% |
| Code review | 40-55% | 55-70% | 75-85% | 88-95% |
| Multi-file reasoning | 20-35% | 40-55% | 65-80% | 85-95% |

These ranges reflect variation across models within each tier and across task difficulty. The key insight: small models match or approach cloud quality on constrained tasks, but fall off sharply on open-ended reasoning.

## Cost Comparison

### Per-Call Cost

| Provider | Model | Input Cost (1K tokens) | Output Cost (1K tokens) | Total (typical call, ~500 in / ~200 out) |
|---|---|---|---|---|
| Local (Ollama) | Qwen3-4B | $0 | $0 | $0 |
| Local (Ollama) | Qwen 2.5 Coder 32B | $0 | $0 | $0 |
| Local (Ollama) | Llama 3.3 70B | $0 | $0 | $0 |
| Anthropic | Claude Sonnet 4.5 | $0.003 | $0.015 | ~$0.0045 |
| Anthropic | Claude Opus 4.6 | $0.015 | $0.075 | ~$0.0225 |
| OpenAI | GPT-4o | $0.005 | $0.015 | ~$0.0055 |

A typical extraction call processes ~500 input tokens and generates ~200 output tokens. At 1000 calls/day:

  • Local 4B: $0/day, $0/month
  • Claude Sonnet: ~$4.50/day, ~$135/month
  • Claude Opus: ~$22.50/day, ~$675/month
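
The arithmetic, as a quick sketch using the per-1K-token prices from the table above and the assumed ~500 in / ~200 out call shape:

```python
# Back-of-the-envelope cloud cost: per-1K-token prices from the table above,
# applied to the assumed typical call (~500 input / ~200 output tokens).
def cloud_cost(in_price: float, out_price: float,
               in_tokens: int = 500, out_tokens: int = 200,
               calls_per_day: int = 1000, days: int = 30) -> tuple[float, float]:
    per_call = in_tokens / 1000 * in_price + out_tokens / 1000 * out_price
    return per_call, per_call * calls_per_day * days

for name, in_p, out_p in [("Claude Sonnet", 0.003, 0.015),
                          ("Claude Opus", 0.015, 0.075),
                          ("GPT-4o", 0.005, 0.015)]:
    per_call, monthly = cloud_cost(in_p, out_p)
    print(f"{name}: ${per_call:.4f}/call, ~${monthly:.0f}/month")
# Claude Sonnet: $0.0045/call, ~$135/month
# Claude Opus: $0.0225/call, ~$675/month
# GPT-4o: $0.0055/call, ~$165/month
```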

### Hardware Amortization

The hardware cost is real but amortized:

| Hardware | Cost | Monthly Amortized (3yr) | Models Supported |
|---|---|---|---|
| Mac Mini M4 Pro 48GB | ~$1,800 | ~$50/mo | Up to 32B daily driver |
| Mac Mini M4 Pro 64GB | ~$2,200 | ~$61/mo | Up to 70B on demand |
| Linux + RTX 4090 (24GB) | ~$2,500 | ~$69/mo | Up to 32B |
| Linux + 2x RTX 4090 | ~$4,500 | ~$125/mo | Up to 70B |

At 1000+ calls/day, local inference pays for itself: against Opus-class pricing (~$675/month at the volumes above) a Mac Mini is paid off in about three months, and against Sonnet-class pricing (~$135/month) in a bit over a year. At lower volumes, the convenience and quality of cloud APIs may justify the cost.

## When to Use Cloud Instead

Local models are not always the right choice:

  • Task requires frontier-model reasoning. Complex multi-step analysis where 70B local is not good enough.
  • Latency budget is tight. Cloud APIs can have lower time-to-first-token due to optimized serving infrastructure.
  • Volume is low. Under ~100 calls/day, the hardware cost is not justified.
  • You need the latest capabilities. Cloud models are updated frequently. Local models lag by weeks to months.
  • Compliance requires specific providers. Some regulated environments mandate specific cloud providers with BAAs and certifications.

## The Hybrid Strategy

The most practical approach is not “local only” or “cloud only” — it is routing by task:

```
Incoming task
  │
  ├── Structured extraction → Local 3-4B (instant, free)
  ├── Classification/routing → Local 3-4B (instant, free)
  ├── File summarization → Local 7B (fast, free)
  ├── Code review → Local 32B (good, free)
  ├── Multi-file correlation → Local 32B (good, free)
  ├── Complex architecture → Local 70B (slower, free)
  └── Frontier reasoning → Cloud API (best quality, paid)
```

Route the 80% of tasks that are structured and constrained to small local models. Reserve cloud APIs for the 20% that genuinely need frontier intelligence. This is 10-30x cheaper than sending everything to the cloud while maintaining quality where it matters.
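
In code, the routing table can be as simple as a dictionary from task type to backend. The sketch below dispatches local tasks to Ollama and leaves the cloud path as a placeholder; the task names and model tags are assumptions, not a standard.

```python
# Sketch: the hybrid router as a plain dictionary. Local tasks go to Ollama;
# the cloud branch is a placeholder for whichever provider SDK you use.
import requests

ROUTES = {
    "extract":      ("local", "qwen3:4b"),
    "classify":     ("local", "qwen3:4b"),
    "summarize":    ("local", "qwen2.5-coder:7b"),
    "code_review":  ("local", "qwen2.5-coder:32b"),
    "correlate":    ("local", "qwen2.5-coder:32b"),
    "architecture": ("local", "llama3.3:70b"),
    "frontier":     ("cloud", "<your cloud model>"),
}

def run(task_type: str, prompt: str) -> str:
    backend, model = ROUTES[task_type]
    if backend == "local":
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    # The 20% that needs frontier reasoning: call your cloud provider here.
    raise NotImplementedError(f"escalate to cloud model {model}")
```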

## Common Mistakes

  1. Using 32B for everything. A 32B model doing JSON extraction is like using a forklift to carry a grocery bag. The 4B model is faster, uses less memory, and produces equivalent output for constrained tasks.
  2. Dismissing small models based on general benchmarks. General benchmarks (MMLU, HumanEval) test broad reasoning. Your extraction task is a narrow, constrained problem where small models excel. Test on your actual task, not on benchmarks.
  3. Not testing quantization levels. The default Q4_K_M quantization is good but not always optimal. For tasks where quality is borderline, trying Q5_K_M can push a smaller model over the threshold, avoiding the need to step up a tier.
  4. Ignoring cold start time. The first call after loading a model is slower (the weights load from disk into memory). For latency-sensitive applications, keep the model warm with periodic pings (see the sketch after this list).
  5. Comparing local model quality on creative tasks. Local models lag behind cloud models on open-ended generation. But most agent workflows are not creative — they are structured operations where local models are competitive.
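
For mistake 4, a minimal keep-warm sketch against Ollama's API: a prompt-less generate request loads (or re-touches) the model, and keep_alive controls how long it stays resident. The ping interval and model tag are assumptions to tune for your setup.

```python
# Sketch: keep a local model warm between calls. A prompt-less generate request
# loads or re-touches the model; keep_alive controls how long it stays resident.
import threading

import requests

def keep_warm(model: str, interval_s: float = 240.0, keep_alive: str = "10m") -> None:
    def ping() -> None:
        requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "keep_alive": keep_alive},
            timeout=120,
        )
        timer = threading.Timer(interval_s, ping)
        timer.daemon = True    # don't keep the process alive just for pings
        timer.start()
    ping()

keep_warm("qwen2.5-coder:32b")   # re-touch every 4 minutes, stay loaded for 10
```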