Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, and GPU memory allocation, and it exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:

# macOS: Ollama runs as a menu bar app, or start the server manually:
ollama serve

# Linux: systemd service
sudo systemctl enable ollama
sudo systemctl start ollama

Verify it is running:

curl http://localhost:11434/api/tags

Pulling and Running Models#

# Pull a model (downloads once, cached for later runs)
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Show model details (parameters, quantization, size)
ollama show qwen2.5-coder:7b

Model Naming Convention#

Ollama model names follow the pattern name:tag where the tag indicates size and quantization:

qwen2.5-coder:7b        # 7B parameters, default quantization (Q4_K_M)
qwen2.5-coder:7b-q8_0   # 7B parameters, Q8 quantization (higher quality, more memory)
qwen2.5-coder:32b       # 32B parameters
llama3.3:70b             # 70B parameters
phi3:mini                # 3.8B parameters (alias)

Quantization and Quality Tradeoffs#

Quantization reduces model precision to use less memory. Ollama models default to Q4_K_M, which is a good balance:

| Quantization | Bits per Weight | Memory (7B) | Memory (32B) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M | ~4.5 | ~5 GB | ~22 GB | Slight degradation, good for most tasks |
| Q5_K_M | ~5.5 | ~6 GB | ~26 GB | Minimal degradation |
| Q6_K | ~6.5 | ~7 GB | ~30 GB | Near-original quality |
| Q8_0 | 8 | ~8 GB | ~36 GB | Essentially lossless |
| FP16 | 16 | ~14 GB | ~64 GB | Original precision |

For code tasks, Q4_K_M is sufficient for extraction and classification. For complex reasoning where every token matters, Q5_K_M or Q6_K can measurably improve output quality.
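
The memory column follows from the bits-per-weight column: weights take roughly parameters times bits-per-weight divided by 8 bytes, plus overhead for the KV cache and runtime buffers. A back-of-the-envelope sketch (the overhead constant here is my own rough guess, not a number Ollama reports):

def estimated_memory_gb(params_billion: float, bits_per_weight: float,
                        overhead_gb: float = 1.0) -> float:
    """Rough estimate: weight storage plus a fudge factor for KV cache and buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # e.g. 7 * 4.5 / 8 ≈ 3.9
    return weights_gb + overhead_gb

print(estimated_memory_gb(7, 4.5))   # ≈ 4.9, in line with the ~5 GB above
print(estimated_memory_gb(7, 8.0))   # ≈ 8.0, the Q8_0 row
print(estimated_memory_gb(32, 4.5))  # ≈ 19; the table's ~22 GB reflects a larger KV cache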

Memory Management#

This is where most people hit problems. Understanding how Ollama manages GPU memory prevents the most common issues.

How Ollama Loads Models#

When you run or call a model, Ollama loads it into GPU memory (unified memory on Apple Silicon, VRAM on discrete GPUs). The model stays loaded after the request completes for fast subsequent requests.

# See what models are currently loaded
ollama ps

# Output:
# NAME                     SIZE      PROCESSOR    UNTIL
# qwen2.5-coder:32b        22 GB     100% GPU     4 minutes from now

Models are evicted after an idle timeout (default 5 minutes). You can explicitly stop a model:

ollama stop qwen2.5-coder:32b
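
The eviction window can also be set per request. A minimal sketch using the requests package (assumed installed; the keep_alive field is the API-side counterpart of the behavior above, where a duration string extends residency and 0 unloads immediately):

import requests

# Keep the model resident for 30 minutes instead of the 5-minute default.
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "say hello"}],
    "stream": False,
    "keep_alive": "30m",
})
print(resp.json()["message"]["content"])

# Unload immediately, the API-side equivalent of `ollama stop`.
requests.post("http://localhost:11434/api/generate",
              json={"model": "qwen2.5-coder:7b", "keep_alive": 0})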

Memory Budget Planning#

Plan your model loading around your available memory:

| Hardware | Total Memory | Usable for Models | Recommended Setup |
|---|---|---|---|
| Mac Mini M4 Pro (48GB) | 48 GB | ~38 GB (OS needs ~10GB) | 32B daily + 7B worker simultaneously |
| Mac Mini M4 Pro (64GB) | 64 GB | ~52 GB | 70B loaded, or 32B + 7B + embeddings |
| Linux with 24GB VRAM | 24 GB | ~22 GB | 32B quantized, or two 7B models |
| Linux with 48GB VRAM | 48 GB | ~44 GB | 70B quantized |
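
The budgeting itself is simple arithmetic: subtract the OS overhead, then check whether the resident sizes of the models you plan to keep loaded fit in what remains. A quick sketch using the Q4_K_M sizes quoted in this article (rough estimates, not guarantees):

# Approximate resident sizes (GB) at the default Q4_K_M quantization, from this article.
MODEL_SIZES_GB = {"qwen2.5-coder:7b": 5, "qwen2.5-coder:32b": 22}

def fits(models, total_gb, os_overhead_gb=10):
    """Rough check: do these models fit alongside the OS and apps?"""
    usable = total_gb - os_overhead_gb
    return sum(MODEL_SIZES_GB[m] for m in models) <= usable

# 32B + 7B on a 48 GB Mac Mini: 27 GB needed against ~38 GB usable.
print(fits(["qwen2.5-coder:32b", "qwen2.5-coder:7b"], total_gb=48))  # True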

Running Multiple Models#

Ollama can hold multiple models in memory simultaneously if they fit:

# Load a small model for fast extraction
ollama run qwen2.5-coder:7b "extract the function names from this code"

# While 7B is still loaded, call the 32B for correlation
ollama run qwen2.5-coder:32b "analyze these summaries for architectural issues"

# Both stay in memory if RAM permits
ollama ps
# qwen2.5-coder:7b     5 GB    100% GPU    4 minutes
# qwen2.5-coder:32b    22 GB   100% GPU    4 minutes

When memory is tight and you need a larger model:

# Explicitly stop the 32B to free memory for the 70B
ollama stop qwen2.5-coder:32b
ollama run llama3.3:70b "deep analysis of this architecture"
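
The same check-then-free pattern works over the API. A sketch using the /api/ps endpoint listed in the REST API section below (the models, name, and size response fields are my reading of the API; verify them against your Ollama version):

import requests

# See what is already resident and how much memory it holds.
loaded = requests.get("http://localhost:11434/api/ps").json().get("models", [])
for m in loaded:
    print(m["name"], round(m["size"] / 1e9, 1), "GB")

# Free the 32B before requesting the 70B (same effect as `ollama stop`).
if any(m["name"].startswith("qwen2.5-coder:32b") for m in loaded):
    requests.post("http://localhost:11434/api/generate",
                  json={"model": "qwen2.5-coder:32b", "keep_alive": 0})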

Apple Silicon Unified Memory#

On Apple Silicon Macs (M1/M2/M3/M4), CPU and GPU share the same memory pool. This means:

  • Models run on the GPU (Metal) natively with no data copying.
  • Token generation speed is excellent (30-80 tokens/sec for 7B, 15-30 for 32B on M4 Pro).
  • The OS, applications, and models all compete for the same memory pool. Budget 10GB for the OS and apps.
  • There is no discrete VRAM — “GPU memory” and “system RAM” are the same thing.

ARM64 Native#

Ollama on Apple Silicon and ARM64 Linux runs natively. There is no emulation layer. This matters because:

  • Performance is significantly better than x86 emulation (Rosetta or QEMU).
  • Models that depend on specific CPU instructions (AVX2 on x86) may need ARM64-specific builds — Ollama handles this automatically.
  • Docker on ARM64 Macs uses the native ARM64 Ollama image.

The Ollama REST API#

Every Ollama command maps to an HTTP API call. Applications should use the API, not shell out to the CLI:

# Chat completion (single JSON response when stream is false)
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "explain this function"}],
  "stream": false
}'

# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "function calculateTotal(items) { return items.reduce(...) }"
}'

# List loaded models
curl http://localhost:11434/api/ps

API Options#

Control generation behavior per-request:

{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "extract fields as JSON"}],
  "stream": false,
  "options": {
    "temperature": 0.0,
    "num_predict": 1024,
    "num_ctx": 8192,
    "top_p": 0.9
  },
  "format": "json"
}

Key options:

  • temperature — 0.0 for deterministic output (extraction, classification). 0.7+ for creative generation.
  • num_predict — Maximum tokens to generate. Critical for small models that can loop in JSON mode (see Structured Output article).
  • num_ctx — Context window size. Larger contexts use more memory. Default varies by model (4096-131072).
  • format: "json" — Constrains the output to valid JSON. Still instruct the model to respond in JSON in the prompt, or it may emit whitespace until it hits the num_predict limit.
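
As a rough guard against the silent truncation described under num_ctx, size the context to the input before sending. The sketch below uses a common four-characters-per-token heuristic (an approximation, not something Ollama exposes) and a hypothetical input file name:

def rough_num_ctx(prompt: str, reserve_for_output: int = 1024,
                  chars_per_token: int = 4, floor: int = 8192) -> int:
    """Pick a num_ctx large enough for the prompt plus the expected reply.
    The chars-per-token ratio is a rough heuristic, not an exact count."""
    return max(floor, len(prompt) // chars_per_token + reserve_for_output)

prompt = open("big_module.py").read()  # hypothetical input
options = {"temperature": 0.0, "num_predict": 1024, "num_ctx": rough_num_ctx(prompt)}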

Client Libraries#

Go#

import "github.com/ollama/ollama/api"

client, _ := api.ClientFromEnvironment()
resp, _ := client.Chat(ctx, &api.ChatRequest{
    Model:  "qwen2.5-coder:7b",
    Messages: []api.Message{{Role: "user", Content: prompt}},
    Options: map[string]interface{}{
        "temperature": 0.0,
        "num_predict": 1024,
    },
})

Python#

import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.0, "num_predict": 1024},
    format="json",
)
print(response["message"]["content"])

HTTP (Language-Agnostic)#

Any language with an HTTP client can call Ollama. The API is simple JSON over HTTP — no SDK required.
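
For example, a chat call with nothing but the Python standard library, using the same /api/chat endpoint and payload shape as the curl example above:

import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "explain this function"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.load(resp)["message"]["content"])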

Pre-Flight Checks#

Before integrating Ollama into a workflow, verify the setup:

# Is Ollama running?
curl -s http://localhost:11434/api/tags > /dev/null && echo "OK" || echo "Ollama not running"

# Is the required model pulled?
ollama list | grep -q "qwen2.5-coder:7b" && echo "Model ready" || echo "Pull model first"

# How much memory is available?
ollama ps  # Shows loaded models and their memory usage

# Test a generation
ollama run qwen2.5-coder:7b "say hello" --verbose 2>&1 | grep "eval rate"
# Shows tokens/second — expect 30-80 tok/s for 7B on M4 Pro
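
The same checks translate directly into code when a workflow should verify its own environment at startup. A sketch over the HTTP API (assuming the requests package is available and that /api/tags lists local models under a name field):

import requests

OLLAMA = "http://localhost:11434"
MODEL = "qwen2.5-coder:7b"

# 1. Is Ollama running?
try:
    tags = requests.get(f"{OLLAMA}/api/tags", timeout=5).json()
except requests.ConnectionError:
    raise SystemExit("Ollama is not running")

# 2. Is the required model pulled?
if MODEL not in [m["name"] for m in tags.get("models", [])]:
    raise SystemExit(f"Pull the model first: ollama pull {MODEL}")

# 3. Smoke-test a generation.
r = requests.post(f"{OLLAMA}/api/chat", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "say hello"}],
    "stream": False,
}, timeout=120)
r.raise_for_status()
print(r.json()["message"]["content"])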

Common Mistakes#

  1. Not checking loaded models before loading a new one. Ollama does not warn when a model will not fit in GPU memory. It silently spills layers to the CPU, which makes inference 10-50x slower. Check ollama ps and stop unneeded models first.
  2. Using default context window for large inputs. The default context varies by model. If your input exceeds it, the model silently truncates. Set num_ctx explicitly based on your input size.
  3. Shelling out to ollama run instead of using the API. The CLI adds overhead (process startup, output parsing). Use the HTTP API or a client library for programmatic access.
  4. Expecting cloud-model quality from 7B models. A 7B model is excellent for extraction, classification, and structured output. It is not a replacement for GPT-4 or Claude on complex reasoning. Match model size to task complexity.
  5. Not pinning model versions. ollama pull qwen2.5-coder:7b pulls whatever the tag currently points to, which can change upstream. For reproducible results in production, record the model digest (for example from ollama show qwen2.5-coder:7b --modelfile) and verify it has not changed before each run, as in the sketch below.
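
For the last point, a minimal digest check might look like this (assuming /api/tags reports a digest per local model; EXPECTED_DIGEST is a placeholder for a value recorded from the same endpoint when the model was first pulled and validated):

import requests

MODEL = "qwen2.5-coder:7b"
EXPECTED_DIGEST = "..."  # placeholder: record this once from /api/tags and commit it with your config

models = requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
current = next(m["digest"] for m in models if m["name"] == MODEL)
if current != EXPECTED_DIGEST:
    raise SystemExit(f"{MODEL} digest changed: {current}")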