# Ollama Setup and Model Management
Ollama reduces running a local LLM to a single command. It handles model downloads, quantization, and GPU memory allocation, and it exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.
## Installation
```bash
# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Start the Ollama server:
```bash
# macOS: Ollama runs as a menu bar app, or start the server manually
ollama serve

# Linux: systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
```

Verify it is running:
```bash
curl http://localhost:11434/api/tags
```

## Pulling and Running Models
```bash
# Pull a model (downloads once, reuses after)
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Show model details (parameters, quantization, size)
ollama show qwen2.5-coder:7b
```

## Model Naming Convention
Ollama model names follow the pattern `name:tag`, where the tag indicates size and quantization:
```text
qwen2.5-coder:7b        # 7B parameters, default quantization (Q4_K_M)
qwen2.5-coder:7b-q8_0   # 7B parameters, Q8 quantization (higher quality, more memory)
qwen2.5-coder:32b       # 32B parameters
llama3.3:70b            # 70B parameters
phi3:mini               # 3.8B parameters (alias)
```

## Quantization and Quality Tradeoffs
Quantization reduces model precision to use less memory. Ollama models default to Q4_K_M, which is a good balance:
| Quantization | Bits per Weight | Memory (7B) | Memory (32B) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M | ~4.5 | ~5 GB | ~22 GB | Slight degradation, good for most tasks |
| Q5_K_M | ~5.5 | ~6 GB | ~26 GB | Minimal degradation |
| Q6_K | ~6.5 | ~7 GB | ~30 GB | Near-original quality |
| Q8_0 | 8 | ~8 GB | ~36 GB | Essentially lossless |
| FP16 | 16 | ~14 GB | ~64 GB | Original precision |
For code tasks, Q4_K_M is sufficient for extraction and classification. For complex reasoning where every token matters, Q5_K_M or Q6_K can measurably improve output quality.
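A useful back-of-the-envelope check before pulling a tag: weight memory is roughly parameter count times bits per weight divided by 8, plus overhead for the KV cache and runtime buffers (which grows with `num_ctx`). The sketch below approximates the table's figures rather than reproducing Ollama's exact accounting, and the flat 1 GB overhead constant is an assumption.

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead_gb: float = 1.0) -> float:
    """Rough memory estimate: weights plus a flat overhead allowance.

    Real usage also depends on the KV cache (scales with num_ctx) and the
    runtime's own buffers, so treat this as a lower bound.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb + overhead_gb

# 7B at Q4_K_M (~4.5 bits/weight) lands near the ~5 GB row in the table above.
print(f"{estimate_model_memory_gb(7, 4.5):.1f} GB")   # ~4.9 GB
print(f"{estimate_model_memory_gb(32, 4.5):.1f} GB")  # ~19 GB; the KV cache pushes it toward ~22 GB
```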
## Memory Management
This is where most people hit problems. Understanding how Ollama manages GPU memory prevents the most common issues.
### How Ollama Loads Models
When you run or call a model, Ollama loads it into GPU memory (unified memory on Apple Silicon, VRAM on discrete GPUs). The model stays loaded after the request completes for fast subsequent requests.
```bash
# See what models are currently loaded
ollama ps

# Output:
# NAME                 SIZE     PROCESSOR    UNTIL
# qwen2.5-coder:32b    22 GB    100% GPU     4 minutes from now
```

Models are evicted after an idle timeout (default 5 minutes). You can explicitly stop a model:
```bash
ollama stop qwen2.5-coder:32b
```
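The idle timeout is also controllable per request: the REST API accepts a `keep_alive` field on generation requests (a duration string, a number of seconds, 0 to unload immediately, or -1 to keep the model resident), and recent versions of the Python client pass it through. A minimal sketch, assuming a recent `ollama` package:

```python
import ollama

# Keep the model resident for 30 minutes of idle time after this request.
ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "say hello"}],
    keep_alive="30m",
)

# keep_alive=0 unloads the model as soon as the request finishes,
# freeing memory for a larger model without waiting for the idle timeout.
ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "one last extraction"}],
    keep_alive=0,
)
```

The server-wide default can be changed with the `OLLAMA_KEEP_ALIVE` environment variable.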
### Memory Budget Planning

Plan your model loading around your available memory:
| Hardware | Total Memory | Usable for Models | Recommended Setup |
|---|---|---|---|
| Mac Mini M4 Pro (48GB) | 48 GB | ~38 GB (OS needs ~10GB) | 32B daily + 7B worker simultaneously |
| Mac Mini M4 Pro (64GB) | 64 GB | ~52 GB | 70B loaded, or 32B + 7B + embeddings |
| Linux with 24GB VRAM | 24 GB | ~22 GB | 32B quantized, or two 7B models |
| Linux with 48GB VRAM | 48 GB | ~44 GB | 70B quantized |
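To apply a budget like this programmatically, sum what is already loaded (via `/api/ps`, the endpoint behind `ollama ps`) before loading another model. A sketch under the assumptions from the table above; the 10 GB OS reserve and the helper names are illustrative:

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"

def loaded_model_gb() -> float:
    """Memory used by currently loaded models, per /api/ps (sizes are reported in bytes)."""
    with urllib.request.urlopen(f"{OLLAMA}/api/ps") as resp:
        models = json.load(resp).get("models", [])
    return sum(m.get("size", 0) for m in models) / 1e9

def fits_in_budget(new_model_gb: float, total_ram_gb: float, os_reserve_gb: float = 10.0) -> bool:
    """Will a new model fit alongside what is already loaded?"""
    usable = total_ram_gb - os_reserve_gb
    return loaded_model_gb() + new_model_gb <= usable

# On a 48 GB Mac Mini: is there room for the ~22 GB 32B model right now?
print(fits_in_budget(22, total_ram_gb=48))
```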
### Running Multiple Models
Ollama can hold multiple models in memory simultaneously if they fit:
```bash
# Load a small model for fast extraction
ollama run qwen2.5-coder:7b "extract the function names from this code"

# While 7B is still loaded, call the 32B for correlation
ollama run qwen2.5-coder:32b "analyze these summaries for architectural issues"

# Both stay in memory if RAM permits
ollama ps
# qwen2.5-coder:7b     5 GB     100% GPU    4 minutes
# qwen2.5-coder:32b    22 GB    100% GPU    4 minutes
```

When memory is tight and you need a larger model:
```bash
# Explicitly stop the 32B to free memory for the 70B
ollama stop qwen2.5-coder:32b
ollama run llama3.3:70b "deep analysis of this architecture"
```

### Apple Silicon Unified Memory
On Apple Silicon Macs (M1/M2/M3/M4), CPU and GPU share the same memory pool. This means:
- Models run on the GPU (Metal) natively with no data copying.
- Token generation speed is excellent (30-80 tokens/sec for 7B, 15-30 for 32B on M4 Pro).
- The OS, applications, and models all compete for the same memory pool. Budget 10GB for the OS and apps.
- There is no discrete VRAM — “GPU memory” and “system RAM” are the same thing.
### ARM64 Native
Ollama on Apple Silicon and ARM64 Linux runs natively. There is no emulation layer. This matters because:
- Performance is significantly better than x86 emulation (Rosetta or QEMU).
- Models that depend on specific CPU instructions (AVX2 on x86) may need ARM64-specific builds — Ollama handles this automatically.
- Docker on ARM64 Macs uses the native ARM64 Ollama image.
## The Ollama REST API
Every Ollama command maps to an HTTP API call. Applications should use the API, not shell out to the CLI:
```bash
# Generate a completion
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "explain this function"}],
  "stream": false
}'

# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "function calculateTotal(items) { return items.reduce(...) }"
}'

# List loaded models
curl http://localhost:11434/api/ps
```

### API Options
Control generation behavior per-request:
```json
{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "extract fields as JSON"}],
  "stream": false,
  "options": {
    "temperature": 0.0,
    "num_predict": 1024,
    "num_ctx": 8192,
    "top_p": 0.9
  },
  "format": "json"
}
```

Key options:
- `temperature` — 0.0 for deterministic output (extraction, classification). 0.7+ for creative generation.
- `num_predict` — Maximum tokens to generate. Critical for small models that can loop in JSON mode (see the Structured Output article).
- `num_ctx` — Context window size. Larger contexts use more memory. The default varies by model (4096-131072); see the sizing sketch after this list.
- `format: "json"` — Forces JSON output. The model wraps its response in valid JSON.
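Because input beyond `num_ctx` is silently truncated, it is safer to size the context from the input than to rely on the default. The sketch below uses a rough 4-characters-per-token heuristic rather than a real tokenizer count, and the power-of-two rounding is just a convention:

```python
def pick_num_ctx(prompt: str, reserve_for_output: int = 1024,
                 floor: int = 4096, ceiling: int = 32768) -> int:
    """Choose num_ctx from a rough token estimate of the prompt."""
    estimated_tokens = len(prompt) // 4 + reserve_for_output  # ~4 chars/token heuristic
    ctx = floor
    while ctx < estimated_tokens and ctx < ceiling:
        ctx *= 2  # round up to the next power of two, capped at the ceiling
    return ctx

# Example with a hypothetical large source file.
prompt = "Summarize this module:\n" + open("big_module.py").read()
options = {"temperature": 0.0, "num_predict": 1024, "num_ctx": pick_num_ctx(prompt)}
```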
## Client Libraries
### Go
import "github.com/ollama/ollama/api"
client, _ := api.ClientFromEnvironment()
resp, _ := client.Chat(ctx, &api.ChatRequest{
Model: "qwen2.5-coder:7b",
Messages: []api.Message{{Role: "user", Content: prompt}},
Options: map[string]interface{}{
"temperature": 0.0,
"num_predict": 1024,
},
})Python#
```python
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.0, "num_predict": 1024},
    format="json",
)
print(response["message"]["content"])
```

### HTTP (Language-Agnostic)
Any language with an HTTP client can call Ollama. The API is simple JSON over HTTP — no SDK required.
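For example, a non-streaming call to `/api/chat` with nothing but the Python standard library; the same request shape works from any language:

```python
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "explain this function"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["message"]["content"])  # assistant reply for a non-streaming request
```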
## Pre-Flight Checks
Before integrating Ollama into a workflow, verify the setup:
```bash
# Is Ollama running?
curl -s http://localhost:11434/api/tags > /dev/null && echo "OK" || echo "Ollama not running"

# Is the required model pulled?
ollama list | grep -q "qwen2.5-coder:7b" && echo "Model ready" || echo "Pull model first"

# How much memory is available?
ollama ps  # Shows loaded models and their memory usage

# Test a generation
ollama run qwen2.5-coder:7b "say hello" --verbose 2>&1 | grep "eval rate"
# Shows tokens/second — expect 30-80 tok/s for 7B on M4 Pro
```
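The same checks can run at application startup. A sketch against `/api/tags` (the endpoint behind `ollama list`); the `preflight` helper name is illustrative:

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"

def preflight(required_model: str) -> None:
    """Fail fast if the server is down or the model has not been pulled."""
    try:
        with urllib.request.urlopen(f"{OLLAMA}/api/tags", timeout=2) as resp:
            tags = json.load(resp)
    except OSError as exc:  # connection refused, timeout, etc.
        raise RuntimeError("Ollama is not running on localhost:11434") from exc

    names = {m["name"] for m in tags.get("models", [])}
    if required_model not in names:
        raise RuntimeError(f"{required_model} not pulled; run: ollama pull {required_model}")

preflight("qwen2.5-coder:7b")
```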
## Common Mistakes

- Not checking loaded models before loading a new one. Ollama does not warn when loading a model that will not fit. It silently falls back to CPU inference, which is 10-50x slower. Check `ollama ps` and stop unneeded models first.
- Using the default context window for large inputs. The default context varies by model. If your input exceeds it, the model silently truncates. Set `num_ctx` explicitly based on your input size.
- Shelling out to `ollama run` instead of using the API. The CLI adds overhead (process startup, output parsing). Use the HTTP API or a client library for programmatic access.
- Expecting cloud-model quality from 7B models. A 7B model is excellent for extraction, classification, and structured output. It is not a replacement for GPT-4 or Claude on complex reasoning. Match model size to task complexity.
- Not pinning model versions. `ollama pull qwen2.5-coder:7b` pulls the latest tag, which can change. For reproducible results in production, record the model digest from `ollama show --modelfile` and verify it matches (see the sketch after this list).
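One way to pin: record the digest that `/api/tags` reports for each model (the same digest `ollama list` shows, in full) and assert it at startup. A sketch; the pinned value is a placeholder, and the exact digest format should be taken from your own `/api/tags` output:

```python
import json
import urllib.request

# Placeholder digest -- copy the real value from your own /api/tags output.
PINNED = {
    "qwen2.5-coder:7b": "<digest recorded at qualification time>",
}

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    local = {m["name"]: m.get("digest") for m in json.load(resp)["models"]}

for model, expected in PINNED.items():
    actual = local.get(model)
    if actual != expected:
        raise RuntimeError(f"{model}: pinned digest {expected!r}, found {actual!r}")
```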