Prompt Engineering for Local Models#
Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are heavily tuned for instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. Prompts that work effortlessly on cloud models often produce garbage on local models.
This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.
How Local Models Differ#
Less Instruction Following#
Cloud models are trained with extensive RLHF (reinforcement learning from human feedback) to follow instructions precisely. A 7B model has less of this training. Result:
# Works on Claude, fails on 7B local
"Analyze this code for security issues. Be thorough but concise.
Focus on injection vulnerabilities and authentication bypass."
# Works on both
"List security issues in this code. For each issue, output:
- LINE: the line number
- TYPE: one of [injection, auth_bypass, xss, ssrf, other]
- SEVERITY: one of [low, medium, high, critical]
- DESCRIPTION: one sentence explaining the issue"
The second prompt constrains the output format so the model does not have to decide how to structure its response.
Narrower Context Understanding#
Cloud models maintain coherence over long prompts. Local models lose focus after a few hundred tokens of instructions. Front-load the important parts:
# Bad: important instruction buried at the end
"Here is a 500-line source file. <file content>
By the way, only focus on the error handling patterns."
# Good: instruction first, content second
"Focus ONLY on error handling patterns in the following code.
For each error handling pattern found, output the function name
and how errors are propagated.
<file content>"
Less Self-Correction#
Cloud models catch their own mistakes mid-generation. Local models commit to their first token and follow through. If the first few tokens go wrong, the rest follows:
# Local model starts generating a list when you wanted prose
"1. The first issue is..." → continues as a numbered list even if you wanted paragraphs
# Fix: explicitly state the format
"Write your response as continuous paragraphs, not as a list."The Preset Pattern#
Presets are reusable prompt templates with a defined focus area and output format. Instead of writing a new prompt for every task, select a preset.
Defining Presets#
PRESETS = {
    "architecture": {
        "system": "You are a software architect analyzing code structure.",
        "focus": "dependencies, imports, data flow, coupling between components, design patterns",
        "output_format": "Organize findings by theme. Reference file names.",
        "question": "How do the components of this codebase fit together?",
    },
    "security": {
        "system": "You are a security auditor reviewing code for vulnerabilities.",
        "focus": "input validation, authentication, authorization, secrets in code, error messages that leak information, injection points",
        "output_format": "For each finding: file, line/function, severity (low/medium/high/critical), description, fix.",
        "question": "What security vulnerabilities exist in this code?",
    },
    "review": {
        "system": "You are a senior developer reviewing code for bugs.",
        "focus": "bugs, edge cases, off-by-one errors, null/nil handling, unchecked errors, race conditions, resource leaks",
        "output_format": "For each issue: file, function, severity, description, suggested fix.",
        "question": "What bugs and issues exist in this code?",
    },
    "consistency": {
        "system": "You are reviewing code for consistency across a codebase.",
        "focus": "naming conventions, error handling patterns, logging patterns, API conventions, code style",
        "output_format": "Group inconsistencies by category. Show examples from specific files.",
        "question": "What inconsistencies exist across this codebase?",
    },
    "document": {
        "system": "You are a technical writer documenting code.",
        "focus": "purpose, public API, parameters, return values, side effects, usage examples",
        "output_format": "Markdown documentation with code examples.",
        "question": "Generate documentation for this code.",
    },
}
Using Presets#
def build_prompt(preset_name: str, content: str, custom_question: str = None) -> list[dict]:
    preset = PRESETS[preset_name]
    question = custom_question or preset["question"]
    return [
        {"role": "system", "content": preset["system"]},
        {"role": "user", "content": f"""Focus on: {preset["focus"]}
Output format: {preset["output_format"]}
{content}
Question: {question}"""},
    ]
Presets eliminate prompt drift — the tendency to tweak prompts per-run until they work for one example but fail on others. A well-tested preset works consistently across inputs.
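As an illustration, here is a minimal sketch of sending a preset-built prompt to a local model, assuming the ollama Python client and a locally pulled qwen2.5:7b model (both illustrative choices, not requirements of the preset pattern):
import ollama  # assumption: the ollama Python client, with a local daemon running

def run_preset(preset_name: str, content: str, model: str = "qwen2.5:7b") -> str:
    """Build messages from a preset and send them to a local model."""
    messages = build_prompt(preset_name, content)
    response = ollama.chat(
        model=model,
        messages=messages,
        options={"temperature": 0.0},  # deterministic output for analysis tasks
    )
    return response["message"]["content"]

# Example: run the security preset over one source file (path is hypothetical)
with open("app/handlers.py") as f:
    print(run_preset("security", f.read()))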
Schema-Driven Prompts#
For structured output, include the exact schema in the prompt. This is the single most impactful technique for local models.
Extraction Schema#
import json

def extraction_prompt(text: str, schema: dict) -> str:
    return f"""Extract information from the following text.
Return a JSON object matching this EXACT schema:
{json.dumps(schema, indent=2)}
Rules:
- Use ONLY the values specified in "enum" fields.
- Set fields to null if the information is not present.
- Do not add fields not in the schema.
Text:
{text}"""
Classification Schema#
LABELS = ["bug", "feature", "question", "docs"]
def classification_prompt(text: str) -> str:
    return f"""Classify the following text into exactly ONE category.
Categories: {json.dumps(LABELS)}
Return JSON: {{"category": "<one of the categories>", "confidence": <0.0 to 1.0>}}
Text:
{text}"""
The explicit listing of valid values and the exact JSON format leaves no ambiguity for the model.
Prompt Debugging#
When a prompt produces wrong output, diagnose systematically:
1. Check the Output Format#
Is the model generating the right format? If you expect JSON and get prose, the format instruction is not strong enough.
Fix: Add format="json" to the Ollama call AND include JSON format instructions in the prompt. Belt and suspenders.
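A sketch of both layers together, assuming the ollama Python client (the classify helper and model tag are illustrative):
import ollama

def classify(text: str, model: str = "qwen2.5:7b") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": classification_prompt(text)}],
        format="json",                     # belt: constrain decoding to valid JSON
        options={"temperature": 0.0},
    )
    return response["message"]["content"]  # suspenders: the prompt itself also demands JSON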
2. Check the First Few Tokens#
Local models commit early. If the first token is wrong, everything after follows:
Expected: {"category": "bug", ...}
Got: The category of this issue is bug because...
Fix: Start the assistant message with the opening brace:
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": "{"},  # Prime the model to start with JSON
]
3. Check Token Budget#
If JSON output is truncated, num_predict is too low:
Got: {"category": "bug", "details": "This is a significant issue that affects the core authenticatFix: Increase num_predict or simplify the requested output.
4. Check Temperature#
Temperature > 0 adds randomness. For extraction and classification, randomness is noise:
# WRONG for structured output
options={"temperature": 0.7}
# RIGHT for structured output
options={"temperature": 0.0}Use temperature > 0 only for generation tasks where variety is desired (writing, brainstorming).
5. Compare Against a Larger Model#
If the prompt works on 32B but fails on 7B, the task may be too complex for the small model. Options:
- Simplify the prompt (fewer fields, simpler schema)
- Split into multiple smaller calls (see the sketch after this list)
- Use the larger model
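A minimal sketch of the second option, splitting one complex review into two simple sequential calls; the ollama client usage, model tag, and helper names are illustrative assumptions:
import json
import ollama

MODEL = "qwen2.5:7b"  # illustrative

def ask_json(prompt: str) -> dict:
    """One small, single-purpose call with constrained JSON output."""
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0},
    )
    return json.loads(response["message"]["content"])

def review_in_two_steps(code: str) -> dict:
    # Step 1: extraction only, no judgment required
    functions = ask_json(
        'List the function names defined in this code. '
        'Return JSON: {"functions": ["..."]}\n\n' + code
    )
    # Step 2: one simple judgment, grounded in step 1's output
    verdict = ask_json(
        f'Which of these functions lack error handling: {functions["functions"]}? '
        'Return JSON: {"missing_error_handling": ["..."]}\n\n' + code
    )
    return {**functions, **verdict}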
Prompt Anti-Patterns for Local Models#
Too Many Instructions#
# Bad: 8 instructions, local model loses track after 3
"Analyze this code. Focus on security. Also check performance.
Consider edge cases. Look at error handling. Check naming conventions.
Verify the API contract. Suggest refactoring opportunities."
# Good: one clear instruction
"List security vulnerabilities in this code.
For each: file, line, severity (low/medium/high/critical), description."
Implicit Output Format#
# Bad: model decides format
"What's wrong with this code?"
# Good: explicit format
"List issues in this code as a JSON array:
[{\"file\": \"...\", \"line\": N, \"issue\": \"...\", \"severity\": \"...\"}]"Negative Instructions#
# Bad: models are weak at "don't"
"Don't include explanations. Don't add commentary. Don't use markdown."
# Good: say what you want, not what you don't want
"Output ONLY the JSON object. No text before or after the JSON."Asking for Confidence Without Calibration#
Small models are poorly calibrated on confidence. A 7B model saying “confidence: 0.95” does not mean it is 95% likely to be correct. Use confidence scores for relative ranking (higher is more likely correct than lower), not as absolute probabilities.
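A minimal sketch of ranking-only use, assuming each result carries the confidence field from the classification schema above (the review fraction is arbitrary):
def triage(results: list[dict], review_fraction: float = 0.2) -> list[dict]:
    """Flag the lowest-confidence classifications for human review.

    Confidence is used only to ORDER items, never read as a probability.
    """
    ranked = sorted(results, key=lambda r: r["confidence"])
    cutoff = max(1, int(len(ranked) * review_fraction))
    for item in ranked[:cutoff]:
        item["needs_review"] = True
    return ranked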
Testing Prompts#
Never deploy a prompt tested on one example. Build a small test suite:
TEST_CASES = [
    {"input": "...", "expected": {"category": "bug", ...}},
    {"input": "...", "expected": {"category": "feature", ...}},
    {"input": "...", "expected": {"category": "question", ...}},
    # Include edge cases:
    {"input": "(empty string)", "expected": {"category": "other", ...}},
    {"input": "(ambiguous input)", "expected": {"category": "question", ...}},
    {"input": "(very long input)", "expected": {"category": "bug", ...}},
]

def test_prompt(prompt_fn, model, test_cases):
    results = []
    for case in test_cases:
        actual = run_prompt(prompt_fn, model, case["input"])
        match = actual == case["expected"]
        results.append({"input": case["input"][:50], "match": match, "actual": actual})
    accuracy = sum(r["match"] for r in results) / len(results)
    print(f"Accuracy: {accuracy:.0%} ({sum(r['match'] for r in results)}/{len(results)})")
    return results
Test across at least 3 models at the target size tier. A prompt that works on Qwen but fails on Llama is too model-specific — generalize it.
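A sketch of that cross-model check, reusing test_prompt from above (the model tags are illustrative examples of the 7-8B tier):
# Illustrative 7-8B tier tags; substitute whatever models you target
for model in ["qwen2.5:7b", "llama3.1:8b", "mistral:7b"]:
    print(f"--- {model} ---")
    test_prompt(classification_prompt, model, TEST_CASES)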
Common Mistakes#
- Using cloud-model prompting style on local models. Vague instructions, implicit formats, and conversational prompts work on GPT-4 and Claude. Local models need explicit structure, constrained output, and front-loaded instructions.
- Not using the system message. The system message sets the model’s role and behavior. Local models respond noticeably better when given a clear system role (“You are a security auditor”) versus user-only prompts.
- Changing prompts based on one failure. A prompt that works 90% of the time should not be rewritten because of one bad output. Test on a suite. If accuracy is below your threshold, then adjust.
- Including unnecessary context. Local models have smaller effective context windows. Every token of irrelevant context reduces the quality of the response. Send only what the model needs.
- Expecting chain-of-thought from 3-4B models. Small models cannot reliably reason through multiple steps. If the task requires reasoning, either use a larger model or decompose the task into sequential calls where each step is simple.