Prompt Engineering for Local Models#

Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are trained extensively for instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. The prompts that work effortlessly on cloud models often produce garbage on local models.

This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.

How Local Models Differ#

Less Instruction Following#

Cloud models are trained with extensive RLHF (reinforcement learning from human feedback) to follow instructions precisely. A 7B model has less of this training. Result:

# Works on Claude, fails on 7B local
"Analyze this code for security issues. Be thorough but concise.
Focus on injection vulnerabilities and authentication bypass."

# Works on both
"List security issues in this code. For each issue, output:
- LINE: the line number
- TYPE: one of [injection, auth_bypass, xss, ssrf, other]
- SEVERITY: one of [low, medium, high, critical]
- DESCRIPTION: one sentence explaining the issue"

The second prompt constrains the output format so the model does not have to decide how to structure its response.

Narrower Context Understanding#

Cloud models maintain coherence over long prompts. Local models lose focus after a few hundred tokens of instructions. Front-load the important parts:

# Bad: important instruction buried at the end
"Here is a 500-line source file. <file content>
By the way, only focus on the error handling patterns."

# Good: instruction first, content second
"Focus ONLY on error handling patterns in the following code.
For each error handling pattern found, output the function name
and how errors are propagated.

<file content>"

Less Self-Correction#

Cloud models catch their own mistakes mid-generation. Local models commit to their first token and follow through. If the first few tokens go wrong, the rest follows:

# Local model starts generating a list when you wanted prose
"1. The first issue is..."  → continues as a numbered list even if you wanted paragraphs

# Fix: explicitly state the format
"Write your response as continuous paragraphs, not as a list."

The Preset Pattern#

Presets are reusable prompt templates with a defined focus area and output format. Instead of writing a new prompt for every task, select a preset.

Defining Presets#

PRESETS = {
    "architecture": {
        "system": "You are a software architect analyzing code structure.",
        "focus": "dependencies, imports, data flow, coupling between components, design patterns",
        "output_format": "Organize findings by theme. Reference file names.",
        "question": "How do the components of this codebase fit together?",
    },
    "security": {
        "system": "You are a security auditor reviewing code for vulnerabilities.",
        "focus": "input validation, authentication, authorization, secrets in code, error messages that leak information, injection points",
        "output_format": "For each finding: file, line/function, severity (low/medium/high/critical), description, fix.",
        "question": "What security vulnerabilities exist in this code?",
    },
    "review": {
        "system": "You are a senior developer reviewing code for bugs.",
        "focus": "bugs, edge cases, off-by-one errors, null/nil handling, unchecked errors, race conditions, resource leaks",
        "output_format": "For each issue: file, function, severity, description, suggested fix.",
        "question": "What bugs and issues exist in this code?",
    },
    "consistency": {
        "system": "You are reviewing code for consistency across a codebase.",
        "focus": "naming conventions, error handling patterns, logging patterns, API conventions, code style",
        "output_format": "Group inconsistencies by category. Show examples from specific files.",
        "question": "What inconsistencies exist across this codebase?",
    },
    "document": {
        "system": "You are a technical writer documenting code.",
        "focus": "purpose, public API, parameters, return values, side effects, usage examples",
        "output_format": "Markdown documentation with code examples.",
        "question": "Generate documentation for this code.",
    },
}

Using Presets#

def build_prompt(preset_name: str, content: str, custom_question: str | None = None) -> list[dict]:
    preset = PRESETS[preset_name]
    question = custom_question or preset["question"]

    return [
        {"role": "system", "content": preset["system"]},
        {"role": "user", "content": f"""Focus on: {preset["focus"]}

Output format: {preset["output_format"]}

{content}

Question: {question}"""},
    ]

Presets eliminate prompt drift — the tendency to tweak prompts per-run until they work for one example but fail on others. A well-tested preset works consistently across inputs.
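A usage sketch, assuming the ollama Python client and a model you have already pulled (the model name is an example; source_code stands in for the text being analyzed):

import ollama

messages = build_prompt("security", source_code)
response = ollama.chat(
    model="qwen2.5:7b",              # example 7B-class model
    messages=messages,
    options={"temperature": 0.0},    # deterministic output for analysis tasks
)
print(response["message"]["content"])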

Schema-Driven Prompts#

For structured output, include the exact schema in the prompt. This is the single most impactful technique for local models.

Extraction Schema#

import json

def extraction_prompt(text: str, schema: dict) -> str:
    return f"""Extract information from the following text.
Return a JSON object matching this EXACT schema:

{json.dumps(schema, indent=2)}

Rules:
- Use ONLY the values specified in "enum" fields.
- Set fields to null if the information is not present.
- Do not add fields not in the schema.

Text:
{text}"""

Classification Schema#

LABELS = ["bug", "feature", "question", "docs"]

def classification_prompt(text: str) -> str:
    return f"""Classify the following text into exactly ONE category.

Categories: {json.dumps(LABELS)}

Return JSON: {{"category": "<one of the categories>", "confidence": <0.0 to 1.0>}}

Text:
{text}"""

The explicit listing of valid values and the exact JSON format leaves no ambiguity for the model.
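Even with the exact format spelled out, validate the response before trusting it. A minimal sketch, assuming raw holds the model's text output:

import json

def parse_classification(raw: str) -> dict:
    result = json.loads(raw)                      # fails loudly if the output is not JSON
    if result.get("category") not in LABELS:
        # The model invented a label outside the allowed set; treat it as a failure.
        raise ValueError(f"invalid category: {result.get('category')!r}")
    confidence = float(result.get("confidence", 0.0))
    result["confidence"] = min(max(confidence, 0.0), 1.0)   # clamp to [0.0, 1.0]
    return result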

Prompt Debugging#

When a prompt produces wrong output, diagnose systematically:

1. Check the Output Format#

Is the model generating the right format? If you expect JSON and get prose, the format instruction is not strong enough.

Fix: Add format="json" to the Ollama call AND include JSON format instructions in the prompt. Belt and suspenders.
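With the ollama Python client, the belt-and-suspenders version looks roughly like this (the model name and issue_text are examples):

import json
import ollama

response = ollama.chat(
    model="qwen2.5:7b",                                  # example model name
    messages=[{"role": "user", "content": classification_prompt(issue_text)}],
    format="json",                                       # Ollama-side constraint: output must be valid JSON
    options={"temperature": 0.0},
)
data = json.loads(response["message"]["content"])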

2. Check the First Few Tokens#

Local models commit early. If the first token is wrong, everything after follows:

Expected: {"category": "bug", ...}
Got:      The category of this issue is bug because...

Fix: Start the assistant message with the opening brace:
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": "{"},  # Prime the model to start with JSON
]
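Depending on the client, the returned text may or may not include the brace you supplied, so normalize before parsing (this assumes response comes from the chat call above and json is imported):

raw = response["message"]["content"]
if not raw.lstrip().startswith("{"):
    raw = "{" + raw          # re-attach the primed opening brace
data = json.loads(raw)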

3. Check Token Budget#

If JSON output is truncated, num_predict is too low:

Got: {"category": "bug", "details": "This is a significant issue that affects the core authenticat

Fix: Increase num_predict or simplify the requested output.
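num_predict goes in the same options dict as temperature; the value below is an example, sized to the expected JSON:

options={"temperature": 0.0, "num_predict": 512}   # enough room for the full JSON object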

4. Check Temperature#

Temperature > 0 adds randomness. For extraction and classification, randomness is noise:

# WRONG for structured output
options={"temperature": 0.7}

# RIGHT for structured output
options={"temperature": 0.0}

Use temperature > 0 only for generation tasks where variety is desired (writing, brainstorming).

5. Compare Against a Larger Model#

If the prompt works on 32B but fails on 7B, the task may be too complex for the small model. Options:

  • Simplify the prompt (fewer fields, simpler schema)
  • Split into multiple smaller calls (see the sketch after this list)
  • Use the larger model
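A sketch of the split-into-smaller-calls option: instead of one call that asks for bugs, severities, and fixes in a single nested JSON object, make two simple calls. Here run_json is a hypothetical helper that sends a prompt with format="json" and returns the parsed result, and code holds the source under review.

# First call: just find the bugs.
bugs = run_json(
    model,
    "List bugs in this code as a JSON array of "
    '{"file": "...", "line": N, "description": "..."} objects.\n\n' + code,
)

# Second call(s): rate each bug separately, one simple decision per call.
for bug in bugs:
    rating = run_json(
        model,
        "Rate the severity of this bug as JSON "
        '{"severity": "low|medium|high|critical"}.\n\n' + json.dumps(bug),
    )
    bug["severity"] = rating["severity"]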

Prompt Anti-Patterns for Local Models#

Too Many Instructions#

# Bad: 8 instructions, local model loses track after 3
"Analyze this code. Focus on security. Also check performance.
Consider edge cases. Look at error handling. Check naming conventions.
Verify the API contract. Suggest refactoring opportunities."

# Good: one clear instruction
"List security vulnerabilities in this code.
For each: file, line, severity (low/medium/high/critical), description."

Implicit Output Format#

# Bad: model decides format
"What's wrong with this code?"

# Good: explicit format
"List issues in this code as a JSON array:
[{\"file\": \"...\", \"line\": N, \"issue\": \"...\", \"severity\": \"...\"}]"

Negative Instructions#

# Bad: models are weak at "don't"
"Don't include explanations. Don't add commentary. Don't use markdown."

# Good: say what you want, not what you don't want
"Output ONLY the JSON object. No text before or after the JSON."

Asking for Confidence Without Calibration#

Small models are poorly calibrated on confidence. A 7B model saying “confidence: 0.95” does not mean it is 95% likely to be correct. Use confidence scores for relative ranking (higher is more likely correct than lower), not as absolute probabilities.
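In practice, use the scores only to order results, for example to decide which classifications get routed to a human (findings here is a hypothetical list of classified items, each with a "confidence" field):

# The ordering is usually meaningful even when the absolute numbers are not:
# hand the least confident classifications to a person instead of trusting them.
findings.sort(key=lambda f: f["confidence"])
needs_review = findings[: len(findings) // 5]   # the least confident 20%, a relative cutoff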

Testing Prompts#

Never deploy a prompt tested on one example. Build a small test suite:

TEST_CASES = [
    {"input": "...", "expected": {"category": "bug", ...}},
    {"input": "...", "expected": {"category": "feature", ...}},
    {"input": "...", "expected": {"category": "question", ...}},
    # Include edge cases:
    {"input": "(empty string)", "expected": {"category": "other", ...}},
    {"input": "(ambiguous input)", "expected": {"category": "question", ...}},
    {"input": "(very long input)", "expected": {"category": "bug", ...}},
]

def test_prompt(prompt_fn, model, test_cases):
    results = []
    for case in test_cases:
        actual = run_prompt(prompt_fn, model, case["input"])
        match = actual == case["expected"]
        results.append({"input": case["input"][:50], "match": match, "actual": actual})

    accuracy = sum(r["match"] for r in results) / len(results)
    print(f"Accuracy: {accuracy:.0%} ({sum(r['match'] for r in results)}/{len(results)})")
    return results

Test across at least 3 models at the target size tier. A prompt that works on Qwen but fails on Llama is too model-specific — generalize it.
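A sketch of that check, reusing test_prompt from above (run_prompt is assumed to send the prompt to the model and parse the reply; the model names are examples, substitute whatever you have pulled):

MODELS = ["qwen2.5:7b", "llama3.1:8b", "mistral:7b"]   # example models in the 7B tier

for model in MODELS:
    print(f"--- {model} ---")
    test_prompt(classification_prompt, model, TEST_CASES)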

Common Mistakes#

  1. Using cloud-model prompting style on local models. Vague instructions, implicit formats, and conversational prompts work on GPT-4 and Claude. Local models need explicit structure, constrained output, and front-loaded instructions.
  2. Not using the system message. The system message sets the model’s role and behavior. Local models respond noticeably better when given a clear system role (“You are a security auditor”) versus user-only prompts.
  3. Changing prompts based on one failure. A prompt that works 90% of the time should not be rewritten because of one bad output. Test on a suite. If accuracy is below your threshold, then adjust.
  4. Including unnecessary context. Local models have smaller effective context windows. Every token of irrelevant context reduces the quality of the response. Send only what the model needs.
  5. Expecting chain-of-thought from 3-4B models. Small models cannot reliably reason through multiple steps. If the task requires reasoning, either use a larger model or decompose the task into sequential calls where each step is simple.