Structured Output from Small Local Models#
Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model’s limited reasoning capacity is focused on filling fields rather than deciding what to generate.
This is where local models genuinely compete with — and sometimes match — models 30x their size.
JSON Mode#
Ollama’s JSON mode forces the model to produce valid JSON:
import ollama
response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{
        "role": "user",
        "content": """Extract the following fields from this support ticket as JSON:
- category (one of: billing, technical, account, other)
- priority (one of: low, medium, high, critical)
- summary (one sentence)
Ticket: "My credit card was charged twice for the same order #12345.
I need an immediate refund for the duplicate charge of $49.99."
"""
    }],
    format="json",
    options={"temperature": 0.0, "num_predict": 512},
)
print(response["message"]["content"])
Output:
{
  "category": "billing",
  "priority": "high",
  "summary": "Customer was charged twice for order #12345 and needs a refund of $49.99."
}
The Token Runaway Problem#
Small models in JSON mode can enter a loop where they generate thousands of repetitive tokens — repeating fields, nesting infinitely, or producing valid-looking JSON that never terminates.
{"category": "billing", "priority": "high", "summary": "...",
"details": {"category": "billing", "priority": "high", "summary": "...",
"details": {"category": "billing", ...The fix is num_predict. Always set a maximum output token limit:
response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[...],
    format="json",
    options={
        "temperature": 0.0,
        "num_predict": 1024,  # CRITICAL: cap output tokens
    },
)
For extraction tasks, 256-1024 tokens is almost always sufficient. For complex multi-field schemas, 2048 may be needed. Never leave num_predict unlimited with small models in JSON mode.
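If the cap is hit, the output is usually truncated JSON that fails to parse. Here is a minimal sketch of one way to detect that and retry once with a larger cap; the chat_json helper name and the doubling heuristic are illustrative assumptions, not part of Ollama's API:
import json
import ollama

def chat_json(prompt: str, model: str = "qwen2.5-coder:7b", num_predict: int = 512) -> dict:
    """Call the model in JSON mode; retry once with a larger cap if the output looks truncated."""
    for cap in (num_predict, num_predict * 2):  # one retry with a doubled cap (assumption)
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            format="json",
            options={"temperature": 0.0, "num_predict": cap},
        )
        try:
            return json.loads(response["message"]["content"])
        except json.JSONDecodeError:
            continue  # likely cut off at the token cap, so give it more room once
    raise ValueError("model returned invalid JSON even after retrying with a larger cap")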
Schema-in-Prompt Pattern#
The most reliable pattern for structured extraction: include the exact JSON schema in the prompt.
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account", "other"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "summary": {"type": "string", "maxLength": 100},
        "action_required": {"type": "boolean"},
    },
    "required": ["category", "priority", "summary", "action_required"]
}
prompt = f"""Extract information from the following text.
Return a JSON object matching this schema:
{json.dumps(SCHEMA, indent=2)}
Text: {input_text}"""
Including enum values in the schema dramatically improves accuracy. The model selects from the provided options instead of generating arbitrary values.
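To close the loop, you can validate the parsed response against the same schema you put in the prompt. A sketch continuing the snippet above, using the third-party jsonschema package (an assumption here; Ollama itself does not require it):
import json
import ollama
from jsonschema import ValidationError, validate  # third-party: pip install jsonschema

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": prompt}],
    format="json",
    options={"temperature": 0.0, "num_predict": 512},
)
result = json.loads(response["message"]["content"])
try:
    validate(instance=result, schema=SCHEMA)  # rejects missing fields and out-of-enum values
except ValidationError as err:
    # Retry, route to a larger model, or flag for review -- whatever your pipeline prefers.
    raise ValueError(f"model output failed schema validation: {err.message}") from err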
Structured Extraction#
Invoice Parsing#
def extract_invoice(text: str) -> dict:
    prompt = f"""Extract invoice details from this text as JSON with these fields:
- invoice_number: string
- date: string (YYYY-MM-DD format)
- vendor: string
- line_items: array of {{description: string, quantity: number, unit_price: number}}
- total: number
- currency: string (3-letter code)
Text:
{text}"""
    response = ollama.chat(
        model="qwen2.5-coder:7b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 1024},
    )
    return json.loads(response["message"]["content"])
Log Event Parsing#
def parse_log_event(log_line: str) -> dict:
    prompt = f"""Parse this log line into structured JSON:
- timestamp: string (ISO 8601)
- level: string (one of: DEBUG, INFO, WARN, ERROR, FATAL)
- service: string
- message: string
- error_code: string or null
- stack_trace: boolean
Log: {log_line}"""
    response = ollama.chat(
        model="ministral:3b",  # Small model is sufficient for single-line parsing
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
Classification and Routing#
Classification is the sweet spot for small models. The output space is small and well-defined.
Multi-Label Classification#
CATEGORIES = ["bug", "feature-request", "question", "documentation", "security"]
def classify_issue(title: str, body: str) -> dict:
    prompt = f"""Classify this GitHub issue. Return JSON with:
- primary_category: one of {CATEGORIES}
- confidence: number between 0.0 and 1.0
- suggested_labels: array of up to 3 strings from {CATEGORIES}
Issue title: {title}
Issue body: {body}"""
    response = ollama.chat(
        model="qwen3:4b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
Routing with Confidence Threshold#
Use the model’s confidence to route uncertain classifications to a larger model:
result = classify_issue(title, body)
if result["confidence"] >= 0.8:
    # Small model is confident — use its classification
    apply_label(result["primary_category"])
elif result["confidence"] >= 0.5:
    # Medium confidence — use 32B for a second opinion
    result_32b = classify_issue_32b(title, body)
    apply_label(result_32b["primary_category"])
else:
    # Low confidence — flag for human review
    flag_for_review(title, body, result)
Function Calling#
Function calling with small models works when the tool set is small and each tool's schema is well-defined.
TOOLS = [
    {
        "name": "search_docs",
        "description": "Search documentation for a topic",
        "parameters": {
            "query": {"type": "string", "description": "search query"},
            "section": {"type": "string", "enum": ["api", "guides", "faq"]},
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "parameters": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "category": {"type": "string", "enum": ["billing", "technical", "account"]},
        }
    },
    {
        "name": "check_status",
        "description": "Check status of an order or ticket",
        "parameters": {
            "id": {"type": "string", "description": "order or ticket ID"},
        }
    },
]
def select_tool(user_message: str) -> dict:
    prompt = f"""Given the user message, select the appropriate tool and fill in its parameters.
Return JSON with:
- tool: the tool name
- parameters: object with the tool's parameters filled in
Available tools:
{json.dumps(TOOLS, indent=2)}
User message: {user_message}"""
    response = ollama.chat(
        model="ministral:3b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
Small models reliably select the correct tool when there are 3-8 tools with clear descriptions. Beyond 10-15 tools, accuracy drops and you should either use a larger model or pre-filter the tool list.
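One way to pre-filter is a cheap lexical pass that keeps only the tools whose name or description shares words with the user message, then hands that short list to select_tool. A rough sketch; the word-overlap heuristic and the cutoff of 8 tools are assumptions to tune:
def prefilter_tools(user_message: str, tools: list[dict], max_tools: int = 8) -> list[dict]:
    """Rank tools by word overlap between the message and each tool's name/description."""
    message_words = set(user_message.lower().split())

    def overlap(tool: dict) -> int:
        text = f"{tool['name']} {tool['description']}".lower().replace("_", " ")
        return len(message_words & set(text.split()))

    return sorted(tools, key=overlap, reverse=True)[:max_tools]
With a large catalog you would build the prompt from prefilter_tools(user_message, tools) instead of the full list; embedding-based retrieval is the natural upgrade once keyword overlap gets too crude.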
Scoring Extraction Quality#
Never trust model output without measurement. Score extraction quality with deterministic metrics:
Field-Level Exact Match#
def score_extraction(expected: dict, actual: dict) -> dict:
    scores = {}
    for field, expected_value in expected.items():
        actual_value = actual.get(field)
        if actual_value is None:
            scores[field] = 0.0
        elif isinstance(expected_value, bool):
            # Check bool before int/float: bool is a subclass of int in Python
            scores[field] = 1.0 if actual_value == expected_value else 0.0
        elif isinstance(expected_value, str):
            scores[field] = 1.0 if str(actual_value).strip().lower() == expected_value.strip().lower() else 0.0
        elif isinstance(expected_value, (int, float)):
            # Guard the type so a string where a number was expected scores 0 instead of raising
            scores[field] = 1.0 if isinstance(actual_value, (int, float)) and abs(actual_value - expected_value) < 0.01 else 0.0
        else:
            scores[field] = 1.0 if actual_value == expected_value else 0.0
    scores["overall"] = sum(scores.values()) / len(scores) if scores else 0.0
    return scores
F1 Score for Multi-Label#
def f1_score(expected_labels: set, actual_labels: set) -> float:
    if not expected_labels and not actual_labels:
        return 1.0
    if not expected_labels or not actual_labels:
        return 0.0
    true_positives = len(expected_labels & actual_labels)
    precision = true_positives / len(actual_labels) if actual_labels else 0
    recall = true_positives / len(expected_labels) if expected_labels else 0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
Running a Scoring Suite#
import json
# Load test fixtures
with open("testdata/invoices.json") as f:
test_cases = json.load(f)
results = []
for case in test_cases:
actual = extract_invoice(case["input"])
score = score_extraction(case["expected"], actual)
results.append({"case": case["name"], "score": score["overall"], "details": score})
avg_score = sum(r["score"] for r in results) / len(results)
print(f"Average extraction accuracy: {avg_score:.1%}")
print(f"Cases below 80%: {sum(1 for r in results if r['score'] < 0.8)}/{len(results)}")Run scoring suites against every model change, prompt change, and quantization change. Regressions in extraction quality are silent — you will not notice them without automated testing.
Common Mistakes#
- Not setting num_predict in JSON mode. The single most common issue with local model structured output. Small models can generate 10,000+ tokens of repetitive JSON without this limit.
- Using high temperature for extraction. Temperature adds randomness. Extraction should be deterministic. Always use temperature: 0.0 for structured output tasks.
- Providing open-ended schemas without enums. A field like category: string gives the model unlimited options. category: one of [billing, technical, account] constrains it to valid values.
- Not validating JSON output. Even with JSON mode, the output may be truncated (hit token limit) or have wrong types. Always parse with error handling and validate against the expected schema.
- Testing on one example and deploying. Small models are less consistent than large models. A prompt that works on your test example may fail on edge cases. Build a test suite of 20+ examples covering variations and edge cases.