An agent fleet wired to a high-volume trigger source — channel mentions, queue events, webhooks — pays full cost on every cycle, even when the trigger is noise. A classifier placed in front of the main agent decides which triggers deserve a real cycle and which to drop. The pattern is old; what is new is that local LLMs make the classifier cost effectively zero, which flips the arithmetic in the pattern’s favor for cases that previously didn’t justify the latency.
The pattern in one diagram

    Trigger event (mention, queue item, webhook, etc.)
                            │
                            ▼
                   ┌─────────────────┐
                   │   Wake-filter   │  ← cheap classifier
                   │  (small model)  │    "Should the main agent run on this?"
                   └─────────────────┘
                            │
                  ┌─────────┴─────────┐
                  ▼                   ▼
            classify=wake       classify=skip
                  │                   │
                  ▼                   ▼
          ┌──────────────┐   (no-op; cycle ends)
          │  Main agent  │  ← expensive call
          │  (frontier)  │
          └──────────────┘

The classifier is a single LLM call with a focused prompt and a structured output. The main agent is the full reasoning loop: multi-turn, tool-calling, expensive. The wake-filter’s job is to spend a few cents (or less) to decide whether to spend dollars.
Cost arithmetic
The pattern only pays off when the math works. Three regimes, with signal_rate = the fraction of triggers that are real work:
| Scenario | Per-trigger cost (no wake-filter) | Per-trigger cost (with wake-filter) | Verdict |
|---|---|---|---|
| Frontier-API main + frontier-API filter | $X | $X × signal_rate + $small | Improves only if classifier is meaningfully cheaper than main |
| Frontier-API main + local filter | $X | $X × signal_rate + ~$0 | Wins significantly when signal_rate < 50% |
| Local main + local filter | ~$0 | ~$0 | Latency optimization, not cost |
The middle row is where the local-LLM era flips the decision. Before cheap local inference, a wake-filter’s per-call cost ate into the savings; the pattern paid off only when noise was overwhelming and the main agent was extremely expensive. With a local classifier the per-call cost rounds to zero, so the break-even shifts dramatically: even a 30% noise rate justifies the pattern.
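The break-even can be checked with a few lines of arithmetic. The sketch below assumes a classifier that never misclassifies, and the dollar figures are illustrative placeholders rather than measurements; `perTriggerCost` and `breakEvenNoiseRate` are hypothetical helper names.

```go
package main

import "fmt"

// perTriggerCost returns the expected cost of one trigger when a wake-filter
// sits in front of the main agent. Assumption: the filter is always right, so
// every trigger pays filterCost and only the signalRate fraction pays mainCost.
func perTriggerCost(mainCost, filterCost, signalRate float64) float64 {
	return filterCost + signalRate*mainCost
}

// breakEvenNoiseRate returns the noise fraction above which the filter saves
// money: the filter pays off once (1 - signalRate) * mainCost > filterCost.
func breakEvenNoiseRate(mainCost, filterCost float64) float64 {
	return filterCost / mainCost
}

func main() {
	const mainCost = 0.50  // illustrative frontier-API cost per cycle, in dollars
	const apiFilter = 0.05 // illustrative cost of an API-hosted classifier call
	const signal = 0.70    // 30% noise

	fmt.Printf("no filter:    $%.3f per trigger\n", mainCost)
	fmt.Printf("API filter:   $%.3f per trigger\n", perTriggerCost(mainCost, apiFilter, signal))
	fmt.Printf("local filter: $%.3f per trigger\n", perTriggerCost(mainCost, 0, signal))
	fmt.Printf("API filter breaks even at %.0f%% noise\n", 100*breakEvenNoiseRate(mainCost, apiFilter))
}
```

With a filter cost near zero the break-even noise rate is also near zero, which is why the local-filter row wins at almost any noise level. A real classifier makes mistakes, so false positives add back some main-agent cost that this sketch ignores.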
This flip is the spine of the broader local-vs-API decision. See the companion piece, Local LLM Cost-Capability Tradeoff, for the general framing — the wake-filter is the strongest specific case where local wins outright on a workload the API would otherwise dominate.
The pattern wins when:
- The trigger source is mostly noise (broad channel mentions, fan-out subscriptions)
- The main agent is frontier-API priced (per-cycle cost is meaningful)
- A local LLM is available for classification (per-call cost ~$0)
- Classifier latency fits within end-to-end SLO
The pattern doesn’t win when:
- The trigger source is already curated (every backlog assignment is a directive — there is nothing to filter)
- The main agent is itself cheap (local-on-local just adds latency)
- End-to-end response is latency-critical and the classifier round-trip pushes past budget
Implementation: Ollama OpenAI-compat
A working wake-filter is a single LLM call with three properties: deterministic output (temperature: 0), structured response, and a focused prompt. The OpenAI-compatible Ollama endpoint makes the integration boilerplate minimal.

    // Adapt to the host runtime; the shape is the same in any language.
    package wakefilter

    import (
    	"context"
    	"encoding/json"
    	"errors"
    	"fmt"
    	"math"

    	// Assumed client library; the calls below match its API.
    	openai "github.com/sashabaranov/go-openai"
    )

    // Event is the trigger being classified; only Summary reaches the prompt.
    type Event struct {
    	Summary string
    }

    type WakeFilter interface {
    	Classify(ctx context.Context, event Event) (wake bool, reason string, err error)
    }

    type LocalWakeFilter struct {
    	client    *openai.Client // OpenAI-compat client pointed at Ollama (host:11434/v1)
    	model     string         // "gemma4:e4b" or another small fast model
    	promptTpl string         // System prompt: "You are a wake-filter for agent X. Decide..."
    }

    func (l *LocalWakeFilter) Classify(ctx context.Context, e Event) (bool, string, error) {
    	resp, err := l.client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
    		Model: l.model,
    		Messages: []openai.ChatCompletionMessage{
    			{Role: "system", Content: l.promptTpl},
    			{Role: "user", Content: fmt.Sprintf("Event: %s\nDecide wake/skip with reason.", e.Summary)},
    		},
    		// With this client a literal 0 is dropped from the request (omitempty),
    		// so send the smallest nonzero value to get effectively temperature 0.
    		Temperature:    math.SmallestNonzeroFloat32,
    		ResponseFormat: &openai.ChatCompletionResponseFormat{Type: "json_object"},
    	})
    	if err != nil {
    		// On classifier infrastructure failure, default to wake (false-positive bias).
    		return true, "classifier_unavailable", err
    	}
    	if len(resp.Choices) == 0 {
    		// Indexing Choices[0] on an empty response would panic; wake instead.
    		return true, "empty_response", errors.New("wake-filter: no choices returned")
    	}
    	var decision struct {
    		Wake   bool   `json:"wake"`
    		Reason string `json:"reason"`
    	}
    	if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), &decision); err != nil {
    		// Malformed output also defaults to wake.
    		return true, "parse_failure", err
    	}
    	return decision.Wake, decision.Reason, nil
    }

The host running Ollama owns the latency budget. A small model like gemma4:e4b typically returns a JSON decision in under a second on consumer hardware; that fits inside most agent cycle SLOs. If the classifier service becomes unreachable, the integration above defaults to wake: wasting a cycle is recoverable, while missing real work usually is not.
The classifier prompt is the part that takes iteration. A vague prompt like “Should the agent wake?” is too abstract for a small model. A concrete prompt — “Is this message a directive that requires the PM agent to dispatch a backlog item, or is it a status update?” — gives the model criteria it can apply consistently. Wake-filter prompts must be concrete, role-specific, and grounded in examples. A wake-filter for a code-review agent classifies differently from one for a triage agent; the prompts should not be shared.
A working prompt has four parts. First, a one-sentence role definition that names the agent the filter is gating. Second, a list of wake criteria — concrete patterns that should always trigger the main agent. Third, a list of skip criteria with examples of the noise to drop. Fourth, an explicit JSON schema for the output. The order matters: small models anchor on the role, then check the wake list, then the skip list, and finally produce the structured response.

    You are a wake-filter for the PM agent. The PM dispatches backlog
    items to builders when given a directive.

    Wake the PM if the event is:
    - An explicit dispatch instruction ("assign X to Y", "queue this")
    - A new backlog item that needs triage
    - A direct mention with an actionable request

    Skip the PM if the event is:
    - A status update or progress report (no action requested)
    - A message tagged at another agent (PM is not the addressee)
    - A heartbeat, log line, or routine system message

    Respond with JSON: {"wake": <bool>, "reason": "<short phrase>"}

The skip list is what saves the most cost. Most production noise comes from a small number of recurring patterns: heartbeats, automated reports, broadcast announcements. Naming those patterns explicitly in the skip list teaches the small model to recognize them at a glance, without the model needing to reason about why they’re noise.
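One way to keep the four-part order from drifting as the prompt is edited is to assemble it from structured fields. The following is a minimal sketch; `PromptSpec` and `Render` are hypothetical names introduced for illustration, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
)

// PromptSpec holds the four parts of a wake-filter prompt.
// The type and its field names are illustrative.
type PromptSpec struct {
	Agent  string   // short name of the gated agent, e.g. "PM"
	Role   string   // one-sentence role definition
	Wake   []string // concrete patterns that should always wake the agent
	Skip   []string // recurring noise patterns to drop
	Schema string   // explicit JSON output schema
}

// Render emits the parts in a fixed order: role, wake list, skip list, schema.
func (p PromptSpec) Render() string {
	var b strings.Builder
	b.WriteString(p.Role + "\n")
	fmt.Fprintf(&b, "Wake the %s if the event is:\n", p.Agent)
	for _, w := range p.Wake {
		b.WriteString("- " + w + "\n")
	}
	fmt.Fprintf(&b, "Skip the %s if the event is:\n", p.Agent)
	for _, s := range p.Skip {
		b.WriteString("- " + s + "\n")
	}
	b.WriteString("Respond with JSON: " + p.Schema)
	return b.String()
}

func main() {
	spec := PromptSpec{
		Agent:  "PM",
		Role:   "You are a wake-filter for the PM agent. The PM dispatches backlog items to builders when given a directive.",
		Wake:   []string{"An explicit dispatch instruction", "A new backlog item that needs triage"},
		Skip:   []string{"A status update or progress report", "A heartbeat, log line, or routine system message"},
		Schema: `{"wake": <bool>, "reason": "<short phrase>"}`,
	}
	fmt.Println(spec.Render())
}
```

Keeping the wake and skip criteria as lists also makes prompt diffs easier to review: a new noise pattern shows up as one added line.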
Eval discipline
A classifier without an eval harness is a silent liability. Prompt edits, model upgrades, and upstream trigger changes all drift the decision boundary. Without a corpus to regress against, the drift becomes visible only after production breaks — either a cost spike (false positives) or a backlog of missed work (false negatives).
The minimum viable harness:

    eval/wake-filter/
    ├── corpus.jsonl            # 20-50 hand-labeled examples: {event: ..., expected_wake: bool}
    ├── runner.py               # iterates corpus, calls classifier, computes precision/recall
    └── results-2026-05-07.md   # snapshot per evaluation run

The corpus should cover every cluster of triggers the production stream produces, with a deliberate mix of obvious-wake, obvious-skip, and ambiguous edge cases. New trigger patterns observed in production should be added to the corpus with their correct labels. This is how the harness stays honest as the trigger source evolves.
Pre-merge requirement: the classifier must hit a target accuracy on the corpus before the prompt or model change deploys. Some teams use 95% as the bar; others require 100% on small corpora because every miss is a known production scenario the agent will see. One reference deployment uses 100% on a 23-case corpus as the merge gate — that specific number isn’t universal, but the discipline is: without an eval harness, classifier drift is invisible until production breaks.
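The harness layout names a Python runner; the scoring loop is small enough to sketch in any language, and the version below uses Go to match the classifier code. It assumes each corpus line carries a string `event` field, and it takes the classifier as a plain function so a stub can stand in for the real model call. `Case`, `Report`, `LoadCorpus`, `Evaluate`, and `Passes` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Case mirrors one corpus.jsonl line (assumed shape: string event, bool label).
type Case struct {
	Event        string `json:"event"`
	ExpectedWake bool   `json:"expected_wake"`
}

// Report summarizes one evaluation run. "Positive" means wake.
type Report struct {
	Total, Correct    int
	Precision, Recall float64
	Misses            []string // misclassified events, for the merge diff
}

// LoadCorpus parses JSONL text, skipping blank lines.
func LoadCorpus(jsonl string) ([]Case, error) {
	var cases []Case
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var c Case
		if err := json.Unmarshal([]byte(line), &c); err != nil {
			return nil, fmt.Errorf("corpus line %q: %w", line, err)
		}
		cases = append(cases, c)
	}
	return cases, sc.Err()
}

// Evaluate runs classify over every case and scores it against the labels.
func Evaluate(cases []Case, classify func(event string) bool) Report {
	r := Report{Total: len(cases)}
	var tp, fp, fn int
	for _, c := range cases {
		got := classify(c.Event)
		switch {
		case got && c.ExpectedWake:
			tp++
		case got && !c.ExpectedWake:
			fp++
		case !got && c.ExpectedWake:
			fn++
		}
		if got == c.ExpectedWake {
			r.Correct++
		} else {
			r.Misses = append(r.Misses, c.Event)
		}
	}
	if tp+fp > 0 {
		r.Precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		r.Recall = float64(tp) / float64(tp+fn)
	}
	return r
}

// Passes applies the merge gate; minAccuracy of 1.0 means 100% on the corpus.
func (r Report) Passes(minAccuracy float64) bool {
	return r.Total > 0 && float64(r.Correct)/float64(r.Total) >= minAccuracy
}

func main() {
	corpus := `{"event": "assign ticket 12 to the builder", "expected_wake": true}
{"event": "heartbeat ok", "expected_wake": false}
{"event": "status: build finished", "expected_wake": false}`
	cases, err := LoadCorpus(corpus)
	if err != nil {
		panic(err)
	}
	// Stub standing in for the real classifier call.
	stub := func(event string) bool { return strings.HasPrefix(event, "assign") }
	r := Evaluate(cases, stub)
	fmt.Printf("%d/%d correct, precision %.2f, recall %.2f, gate passed: %v\n",
		r.Total-len(r.Misses), r.Total, r.Precision, r.Recall, r.Passes(1.0))
}
```

Recording the misclassified events alongside the score is what makes a failed gate actionable: the diff names the exact case that moved.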
Failure modes and the false-positive bias
A wake-filter has two failure modes, and they are not symmetric.
| Failure | What it costs | Recovery | Tuning stance |
|---|---|---|---|
| False positive (wake on noise) | One wasted main-agent cycle | Automatic on the next cycle | Acceptable |
| False negative (skip real work) | Missed work, user-visible delay | Manual escalation or the next trigger | Avoid |
A wasted cycle costs the price of one main-agent invocation and is recoverable on the next trigger. A missed cycle hides real work — it doesn’t surface until a user notices something is overdue. Tune the classifier prompt for false-positive bias: when the model is uncertain, default to wake. This is the right asymmetry for almost every wake-filter use case.
The same bias governs error handling in the integration code. Network errors talking to the classifier, malformed JSON in the response, model timeouts — every one of these should default to wake. The classifier exists to save money on the easy-to-classify majority; the hard cases should fall through to the main agent that can actually reason about them.
Debugging classifier failures
Three failure signatures appear in production wake-filters often enough to recognize on sight.
Malformed JSON output. Small models occasionally drift from the requested format, especially after a prompt edit or model upgrade:

    Expected: {"wake": true, "reason": "explicit dispatch directive"}
    Got:      The answer here is to wake because this looks like...

The mitigation has two layers. First, request response_format: {type: "json_object"} if the model and inference server support it, which enforces JSON at decode time. Second, wrap the parser in a fallback that defaults to wake on parse failure. A classifier that occasionally produces prose is recoverable; one that produces prose AND blocks the main agent is not.
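The second layer can be sketched as a parser that treats anything other than a well-formed decision as a wake. One subtlety worth covering: valid JSON that omits the wake key decodes a plain bool to false, which would silently skip. The sketch below uses a pointer field to detect that case; `ParseDecision` and the reason strings are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParseDecision turns raw classifier output into a wake/skip decision.
// Anything that is not a well-formed decision wakes the main agent.
func ParseDecision(raw string) (wake bool, reason string) {
	var d struct {
		Wake   *bool  `json:"wake"` // pointer distinguishes "missing" from false
		Reason string `json:"reason"`
	}
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		// Prose, truncated JSON, or a wrongly typed field.
		return true, "parse_failure"
	}
	if d.Wake == nil {
		// Valid JSON without a usable "wake" key must not decode to skip.
		return true, "missing_wake_field"
	}
	return *d.Wake, d.Reason
}

func main() {
	for _, raw := range []string{
		`{"wake": true, "reason": "explicit dispatch directive"}`,
		`The answer here is to wake because this looks like...`,
		`{"reason": "status update"}`,
		`{"wake": false, "reason": "heartbeat"}`,
	} {
		wake, reason := ParseDecision(raw)
		fmt.Printf("wake=%v reason=%q\n", wake, reason)
	}
}
```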
False-positive cluster. Suddenly the main agent runs on everything; cost spikes; the alerts queue fills with low-value cycles.
- Diagnostic: dump the last 100 classifier decisions. Look for wake: true, reason: "" (an empty reason usually means the classifier punted to the safe default).
- Causes: prompt edit that loosened criteria, corpus drift that no longer matches production triggers, or an upstream change to the trigger source that introduced new event shapes the prompt doesn’t recognize.
False-negative cluster. Real work backs up; users escalate manually because the agent went quiet.
- Diagnostic: dump the last 100 classifier decisions. Look for wake: false on items that look like real work in retrospect.
- Causes: prompt overfit to skip examples, an unbalanced corpus that taught the classifier to be too aggressive, or a model upgrade that shifted the decision boundary.
Eval regression at merge time. The harness is the cheapest place to catch a problem:

    eval pre-merge required: 100% on corpus
    actual: 22/23 (95.7%)
    diff: case "X" now classified as skip; was wake before

Block the deploy. The root cause is almost always either a prompt edit that changed semantics in a way the author didn’t anticipate, or a model upgrade that moved the boundary on a borderline case. Either way, the corpus is doing its job: investigate before merging, not after.
Operational tips
A few things worth doing on day one rather than discovering through outages.
Log every decision. {event_id, wake, reason, model, prompt_version, latency_ms} per classification. Cost-spike investigations and accuracy audits both need this data; reconstructing it after the fact is impossible.
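A sketch of that record as one JSON line per classification follows. The field names come from the list above; the `DecisionLog` type, the `Line` method, and the sample values are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DecisionLog carries the six fields recorded for every classification.
type DecisionLog struct {
	EventID       string `json:"event_id"`
	Wake          bool   `json:"wake"`
	Reason        string `json:"reason"`
	Model         string `json:"model"`
	PromptVersion string `json:"prompt_version"`
	LatencyMS     int64  `json:"latency_ms"`
}

// Line renders the record as a single JSON line for an append-only log.
func (d DecisionLog) Line() (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	entry := DecisionLog{
		EventID:       "evt-1", // sample values only
		Wake:          true,
		Reason:        "explicit dispatch directive",
		Model:         "gemma4:e4b",
		PromptVersion: "v3",
		LatencyMS:     412,
	}
	line, err := entry.Line()
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```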
Version the prompt. Treat the wake-filter prompt as code — store it in a file, commit it, tag deploys with the version. When false positives or negatives spike, the first question is “did the prompt change recently?” and the answer should be a one-line check.
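One option for making that check cheap, offered here as an assumption rather than a prescribed method, is to derive the version from the prompt text itself so that any edit changes the identifier recorded with each decision. `PromptVersion` is a hypothetical helper, and the 12-character length is an arbitrary choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// PromptVersion derives a short, stable identifier from the prompt text.
// The same text always yields the same ID; any edit yields a different one.
func PromptVersion(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	before := "You are a wake-filter for the PM agent."
	after := before + " Skip heartbeats."
	fmt.Println(PromptVersion(before))
	fmt.Println(PromptVersion(after))
}
```

A content hash complements a commit tag rather than replacing it: the tag says which deploy shipped the prompt, and the hash confirms the running text matches what was evaluated.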
Pin the model. gemma4:e4b is not the same classifier as gemma4:e4b-instruct-q4, and model tags drift: a later pull of the same tag can quietly fetch a different build than the one that was evaluated. Pin the exact tag, and treat a model upgrade as a prompt change for eval-gating purposes.
Re-eval on trigger-source changes. A new event type, a new channel subscription, or a new producer of triggers can blow up classifier accuracy without anything in the wake-filter itself changing. Add representative samples of new event shapes to the corpus and re-run the harness before assuming the classifier still works.
Watch latency, not just accuracy. A classifier that takes three seconds to decide is a classifier that pushed the agent past its SLO. The host running Ollama should be capacity-checked under realistic load — a single stuck request behind a slow classifier batch is a worse outcome than no classifier at all.