Closed-Loop DONE for Autonomous Agent CI/CD: Why 'PR Opened' Is Not Shipped

A backlog item flips to status='completed' in the database. The dashboard ticks up. The agent posts “PR ready for review” and walks away. Three hours later, a different agent notices the fleet is running yesterday’s binary. The PR was never reviewed. CI was red on main. No image got built. Nothing actually shipped.

This is the closed-loop problem. When an autonomous agent declares work complete, what does “complete” mean? In most agent fleets, it means the agent called the last tool in its own workflow — typically open_pr or its equivalent. That is not the same as “the change is live for users”, and the gap between the two is where state-of-record systematically lies.

The unfinished pipeline

A typical agent-driven change passes through more independently owned stages than a human-driven one, because the builder agent owns only the first two:

[1] Agent writes code      → owned by builder agent
[2] PR opened              → owned by builder agent      ← most fleets stop counting here
[3] PR reviewed            → owned by reviewer agent(s)
[4] PR merged              → owned by architect / human / autonomous-merge logic
[5] Post-merge CI runs     → owned by CI system
[6] Image built + pushed   → owned by CI system / registry
[7] Pods rolled            → owned by deploy controller / k8s
[8] Health check verified  → owned by health monitor

Each stage is a state gate. Each gate has a different owner. Each gate can fail silently. And the more autonomous the system, the more of these gates run with no human in the loop to notice when one stalls.

The default failure mode is: stages 1–2 succeed in a tight loop, stages 3–8 are best-effort, and “DONE” gets recorded against the wrong stage. Dashboards then report items as complete hours, sometimes weeks, before the change ships, if it ships at all.

Why this happens

When agent runtimes were first designed, the implicit assumption was that downstream stages were boring. Reviewers reviewed promptly. CI passed reliably. Pods auto-rolled on image push. So coupling “DONE” to “agent finished its part” was a reasonable shortcut.

In production those assumptions break in three observable ways:

Reality                                              Effect on “PR opened = DONE”
────────────────────────────────────────────────────────────────────────────────────────────────────────────
Reviewer agents go silent (stuck, paused, stalled)   PR sits unreviewed, item shows completed
CI infrastructure flakes / breaks                    Merge happens, image never builds, item shows completed
Deploy pipeline is manual or partial                 Image exists, no pod runs it, item shows completed

The lie compounds. Dashboards trust the state field. Velocity metrics derive from it. The architect agent’s cycle summary reads from it. By the time the divergence is noticed, “completed” items are weeks-old PRs that never shipped, and the team has been planning capacity against false throughput numbers.

The closed-loop design

Three principles, in order of effort to adopt.

1. Enumerate every gate. Assign each an owner. Walk through your pipeline and write down every external system between “agent finished” and “live for users”. Most teams discover 3–4 gates they had not explicitly named. For each gate, ask: which agent or process is responsible for advancing past it? If the answer is “nobody — it just happens”, that gate is your next outage.

2. Make the state field reflect the latest gate passed, not the first. Move the trigger for status='completed' to the last gate you can reliably observe, not the first one the agent controls. For most teams this is “merge confirmed + post-merge CI green for that commit”. Going further (image built, pods rolled) is better but requires more pipeline visibility — pick the strictest gate you can actually watch.

3. Observability beats enforcement. Do not block merges or other transitions on the gates. Some merges legitimately need to land while a downstream stage is red — the PR that fixes broken CI is the canonical example. Block the wrong transition and you cannot recover the loop without manual surgery. Instead, alert when any item sits between gates longer than expected.
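
The first two principles can be sketched as a small gate registry. This is an illustration under stated assumptions, not a prescribed schema: the gate names mirror the eight stages listed earlier, while GATE_OWNERS, unowned_gates, and latest_gate_passed are hypothetical names introduced here. The sketch also assumes the state field should reflect the longest unbroken run of passed gates, which is one conservative reading of "latest gate passed".

```python
from enum import IntEnum
from typing import Dict, List, Optional, Set


class Gate(IntEnum):
    """Pipeline gates in order; values mirror the stage numbers listed earlier."""
    CODE_WRITTEN = 1
    PR_OPENED = 2
    PR_REVIEWED = 3
    PR_MERGED = 4
    POST_MERGE_CI = 5
    IMAGE_BUILT = 6
    PODS_ROLLED = 7
    HEALTH_VERIFIED = 8


# Ownership map taken from the stage list. None would mean
# "nobody, it just happens", which is the case to hunt for.
GATE_OWNERS: Dict[Gate, Optional[str]] = {
    Gate.CODE_WRITTEN: "builder agent",
    Gate.PR_OPENED: "builder agent",
    Gate.PR_REVIEWED: "reviewer agent(s)",
    Gate.PR_MERGED: "architect / human / autonomous-merge logic",
    Gate.POST_MERGE_CI: "CI system",
    Gate.IMAGE_BUILT: "CI system / registry",
    Gate.PODS_ROLLED: "deploy controller / k8s",
    Gate.HEALTH_VERIFIED: "health monitor",
}


def unowned_gates(owners: Dict[Gate, Optional[str]]) -> List[Gate]:
    """Return every gate that has no named owner (missing or None)."""
    return [gate for gate in Gate if owners.get(gate) is None]


def latest_gate_passed(passed: Set[Gate]) -> Optional[Gate]:
    """Return the last gate of the unbroken run that starts at stage 1.

    A gate recorded out of order (merged but never reviewed, say) does not
    advance the result past the first gap.
    """
    latest = None
    for gate in Gate:
        if gate not in passed:
            break
        latest = gate
    return latest
```

Running unowned_gates against your own map is a quick way to find the gates nobody has claimed; any entry it returns is a candidate for the "next outage" described in principle 1.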

Picking the definition

There are roughly three definitions of DONE worth considering, each strictly stronger than the previous:

Definition                                       What it asserts                               What it misses
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
A. PR merged                                     The change is on the target branch            Red main CI; image not built; deploy not run
B. PR merged AND main CI green for that commit   The change builds and tests cleanly on main   Image not built (different stage); pods not on new image
C. PR merged AND image built AND pods deployed   The change is running in production           Health verification (still rare to gate on)

Most teams running an agent CI/CD pipeline are using something weaker than A — they fire DONE on open_pr. Moving to A captures most of the gap without much engineering effort. B is the right phase-one target for teams whose CI is itself reliable enough to make the signal meaningful. C is the long-term shape and requires deploy-stage visibility most teams do not yet have.
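
The three definitions can be written as predicates over per-item gate timestamps. The sketch below is illustrative only: ItemState and its field names are hypothetical, and definition C is built on top of B because the text describes each definition as strictly stronger than the previous one.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ItemState:
    """Hypothetical gate timestamps for one backlog item; None = not passed."""
    merged_at: Optional[datetime] = None
    main_ci_green_at: Optional[datetime] = None
    image_built_at: Optional[datetime] = None
    pods_rolled_at: Optional[datetime] = None


def is_done(item: ItemState, definition: str) -> bool:
    """Evaluate DONE under definition "A", "B", or "C"."""
    merged = item.merged_at is not None
    ci_green = item.main_ci_green_at is not None
    deployed = item.image_built_at is not None and item.pods_rolled_at is not None

    if definition == "A":
        return merged
    if definition == "B":
        return merged and ci_green
    if definition == "C":
        # Assumption: C includes B's CI check, so each level implies the last.
        return merged and ci_green and deployed
    raise ValueError(f"unknown definition: {definition!r}")
```

Keeping the definition as a parameter makes it possible to compute all three numbers side by side during a transition, so the team can see how far apart A, B, and C are before committing to one.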

The trap to avoid: jumping to C in one move. Items will stack up indefinitely in intermediate states whenever any downstream stage is degraded. Without alerts on each gate (next section), you trade one observability gap for a different one.

Closing each gate

Each gate needs three properties to be actually closed:

A state value that records whether the gate has been passed (a database column, a label, a stored timestamp).
A writer — the agent or process that advances the state. The writer should be the system closest to the gate (the CI system writes “CI green”, not a poller that infers it).
An alert that fires when the state sits at this gate longer than the expected duration plus a buffer.

For the typical agent pipeline:

Gate                State writer              Alert
─────────────────────────────────────────────────────────────────────
PR opened           Builder agent (open_pr)   no alert (transient)
PR reviewed         Reviewer agent webhook    >2h with no review
PR merged           Merge webhook → watcher   no alert (terminal-ish)
Post-merge CI       CI webhook → watcher      >30min build pending
Image built         Registry webhook          >15min after CI green
Pods rolled         Deploy controller         >10min after image
Health verified     Health check job          >5min after rollout

Each row is a thin component. The watcher pattern (small process listening to webhooks, updating a state row, polling external systems for the next gate) handles most of the wiring. The alerts route to whichever channel your team already uses for ops (Mattermost, Slack, PagerDuty).
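
The alert column of the table reduces to a lookup plus a time comparison. A minimal sketch, assuming each item records which gate it is waiting on and since when; the gate keys and the overdue_gate function are hypothetical names, and the thresholds are the ones from the table.

```python
from datetime import datetime, timedelta
from typing import Optional

# Thresholds from the table. Gates marked "no alert" are left out on purpose.
ALERT_AFTER = {
    "pr_reviewed": timedelta(hours=2),
    "post_merge_ci": timedelta(minutes=30),
    "image_built": timedelta(minutes=15),
    "pods_rolled": timedelta(minutes=10),
    "health_verified": timedelta(minutes=5),
}


def overdue_gate(waiting_on: str, waiting_since: datetime, now: datetime) -> Optional[str]:
    """Return an alert message if the item has waited on this gate too long.

    `waiting_since` is the time the previous gate was passed. Returns None
    when the gate has no alert configured or the wait is within its limit.
    """
    limit = ALERT_AFTER.get(waiting_on)
    if limit is None:
        return None
    waited = now - waiting_since
    if waited > limit:
        return f"{waiting_on} pending for {waited} (limit {limit})"
    return None
```

The returned string is what a watcher would hand to the ops channel; delivery is deliberately left out because it depends on the channel in use.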

The merge-while-red exception

Branch protection on the main branch is the most common knob teams reach for to “enforce” some of these gates. It almost always overshoots.

A CI-required-to-pass rule on main blocks every PR — including the PR that fixes CI itself. When the test suite is broken, the only way out is to merge the test fix without the test suite passing. If branch protection blocks that, you are in a deadlock that requires either disabling protection temporarily (and remembering to re-enable it) or merging via an admin override that bypasses the audit trail you wanted in the first place.

The correct pattern: leave required-checks empty on main. Observe red CI via alert (JenkinsMainBranchCIFailing over 15 minutes, say). Accept that some merges land red. The alert plus the state-of-record fix above ensures red main does not silently rot — somebody is paged, and items waiting on main-CI-green sit in their intermediate state until the next green build.

The same logic applies to other gates. Enforcement is for invariants that should never be violated (no PRs from untrusted users to protected branches, no force-push to main). Health gates are different in kind — they describe a desired-state-not-yet-reached, and forcing transitions through them when they fail just creates more sophisticated workarounds.
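
The observe-instead-of-block rule for red main can be sketched as a check over recent main-branch build results. This is an assumption-laden illustration: red_since and should_page are hypothetical names, the build history is a plain list rather than a real CI client, and any result other than "SUCCESS" is treated as red.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

# (finished_at, result); result is "SUCCESS" or anything else (treated as red).
Build = Tuple[datetime, str]


def red_since(builds: Sequence[Build]) -> Optional[datetime]:
    """Return when the current red streak on main began, or None if green.

    `builds` must be ordered oldest to newest.
    """
    start: Optional[datetime] = None
    for finished_at, result in builds:
        if result == "SUCCESS":
            start = None          # a green build ends any red streak
        elif start is None:
            start = finished_at   # first red build of a new streak
    return start


def should_page(
    builds: Sequence[Build],
    now: datetime,
    threshold: timedelta = timedelta(minutes=15),
) -> bool:
    """True when main has been red for longer than `threshold`."""
    start = red_since(builds)
    return start is not None and now - start > threshold
```

Nothing here prevents a merge. The check only decides whether someone gets paged, which is what lets the CI-fixing PR land while main is red.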

Implementation: the watcher pattern

The cheapest way to close the loop is one or two small watchers that subscribe to existing webhooks and poll Jenkins (or your CI system) every minute. Pseudocode:

# Sketch only. Assumptions: `db` is an asyncpg-style pool ($1 placeholders,
# rows indexed by column name); `ci` and extract_item_id_from_pr_body are
# placeholders for your own CI client and PR-body parser.

# Subscribe to Gitea/GitHub PR-merged events
async def on_pr_merged(event):
    item_id = extract_item_id_from_pr_body(event.pr.body)
    if not item_id:
        return  # not a tracked item
    # Record the merge gate only; status stays untouched until main CI is green.
    await db.execute("""
        UPDATE backlog_items
        SET merged_at = $1, merged_commit = $2
        WHERE item_id = $3 AND status != 'completed'
    """, event.merged_at, event.merge_commit_sha, item_id)

# Poll CI for items waiting on main-branch build
async def poll_main_ci():
    waiting = await db.fetch("""
        SELECT item_id, merged_commit, repo
        FROM backlog_items
        WHERE merged_commit IS NOT NULL AND completed_at IS NULL
    """)
    for item in waiting:
        result = await ci.last_main_build_result(item["repo"], after=item["merged_commit"])
        if result == "SUCCESS":
            # Guard on completed_at so a concurrent poller cannot overwrite the timestamp.
            await db.execute("""
                UPDATE backlog_items
                SET status = 'completed', completed_at = NOW()
                WHERE item_id = $1 AND completed_at IS NULL
            """, item["item_id"])
        elif result == "FAILURE":
            # Don't change state; alert (or rely on the JenkinsMainBranchCIFailing alert)
            pass
        # PENDING / RUNNING: leave for next poll

Two notes on this shape:

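Webhook delivery is best-effort, not guaranteed, so the handler alone cannot be trusted to record every merge. The poll loop is the reconciliation point: a poll that also queries the Git host's API for recently merged PRs picks up any merge whose webhook was dropped. The sketch above leaves that API query out for brevity.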
Items that get stuck in the intermediate state (merged_at set, completed_at null) are themselves a signal. A query for “items merged >24h ago not completed” surfaces stuck pipelines without needing a separate alert.
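
The stuck-items query from the second note can be sketched against an in-memory SQLite table so it runs on its own. The column names follow the pseudocode above; the SQLite schema, the stuck_items function, and the choice to store timestamps as uniformly formatted ISO 8601 UTC strings (so that text comparison matches time order) are assumptions of this sketch.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import List

# Minimal stand-in for the real table; only the columns the query needs.
SCHEMA = """
CREATE TABLE backlog_items (
    item_id      TEXT PRIMARY KEY,
    merged_at    TEXT,   -- ISO 8601 UTC, NULL until the merge gate passes
    completed_at TEXT    -- ISO 8601 UTC, NULL until DONE is recorded
)
"""


def stuck_items(
    conn: sqlite3.Connection,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return ids of items merged more than `max_age` ago and not completed."""
    cutoff = (now - max_age).isoformat()
    rows = conn.execute(
        """
        SELECT item_id
        FROM backlog_items
        WHERE merged_at IS NOT NULL
          AND completed_at IS NULL
          AND merged_at < ?
        ORDER BY merged_at
        """,
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

On PostgreSQL the same filter would compare native timestamps instead of strings; the shape of the WHERE clause is the part that carries over.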

For the deploy and health-verify gates, the same pattern repeats — subscribe to image-pushed events, watch deployment-status events, etc. Each gate adds one small component. The architecture stays simple as long as each watcher only touches its own gate.
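
One way to keep each watcher confined to its own gate is to route every write through a helper that knows which column a watcher owns. A minimal sketch: the watcher names, column names, and record_gate are all hypothetical, and the row is a plain dict standing in for a database row.

```python
from typing import Dict, Optional

# Each watcher owns exactly one column. The names are illustrative.
GATE_COLUMN = {
    "image_watcher": "image_built_at",
    "deploy_watcher": "pods_rolled_at",
    "health_watcher": "health_verified_at",
}


def record_gate(row: Dict[str, Optional[str]], watcher: str, at: str) -> bool:
    """Set this watcher's own gate timestamp once.

    Returns True if the row changed. A duplicate or redelivered event is
    ignored so the first recorded timestamp is kept.
    """
    column = GATE_COLUMN[watcher]
    if row.get(column) is not None:
        return False
    row[column] = at
    return True
```

Because the helper is idempotent, it does not matter whether an event arrives through a webhook, a poll, or both.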

Migration: handling existing “completed” items

A team that has been firing DONE early has a backlog of items that are nominally completed but were never actually shipped, plus a batch of items that did ship but predate the new tracking columns. Two migration strategies:

Backdate everything. Set merged_at = completed_at = updated_at for all status='completed' items. Loses the actual merge timestamps but unblocks dashboards immediately. Use when you don’t have a way to recover real merge times.

Take the cliff. Run the migration that adds the new columns and leaves them null on existing items. Metrics like “items completed per day” will appear to crash for the first 24h of the new regime. This is honest — the previous metric was over-counting. Communicate the cliff before deploying.

A third option some teams try — recover real merge times from git log or CI history — is more accurate but typically not worth the engineering effort. The cliff is fine if you announce it.
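
The two strategies differ by a single UPDATE. The sketch below uses SQLite so it is self-contained; it assumes the existing table already has status and updated_at columns and that the new columns are merged_at and completed_at, as in the watcher pseudocode. The function names are hypothetical.

```python
import sqlite3

# New tracking columns; they start out NULL on every existing row.
ADD_COLUMNS = (
    "ALTER TABLE backlog_items ADD COLUMN merged_at TEXT",
    "ALTER TABLE backlog_items ADD COLUMN completed_at TEXT",
)


def take_the_cliff(conn: sqlite3.Connection) -> None:
    """Add the new columns and leave them NULL on existing items."""
    for statement in ADD_COLUMNS:
        conn.execute(statement)
    conn.commit()


def backdate_everything(conn: sqlite3.Connection) -> None:
    """Add the new columns, then copy updated_at into them for completed items.

    The real merge timestamps are lost; updated_at is only a stand-in.
    """
    take_the_cliff(conn)
    conn.execute(
        """
        UPDATE backlog_items
        SET merged_at = updated_at, completed_at = updated_at
        WHERE status = 'completed'
        """
    )
    conn.commit()
```

Either function is meant to run once. Under the cliff strategy, any metric that filters on completed_at starts from zero, which is the drop to announce ahead of time.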

Dashboards that get less satisfying

Closing the loop honestly will make several dashboards look worse before they look better.

Velocity (items completed per period) drops. The previous number was over-counted; the new number is real. Document the change so it does not get pattern-matched as a regression.
Time-to-completion (open → completed) increases substantially. It now reflects real DONE, not optimistic DONE.
A new metric appears: items in intermediate states (merged but not completed). This is the new safety surface. Watch for the count growing — that means downstream is degrading.

The first week is uncomfortable. Then the dashboards become trustworthy and the team can plan capacity against actual throughput instead of declared throughput.
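
Two of the metrics above are simple to compute once the gate timestamps exist. A minimal sketch, assuming each item carries opened_at, merged_at, and completed_at; the Item class and both function names are hypothetical, and the median is used as one reasonable summary of time-to-completion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence


@dataclass
class Item:
    """Hypothetical view of one backlog item; None = gate not passed yet."""
    opened_at: datetime
    merged_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None


def intermediate_count(items: Sequence[Item]) -> int:
    """Count items that are merged but not yet completed."""
    return sum(
        1 for item in items
        if item.merged_at is not None and item.completed_at is None
    )


def median_time_to_completion(items: Sequence[Item]) -> Optional[timedelta]:
    """Median of (completed_at - opened_at) over completed items, or None."""
    seconds = [
        (item.completed_at - item.opened_at).total_seconds()
        for item in items
        if item.completed_at is not None
    ]
    if not seconds:
        return None
    return timedelta(seconds=median(seconds))
```

Plotting intermediate_count over time gives the early warning described above: a steady rise means a downstream stage is falling behind.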

What this is not

This pattern is not about replacing humans-in-loop with autonomous merge. Some changes should require human approval before merging, period. Closing the loop after merge is orthogonal to whether merge itself is autonomous.

It is also not a replacement for code review. Reviewers should still catch broken changes before merge. The loop closure exists to surface the cases where review missed something AND CI catches it AND the fleet was still stuck on the old image.

And it is not a one-time project. New gates appear as the pipeline evolves — a security scan stage, a canary deploy, a feature-flag rollout. Each new gate needs the same three properties (state value, writer, alert) added at the time of introduction. The discipline of “closing the loop” is something you maintain, not something you finish.

Where to start

Pick the smallest version that closes the largest gap.

For most teams that gap is stages 3–5 (review → merge → main CI). The watcher pattern is one small service, no schema explosion, no enforcement risk, and it catches the failure mode that most often leaves the fleet running stale binaries. Image build and deploy gates are next, but only once the merge-CI loop is reliable. Health verification is the limit case — most teams find it overkill.

The framing question to ask in any planning meeting: “If our most autonomous agent declared its work DONE right now, how many systems would have to succeed downstream before our users actually saw the change? Which of those systems are we relying on, but not watching?” The number is almost always higher than people expect, and the gap between “owned” and “watched” is where the loop needs closing.