Rollback and Incident Response for Multi-Agent Cognition
The un-promotion path, model pinning, and the regression postmortem.
§ I Frame
Last Monday the curriculum closed the forward loop. A drift verdict fires retraining, a candidate is assembled and trained, and a promotion gate admits it only after shadow evaluation proves it at least as good as the model it replaces. The gate is careful. It is also one-directional. It decides what goes live. It says nothing about what happens when a model the gate admitted turns out, in production, to be worse than the one it replaced.
That gap is not hypothetical. A candidate can clear every offline check and still fail in the open. The regression set was 600 historical tickets; production is a million live ones with a distribution the labeled set never captured. The shadow window was four days; the failure mode shows up on day six. The promotion gate is honest about what it measured, and what it measured was not everything.
So the production system needs a second path, the reverse of the first. The forward path promotes a candidate to live. The reverse path takes a live model that has gone bad and puts the prior one back. Call it the un-promotion path. It is not an error handler bolted on after a bad night. It is a designed capability, rehearsed before it is needed, with the same care the promotion gate received.
An operator who builds the promotion gate and skips the un-promotion path has built a system that can only move forward. Forward is where the trouble is.
§ II Foundations
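Rollback in a model system is not one action. It is three capabilities that must already exist before the bad model ships: a live model always referenced by an explicit pinned version, a reverse gate rehearsed in calm conditions, and a regression postmortem that feeds each production failure back into the promotion gate. Name them; the incident response follows.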
These three compose into a discipline with one property worth stating plainly. Rollback is a deployment, not an undo. The prior model does not magically reappear in its old state; it is deployed again, through the same serving mechanism, with the same checks the forward path uses. Treating rollback as a deploy is what keeps the way back as solid as the way forward.
§ III Mechanism
Model pinning, and why "latest" is a trap
A serving layer that resolves the model by a moving pointer inherits a subtle failure. When the operator says "roll back," the system asks "to what?" and the only honest answer is a version number. If the deployment config names model:latest, there is no version number to name; the operator is reduced to re-registering an old artifact and hoping the pointer catches it. Under incident pressure that hope fails often.
Pinning is the cheap fix. The deployment references model:v13, not model:latest. The registry keeps v13 intact and reachable for as long as the retention policy holds. When v14 goes bad, the operator changes one pinned reference from v14 back to v13, and the system has a precise, testable target.
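A minimal sketch of the pinned reference, assuming a deployment that stores its model as a `name:tag` string and a registry modeled as a plain set of retained references. The names `Deployment`, `require_pinned`, and `roll_back` are hypothetical and stand in for whatever the real serving layer provides.

```python
from dataclasses import dataclass


class UnpinnedReferenceError(ValueError):
    """Raised when a reference names a moving pointer instead of a version."""


@dataclass
class Deployment:
    # The pinned reference, e.g. "model:v14" (hypothetical format).
    model_ref: str


def require_pinned(ref: str) -> str:
    """Return the version tag, rejecting moving pointers such as 'latest'."""
    name, _, tag = ref.partition(":")
    if not name or not tag or tag == "latest":
        raise UnpinnedReferenceError(f"{ref!r} is not a pinned version")
    return tag


def roll_back(deployment: Deployment, registry: set, target_ref: str) -> Deployment:
    """Deploy the prior pinned version again; a deploy, not an undo."""
    require_pinned(deployment.model_ref)
    require_pinned(target_ref)
    if target_ref not in registry:
        raise LookupError(f"{target_ref!r} is no longer retained in the registry")
    return Deployment(model_ref=target_ref)


# Assumed registry contents: both versions are still retained.
registry = {"model:v13", "model:v14"}
live = Deployment(model_ref="model:v14")
live = roll_back(live, registry, "model:v13")
```

The design choice worth noting is that the rollback refuses to run at all when either reference is a moving pointer, so "to what?" always has a version number for an answer.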
The multi-agent wrinkle compounds this. A model serving a fleet has many consumers, and they do not all read the version the same way. An agent that cached context against v14's behavior holds state that assumes v14. Rolling the model back to v13 without invalidating that cache leaves the agent reasoning against a model that no longer exists. Pinning the model is half the job; pinning the consumers' expectations to the model version is the other half.
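One way to pin a consumer's expectations to the model version is to stamp the agent's cache with the version it was built against and drop it when the live version differs. The sketch below assumes the agent can ask the serving layer which version is live; `VersionedContextCache` and its methods are hypothetical names, not an existing API.

```python
from typing import Dict, Optional


class VersionedContextCache:
    """Agent-side cache whose entries are valid for one model version only."""

    def __init__(self, model_version: str) -> None:
        self.model_version = model_version
        self._entries: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._entries[key] = value

    def get(self, key: str, live_version: str) -> Optional[str]:
        # A version change (forward or back) invalidates everything cached
        # against the old model's behavior.
        if live_version != self.model_version:
            self._entries.clear()
            self.model_version = live_version
        return self._entries.get(key)


cache = VersionedContextCache("v14")
cache.put("ticket-1", "routed: billing")
hit = cache.get("ticket-1", live_version="v14")   # still valid under v14
miss = cache.get("ticket-1", live_version="v13")  # dropped after rollback
```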
The reverse gate, rehearsed before it is needed
The promotion gate runs every time a candidate ships. The reverse gate runs almost never, which is exactly why it rots. A path exercised once a quarter, in anger, at 3 a.m., is a path nobody trusts.
The discipline borrowed from site reliability practice is to rehearse the rollback in calm conditions. Once a week, in a controlled window, the operator rolls the live model back to the prior pinned version and forward again, watching the same eval signals a real incident would watch. The rehearsal proves three things: the prior version is still reachable, the traffic-shift mechanism still works, and the consumers tolerate the swap.
Gift and Deza's treatment of controlled rollout makes the structural point: the same mechanism that ramps a new model up is the mechanism that ramps a bad one down. Blue-green keeps two versions live and flips traffic atomically, so rollback is a second flip. Canary ramps one percent, ten, full, and the same dial run backward is the rollback. The operator who built the forward ramp already owns the reverse gate; the work is naming it, wiring its trigger, and rehearsing it so the dial turns both ways without thought.
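The "same dial, run backward" idea can be sketched as a single function that produces traffic splits for either direction. This is an illustration only: `ramp_plan` is a hypothetical helper, and the step percentages are assumptions taken from the ramp described above.

```python
from typing import Dict, List, Sequence


def ramp_plan(source: str, target: str,
              steps: Sequence[int] = (1, 10, 100)) -> List[Dict[str, int]]:
    """Traffic splits (in percent) that move load from `source` to `target`."""
    return [{source: 100 - pct, target: pct} for pct in steps]


# Forward path: promote v14 over v13.
forward = ramp_plan("v13", "v14")

# Reverse path: the same function with the arguments swapped.
reverse = ramp_plan("v14", "v13", steps=(10, 100))
```

Nothing in the function knows whether it is promoting or rolling back. That symmetry is the point: the reverse gate is the forward ramp with its arguments exchanged, which is why rehearsing one exercises the other.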
The trigger, and the difference between a rollback and a panic
A rollback fired on a hunch is a panic. A rollback fired on a signal is incident response. The difference is whether the trigger reads the same eval discipline the promotion gate read.
When a freshly promoted model's canary metric falls below the prior model's established floor across a sustained window, that is the rollback trigger, and it is the mirror image of the drift trigger that fires retraining. Drift below floor on the live model summons a new candidate; regression below floor on a just-promoted candidate summons the prior model back. The same instrument reads both.
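A sustained-window trigger of this kind might look like the following sketch. The floor value, the window length, and the class name `RollbackTrigger` are assumptions for illustration; the only behavior taken from the text is that the trigger fires when every reading in the window sits below the prior model's floor, not on a single dip.

```python
from collections import deque


class RollbackTrigger:
    """Fires only when every sample in a full window is below the floor."""

    def __init__(self, floor: float, window: int) -> None:
        self.floor = floor
        self.samples = deque(maxlen=window)

    def observe(self, canary_metric: float) -> bool:
        self.samples.append(canary_metric)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s < self.floor for s in self.samples)


# Assumed floor of 0.90 and a three-sample window.
sustained = RollbackTrigger(floor=0.90, window=3)
sustained_verdicts = [sustained.observe(m) for m in (0.92, 0.81, 0.81, 0.81)]

# A single dip that recovers never fills the window with bad readings.
spike = RollbackTrigger(floor=0.90, window=3)
spike_verdicts = [spike.observe(m) for m in (0.92, 0.80, 0.92, 0.92)]
```

In the first run the trigger stays quiet until the one healthy reading has aged out of the window; in the second it never fires. That is the difference between a signal and a hunch, expressed as a window length.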
The regression postmortem and the gap it closes
The model is back. The incident is over. The work is not. A regression that reached production means the promotion gate measured something incomplete. The shadow window missed a distribution; the regression set lacked a category; the adversarial bank had no example of the thing that broke.
The site-reliability practice of keeping a history of outages is the model. Each regression postmortem records what the bad model did, the signal that caught it, the time-to-detect and time-to-rollback, and the specific offline check that should have caught it and did not. That last line is the one that pays. A postmortem that ends with "add the new-product ticket category to the regression set" closes the exact gap this incident exposed, so the next candidate that fails this way fails offline, in shadow, where no user sees it. The postmortem turns one production regression into one permanent addition to the gate.
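The postmortem fields listed above map naturally onto a small record type. The sketch below is one possible shape, not a prescribed schema: the field names are hypothetical, and the timestamps in the usage example are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RegressionPostmortem:
    model_version: str
    what_happened: str
    detecting_signal: str
    regression_began: datetime
    detected_at: datetime
    rolled_back_at: datetime
    missed_offline_check: str  # the check that should have caught it
    gate_addition: str         # the permanent change to the promotion gate

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.regression_began

    @property
    def time_to_rollback(self) -> timedelta:
        return self.rolled_back_at - self.detected_at


# Illustrative values only; the timestamps are made up.
pm = RegressionPostmortem(
    model_version="v14",
    what_happened="misrouted tickets to the catch-all queue",
    detecting_signal="canary metric below floor across a sustained window",
    regression_began=datetime(2026, 1, 1, 9, 0),
    detected_at=datetime(2026, 1, 1, 15, 0),
    rolled_back_at=datetime(2026, 1, 1, 15, 9),
    missed_offline_check="regression set lacked the affected ticket category",
    gate_addition="add the missing ticket category to the regression set",
)
```

Making `gate_addition` a required field is deliberate: a record cannot be constructed without naming the change to the gate, which is the line the text says pays.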
§ IV Worked Example
The support-routing agent from last Monday's lesson is live on v14, the candidate that fixed the new-product drift. For two days the canary metric holds at 0.92, where v14 was promoted.
On the third day a marketing campaign drives a surge of tickets in a phrasing the training data underrepresented: terse, all-caps, heavy on a product nickname rather than its catalog name. v14 routes these to the catch-all queue at a rate v13 never did, because v14's fine-tuning over-weighted the formal phrasing the labeled set carried. The canary accuracy slides to 0.81 across a six-hour window. The regression set, all formal historical tickets, shows nothing; the failure lives entirely in the live distribution the offline checks never held.
The eval pipeline's regression detector sees the canary metric cross the floor and emits a verdict. The rollback trigger reads it, confirms the signal held across the full window rather than one spike, and fires the reverse gate. Because the deployment pins model:v14 explicitly and v13 is intact in the registry, the operator's automation flips the pinned reference to v13 and ramps traffic back the way it ramped forward: ten percent to v13 while eval watches, then full. Within nine minutes of the trigger, v13 carries all routing traffic and the canary accuracy recovers to 0.91. The fleet's consumer caches are invalidated on the version change, so no agent reasons against a model that is no longer live.
The incident is closed; v14 returns to the registry with its scores and its production failure recorded against it. The postmortem names the gap: the regression set had no all-caps, nickname-heavy tickets, so the gate could not have caught this offline. The correction is one line: add a nickname-and-shorthand category to the regression bank, sourced from the surge tickets the support team triaged. The next candidate that over-weights formal phrasing now fails in shadow. One production regression became one permanent strengthening of the gate.
§ V Connection to Prior Lessons
The 2026-06-08 lesson built the promotion gate and ended on a line: retraining proposes, the gate disposes. This lesson is the gate's reverse face. The forward gate admits a candidate to live; the reverse gate returns a regressed model to the registry and the prior one to traffic. A system that built only the forward gate can move a model in but not out under pressure, and the model it cannot move out is exactly the one causing harm.
The 2026-06-01 lesson on evaluation pipelines named three verdict patterns: promotion gating, rollback triggering, freeze on adversarial regression. Last Monday consumed the first. This Monday consumes the second. The rollback trigger is not new instrumentation; it is the eval pipeline's regression verdict, read on a just-promoted model instead of an aging one.
The 2026-05-25 lesson on model-serving topology named versioning as a serving primitive. A serving plane that pins versions can flip between them; a serving plane that resolves "latest" can only move forward. The topology lesson built the dial. This lesson turns it backward.
§ VI Connection to Today's Dev Lesson
The Rust Dev lesson today encodes the deployment lifecycle as an enum and makes the rollback a typed transition the compiler checks. Where this Ops lesson names model pinning, the reverse gate, and the postmortem as a discipline held by convention, the Rust lesson shows how the type system can hold part of that discipline for you.
A deployment is in one of a small set of states: a candidate in shadow, a model live, a model rolled back. The Rust lesson makes each state a variant of an enum and each legal move a function whose signature only accepts the states the move is valid from. A rollback that tries to fire from a state with no prior pinned version becomes a type error rather than a 3 a.m. surprise. The discipline this lesson states in prose becomes, in Rust, a shape the compiler refuses to let you violate.
§ VII Closing
A production model system is judged by both directions of its dial. The promotion gate is the forward direction, and it earns most of the attention because it is exercised every day. The un-promotion path is the reverse direction, exercised almost never, and it is the one that decides how long a bad model harms users before the prior one is back.
Three capabilities make the reverse path real: a live model always referenced by an explicit pinned version, a reverse gate rehearsed in calm conditions so it works under pressure, and a regression postmortem that turns each production failure into one permanent strengthening of the gate. Rollback is a deployment, not an undo; the way back is built with the same care as the way forward.
The model serving your users right now is one trigger away from being the wrong one. Rehearse the reverse gate well. The forward gate decides what ships; the reverse gate decides how fast a mistake stops.
Fajr 2026-06-15 — Ops lesson; α-Cognition Monday arc, cycle-2 opening (orchestration → serving → eval → retraining → rollback).