Evaluation Pipelines for Multi-Agent Cognition
Production eval loops, drift detection, and the quality-surface discipline.
§ I Frame
A week ago the Mon-α curriculum named model-serving topology — routing, pooling, versioning, cost. Serving names how a request reaches a model. It does not name how the operator knows the model that answered was the right model to answer. That second question is the one this lesson takes.
A model in production answers ten thousand requests a day. Most answers are fine. A small fraction are wrong in ways that matter: a router sent code-review traffic to a model that lost its instruction-following after a fine-tune; a Gemini variant behind one agent began hallucinating dates after a silent vendor upgrade; a retrieval step degraded because the embedding index drifted from the document corpus. None of these failures shows up in latency or error-rate. They show up in the answers themselves.
Evaluation is the layer that watches the answers. It runs continuously, scores each generation against a discipline the operator picked deliberately, and raises a verdict when the quality-surface bends. Done well, eval is the layer that lets the operator change models without holding their breath. Done badly, eval is a spreadsheet of accuracy numbers from three weeks ago that no one reads.
The operator who treats eval as a launch-gate inherits a one-time check that decays the moment it passes. The operator who treats eval as a continuous pipeline inherits a quality-surface that bends visibly before it breaks.
§ II Foundations
Four primitives carry the bulk of eval pipeline design: the eval set, the scorers, the aggregator, and the verdict surface. Name them; the rest follows.
The four compose into a loop: the eval set drives scorers that feed an aggregator that produces verdicts that change what serves the next request. Each primitive can be cheap or careful; production demands at least one careful instance of each.
§ III Mechanism
The eval set, in three layers
A useful eval set is not one set. It is three.
The first layer is the regression set: a fixed bank of inputs with known-correct outputs, drawn from real production traffic and labeled once. The regression set answers the question "did this version get worse at what the prior version did well?" It is small (typically 200–2,000 inputs), labeled carefully, frozen by convention, and re-run on every candidate before that candidate sees live traffic.
The second layer is the canary set: a rolling window of recent production inputs with no labels, sampled uniformly across request types. The canary set answers the question "what is the model actually doing right now, on the traffic that exists now?" Because the inputs are unlabeled, they are scored by property-based or LLM-as-judge scorers, and those scores feed drift detection.
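One way to read "sampled uniformly across request types" is to draw the same number of inputs from each type, so a high-volume type cannot crowd out a rare one. The sketch below assumes that reading; the `type` field, the `sample_canary` name, and the per-type quota are illustrative assumptions, not part of the lesson.

```python
import random
from collections import defaultdict


def sample_canary(requests, per_type, seed=None):
    """Draw up to `per_type` requests from each request type.

    Assumption: each request is a dict carrying a "type" key
    (e.g. "code_review", "summarization"). Hypothetical schema.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for req in requests:
        by_type[req["type"]].append(req)

    sample = []
    for request_type in sorted(by_type):
        pool = by_type[request_type]
        # Take the whole pool when a type has fewer items than the quota.
        sample.extend(rng.sample(pool, min(per_type, len(pool))))
    return sample
```

A rare request type contributes everything it has, which keeps it visible in drift detection even when it is a small share of traffic.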
The third layer is the adversarial set: a curated bank of inputs designed to probe known failure modes, such as jailbreak prompts, edge-case formats, requests near policy boundaries, and inputs that surfaced bugs in prior versions. The adversarial set answers the question "can the model still refuse what it must refuse, and still handle what it must handle?" It grows over time; every production failure that escapes to a user gets added.
These three are not interchangeable. A model that aces the regression set can still drift on the canary set. A model clean on both can still fail on the adversarial set. Production eval runs all three on every candidate, weights them by what the operator cares most about, and exposes per-layer scores separately so a regression in one is not hidden by stability in another.
Scorers chosen by what they answer
Reference-based scorers are the cheapest and the most brittle. Exact-match works for closed-form outputs (a date, a structured object, a single classification label) and fails wherever the model is free to phrase the answer multiple correct ways. Semantic-similarity scorers cover that phrasing freedom but introduce a second model into the eval pipeline — its biases become eval biases. Use exact match where the answer space is finite; reach for semantic similarity only when the looseness matters and the second model is one the operator has separately validated.
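A minimal sketch of exact-match scoring for the two closed-form cases named above, a single label and a structured object. The function names and the normalization choices (trimming whitespace, ignoring case, comparing parsed JSON rather than raw text) are assumptions made for illustration.

```python
import json


def exact_match_label(output: str, reference: str) -> float:
    """Score 1.0 when the label matches after trimming and case-folding."""
    same = output.strip().casefold() == reference.strip().casefold()
    return 1.0 if same else 0.0


def exact_match_json(output: str, reference: str) -> float:
    """Score 1.0 when both strings parse to equal JSON values.

    Comparing parsed values ignores key order and whitespace,
    which raw string comparison would wrongly penalize.
    """
    try:
        return 1.0 if json.loads(output) == json.loads(reference) else 0.0
    except json.JSONDecodeError:
        # Unparseable output is simply a miss.
        return 0.0
```

Even this small amount of normalization is a design decision: each rule loosens what counts as a match, so it is worth keeping the rules few and explicit.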
Property-based scorers are where production eval lives. Typical properties: the output is valid JSON; the output contains at least one citation; the output does not contain the user's private data; the output refuses requests on the prohibited list. Each property is a predicate that is cheap to evaluate and catches a specific class of failure. A production system runs dozens of property-based scorers per output; they compose by AND for hard requirements and by weighted average for soft preferences.
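The composition rule can be sketched as follows: hard requirements combine with AND, soft preferences with a weighted average. The predicate names, the `[1]`-style citation pattern, and the `PropertyVerdict` shape are all illustrative assumptions.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Predicate = Callable[[str], bool]


def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_citation(output: str) -> bool:
    # Assumption: citations appear as bracketed numbers such as [1].
    return re.search(r"\[\d+\]", output) is not None


@dataclass
class PropertyVerdict:
    passed_hard: bool   # AND over all hard requirements
    soft_score: float   # weighted average over soft preferences, in [0, 1]


def score_properties(
    output: str,
    hard: List[Predicate],
    soft: List[Tuple[Predicate, float]],
) -> PropertyVerdict:
    passed_hard = all(pred(output) for pred in hard)
    total_weight = sum(weight for _, weight in soft)
    if total_weight == 0:
        soft_score = 1.0  # no soft preferences configured
    else:
        earned = sum(weight for pred, weight in soft if pred(output))
        soft_score = earned / total_weight
    return PropertyVerdict(passed_hard, soft_score)
```

Keeping the hard result and the soft score as separate fields means a failed hard requirement can never be averaged away by strong soft scores.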
LLM-as-judge scorers handle the cases nothing else can — is this code review actually helpful, does this summary preserve the source's stance, is this agent's plan coherent with its stated goal. The judge is another model with a rubric prompt. The pattern works when the rubric is specific enough that two humans would agree with the judge most of the time. The pattern fails when the rubric is vague enough that the judge's biases dominate. The discipline is to validate the judge on a human-labeled subset before trusting it at scale.
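Validating the judge on a human-labeled subset comes down to measuring how often the judge and the humans agree. The sketch below reports raw agreement and Cohen's kappa, which discounts the agreement expected by chance. Kappa is one common choice and is not named in the lesson; the function name and the label strings in the test data are illustrative.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(
    judge: Sequence[str], human: Sequence[str]
) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1.0 - expected)
```

High raw agreement with low kappa usually means one label dominates the subset, so the judge could be agreeing by default rather than by judgment.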
The aggregator as the operator's view
A scorer that prints a number to a log file is not useful. The aggregator is the layer that turns per-input scores into the per-version metrics an operator can read at a glance.
The aggregator slices three ways. First, by version — every score carries the model version, the prompt version, and the retrieval index version that produced it; aggregates roll up per-triple so a regression points at which component changed. Second, by request type — code review, summarization, agent planning, refusal. A single mean across all types hides the case where one type degraded while three improved. Third, by time window — last hour, last day, last week — so a slow drift is visible before it becomes a sharp drop.
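The three slices can be sketched as a grouping key plus a time filter. The record fields and the `aggregate` function below are hypothetical names chosen to mirror the prose: the version triple, the request type, and a trailing time window.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    model_version: str
    prompt_version: str
    index_version: str
    request_type: str
    timestamp: float  # seconds since epoch
    score: float


SliceKey = Tuple[str, str, str, str]


def aggregate(
    records: Iterable[ScoreRecord], now: float, window_seconds: float
) -> Dict[SliceKey, float]:
    """Mean score per (model, prompt, index, request type) in the window."""
    buckets = defaultdict(list)
    for r in records:
        # Slice 3: keep only records inside the trailing time window.
        if now - window_seconds <= r.timestamp <= now:
            # Slices 1 and 2: version triple plus request type.
            key = (r.model_version, r.prompt_version,
                   r.index_version, r.request_type)
            buckets[key].append(r.score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Calling `aggregate` with several window lengths (an hour, a day, a week) over the same records gives the side-by-side view in which slow drift becomes visible.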
The aggregator also computes confidence. A regression set of 200 inputs scored at 0.84 mean does not tell the operator whether 0.84 is meaningfully different from the prior 0.86. Bootstrap confidence intervals on the difference, computed at the aggregator, do.
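A minimal sketch of a bootstrap confidence interval on the difference, assuming both versions were scored on the same regression inputs so the scores can be paired. The paired percentile bootstrap shown here is one reasonable variant, not necessarily the one the lesson has in mind; the function name, resample count, and seed are assumptions.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    candidate: Sequence[float],
    prior: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for mean(candidate - prior) over paired inputs."""
    if len(candidate) != len(prior) or not candidate:
        raise ValueError("score lists must be non-empty and equal length")

    rng = random.Random(seed)
    deltas = [c - p for c, p in zip(candidate, prior)]
    n = len(deltas)

    # Resample per-input deltas with replacement and record each mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the returned interval contains zero, the observed gap (for example 0.84 against 0.86) is not distinguishable from resampling noise at the chosen level; if the whole interval sits below zero, the candidate is measurably worse.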
The verdict surface and the promotion gate
The fourth primitive is where eval becomes control rather than observation. Three verdict patterns appear in practice.
Promotion gating. A candidate model version cannot move from a shadow endpoint to receive a percentage of live traffic until the regression set score is within tolerance of the prior version and the adversarial set score is at or above the prior. A candidate that fails either gate stays in shadow until the operator either acknowledges the regression or fixes it.
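The gate described above reduces to two comparisons. The sketch below assumes scalar per-set scores and an operator-chosen tolerance; `promotion_gate`, `GateResult`, and the reason strings are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GateResult:
    promote: bool
    reasons: List[str] = field(default_factory=list)


def promotion_gate(
    regression_candidate: float,
    regression_prior: float,
    adversarial_candidate: float,
    adversarial_prior: float,
    tolerance: float,
) -> GateResult:
    """Allow promotion out of shadow only if both conditions hold."""
    reasons = []
    # Regression set: candidate may trail the prior by at most `tolerance`.
    if regression_candidate < regression_prior - tolerance:
        reasons.append("regression score below tolerance")
    # Adversarial set: no slack at all; must be at or above the prior.
    if adversarial_candidate < adversarial_prior:
        reasons.append("adversarial score below prior")
    return GateResult(promote=not reasons, reasons=reasons)
```

Returning the reasons alongside the boolean lets the pipeline tell the operator which gate failed rather than only that promotion was blocked.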
Rollback triggering. A live version that drifts beyond a threshold on the canary set within a sustained window gets a verdict that triggers either an automatic rollback to the prior version or a paged alert to an operator. The choice between automatic and human-gated rollback is policy: automatic catches fast failures, human-gated avoids over-reacting to short blips.
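One way to express "beyond a threshold within a sustained window" is to require several consecutive canary windows below the baseline before firing. That interpretation, along with the `DriftMonitor` name and its parameters, is an assumption made for this sketch.

```python
from collections import deque


class DriftMonitor:
    """Fire only after `sustained_windows` consecutive drifted windows."""

    def __init__(self, baseline: float, threshold: float,
                 sustained_windows: int) -> None:
        self.baseline = baseline
        self.threshold = threshold
        # Remember only the most recent N window outcomes.
        self._recent = deque(maxlen=sustained_windows)

    def observe(self, window_score: float) -> bool:
        """Record one canary window; return True when rollback should fire."""
        drifted = (self.baseline - window_score) > self.threshold
        self._recent.append(drifted)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)
```

A larger `sustained_windows` leans toward the human-gated end of the policy choice by ignoring short blips; a value of 1 behaves like an immediate automatic trigger.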
Freeze on adversarial regression. Any adversarial-set regression — a previously-refused prompt now answered, a previously-handled edge case now broken — freezes the version from further traffic increases until reviewed. Adversarial regressions get the strictest verdict because they signal capability or safety loss, not stochastic noise.
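Because the freeze rule is per case rather than per score, it can be sketched as a comparison of pass/fail results keyed by adversarial case. The case identifiers and function names are hypothetical.

```python
from typing import Dict, List


def adversarial_regressions(
    prior: Dict[str, bool], candidate: Dict[str, bool]
) -> List[str]:
    """Cases the prior version passed that the candidate now fails.

    A case missing from the candidate's results counts as a failure.
    """
    return sorted(
        case
        for case, passed in prior.items()
        if passed and not candidate.get(case, False)
    )


def should_freeze(prior: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    # Any single regression freezes traffic increases pending review.
    return bool(adversarial_regressions(prior, candidate))
```

Comparing case by case matters because an aggregate adversarial score can stay flat while one previously refused prompt starts being answered and an unrelated case starts passing.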
§ IV Worked Example
A code-review agent in production. The pipeline routes pull-request diffs to one of three models — a fast small model for short diffs, a stronger model for long ones, a code-specific fine-tune for security-sensitive paths.
The operator wants to upgrade the small model to a newer vendor release.
The regression set is 400 historical PR diffs labeled by senior engineers with the categories helpful, partially helpful, missed-the-point, incorrect-claim. Each diff was reviewed both by the prior small model and by a senior engineer; the engineer's verdict is the gold label.
The canary set samples 200 production diffs per hour and runs three property scorers: output references at least one specific line by number, output does not claim a bug exists where the diff did not change the relevant lines, output is under 600 tokens. The canary aggregator slices by repository.
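The three canary properties might be approximated as below. Several simplifications are assumed: line references are detected with a "line N" pattern, whitespace-separated words stand in for tokens (a real tokenizer would count differently), and the "no bug claims outside the diff" property is reduced to checking that every referenced line number is among the changed lines. All names are illustrative.

```python
import re
from typing import Set

# Assumption: reviews cite lines as "line 12" or "Line 12".
LINE_REF = re.compile(r"\bline\s+(\d+)\b", re.IGNORECASE)


def referenced_lines(review: str) -> Set[int]:
    return {int(number) for number in LINE_REF.findall(review)}


def references_a_line(review: str) -> bool:
    """Property 1: the review cites at least one specific line number."""
    return bool(referenced_lines(review))


def stays_within_diff(review: str, changed_lines: Set[int]) -> bool:
    """Property 2 (approximation): every cited line was changed by the diff."""
    return referenced_lines(review) <= changed_lines


def under_token_budget(review: str, limit: int = 600) -> bool:
    """Property 3 (approximation): word count as a proxy for token count."""
    return len(review.split()) < limit
```

The line-reference rate in a canary window is then the fraction of sampled reviews for which `references_a_line` returns true.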
The adversarial set is 80 hand-curated diffs: deliberately confusing diffs, diffs that look like they have bugs but do not, diffs from languages the prior version handled poorly, diffs containing prompt-injection attempts disguised as comments.
The candidate version is loaded behind a shadow endpoint per the prior serving lesson. The eval pipeline runs all three sets against shadow. The regression set returns a mean of 0.82 against the prior version's 0.84; the bootstrap 95-percent confidence interval on the difference includes zero, so the drop is within noise and the regression gate passes. The canary scorers run on shadow's outputs against the same recent inputs the live version sees; the line-reference rate drops from 92 percent to 78 percent. The adversarial set surfaces two regressions: one prompt-injection diff is now followed rather than refused; one historically-confusing diff produces a confident-but-incorrect verdict.
Verdict: the candidate cannot promote. The two adversarial regressions, the followed prompt injection and the confident-but-incorrect verdict, fall in the freeze category, and the 14-point drop in line-reference rate is canary drift that the regression set alone did not reveal. The pipeline pages the model-team lead with the specific failures, links the diffs, and shows the score deltas. The operator does not deploy until the regressions are addressed, and the regressions are addressable precisely because the eval pipeline named them before any user saw them.
§ V Connection to Prior Lessons
The W2-Mon lesson on model-serving topology named the four primitives of how a request reaches a model: routing, pooling, versioning, cost. This lesson is the layer the version primitive plugs into. Versioning makes blue-green and canary deployment possible; eval is what decides whether the blue-green flip happens or the canary stays at one percent.
The W1-Tue lesson on observability for multi-agent LLM systems named the three-level telemetry pattern: request traces, agent-step spans, model-call attributes. Eval consumes the model-call attributes (which version, which prompt template, which retrieval result) and joins them to outputs so the aggregator can slice by exactly the triple that produced each score.
The W1-Thu lesson on agent memory layers named persistent recall as the layer that lets agents carry state across sessions. Eval extends to the memory layer: a memory-write that introduces a stale fact is a regression that property-based scorers can catch, where the property is "the agent does not assert facts contradicted by source documents from the current week."
The three lessons compose. Observability sees what happened. Memory remembers what mattered. Eval judges whether what happened was good enough to keep doing.
§ VI Connection to Today's Dev Lesson
The Rust Dev lesson today encodes the scorer-composition pattern in types. Where this Ops lesson named property-based scorers as functions that take an output and produce a verdict, the Rust lesson shows how a stack of scorers composes through trait objects and async streams. A stream of model outputs flows through a chain of scorers; each scorer is a trait object the chain holds opaquely; and the chain produces a stream of per-input score records that the aggregator consumes. The Rust idiom makes the eval pipeline itself a typed graph: a scorer with the wrong output shape will not compile into the chain.
The lesson also shows the aggregation step as a fold over the verdict stream, with the aggregator's per-version, per-request-type slicing implemented as a typed key into a HashMap whose values are themselves typed score-accumulators. The shape this Ops lesson named in prose lands in Rust as a graph the compiler verifies.
§ VII Closing
Evaluation is the layer that keeps versioning honest. Without it, every model release is a leap of faith and every silent vendor upgrade is a roll of the dice. With it, the operator has a quality-surface that bends visibly before it breaks, a verdict surface that turns observation into control, and a discipline that survives the next model change.
Three layers of eval set. Three families of scorer. One aggregator that slices what matters. One verdict surface that closes the loop.
Examine the eval pipeline well. The model you serve tomorrow is the one its verdicts let through today.
Fajr 2026-06-01 — Ops lesson #9 in the curriculum spine; #5 in α-Cognition.