Hedronite · Polyglot-Dev · Rust × α-Cognition · Mon 2026-06-01 · W3 / C1

Rust's Trait Objects and Async Streams for Composable Eval Pipelines

Type-erased scorers, streaming verdicts, and the aggregation discipline.

Lesson Class: Dev · Rust through α-Cognition

Language Week: Lang W3 of 3 (Mon+Thu W3 → Rust)

First Crossing: Rust × α-Cognition (5th Rust lesson, first α-pair)

Word Count: ~2,500

Paired Ops: Evaluation Pipelines for Multi-Agent Cognition

Paired Cert: Vertex AI Model Monitoring + GenAI Eval (Google)

Discipline: ROD v3 + clean-code-blocks (no inline # comments)

§ I Frame

The four prior Rust lessons in this curriculum all used the type-state pattern. mTLS connection lifecycles, validator signing keys, secret handling, order-lifecycle state machines — each lesson encoded a fixed progression of states known at compile time, and let the compiler refuse transitions that the state machine did not permit.

The eval pipeline today's Ops lesson named is a different shape of problem. The pipeline composes a chain of scorers chosen at configuration time, not at compile time. An operator picks five property-based scorers, two LLM-as-judge scorers, and one reference-based scorer for the regression set today; tomorrow's operator picks a different stack. The set of possible scorers is open: vendors will publish new ones, the team will write new ones, the production catalog grows. Encoding every scorer choice as a distinct type at compile time would fight the problem rather than fit it.

Rust answers this shape with trait objects. A Box<dyn Scorer> is a heap-allocated value whose concrete type is erased at the boundary; the holder calls trait methods on it without knowing whether the value is a JsonValidScorer or a JudgeRelevanceScorer or a third-party CitationScorer shipped by a vendor. The eval pipeline holds a Vec<Box<dyn Scorer>>, walks it for each input, and produces a verdict stream the aggregator consumes.

Today's lesson encodes that pipeline. Three Rust mechanisms compose: the sealed trait that bounds the universe of scorers the pipeline accepts; the trait-object value that lets a heterogeneous stack be held in one collection; the async stream that carries model outputs through the stack and verdicts out the other side.

§ II Language Idiom — Trait Objects, Sealed Traits, and Async Streams

The Rust trait-object pattern has three building blocks for this shape.

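The first building block is the object-safe trait (newer Rust documentation calls this dyn-compatible). A trait is object-safe when its methods can be invoked through dynamic dispatch: no generic type parameters on methods, no Self outside the receiver position (a method that returns Self or Box<Self> has to be opted out with where Self: Sized), no associated constants, and no Self: Sized bound on the trait itself. Object safety is the gate that lets a trait become dyn. The Scorer trait this lesson defines is built object-safe from the first character.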
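A minimal sketch of that gate, using a hypothetical Check trait and two toy implementations that are not part of this lesson's crate. It shows one way to keep a generic convenience method on a trait without losing dyn: the where Self: Sized bound removes that one method from the trait-object surface and leaves the rest dispatchable.

```rust
/// Hypothetical trait for illustration only; it is not the lesson's Scorer.
pub trait Check {
    /// Dispatchable: `Self` appears only in the receiver.
    fn label(&self) -> &'static str;

    /// Dispatchable: concrete argument and return types.
    fn accepts(&self, text: &str) -> bool;

    /// Generic, so it cannot live in a vtable. The `Self: Sized` bound
    /// takes it off the `dyn Check` surface instead of making the whole
    /// trait unusable as a trait object.
    fn accepts_all<I>(&self, items: I) -> bool
    where
        Self: Sized,
        I: IntoIterator<Item = String>,
    {
        items.into_iter().all(|item| self.accepts(&item))
    }
}

/// Toy implementation: rejects empty text.
pub struct NonEmpty;

/// Toy implementation: rejects text longer than the given byte length.
pub struct MaxLen(pub usize);

impl Check for NonEmpty {
    fn label(&self) -> &'static str {
        "non_empty"
    }
    fn accepts(&self, text: &str) -> bool {
        !text.is_empty()
    }
}

impl Check for MaxLen {
    fn label(&self) -> &'static str {
        "max_len"
    }
    fn accepts(&self, text: &str) -> bool {
        text.len() <= self.0
    }
}

/// Walks a heterogeneous stack through dynamic dispatch and returns the
/// labels of the checks that rejected the text.
pub fn failed_checks(checks: &[Box<dyn Check>], text: &str) -> Vec<&'static str> {
    checks
        .iter()
        .filter(|check| !check.accepts(text))
        .map(|check| check.label())
        .collect()
}
```

Calling accepts_all on a concrete MaxLen compiles; calling it on a Box<dyn Check> does not, which is the trade the bound makes.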

The second building block is the sealed-trait pattern, familiar from the prior Rust lessons. Sealing the trait against downstream implementation keeps the pipeline's safety contract local: the eval pipeline can guarantee its own correctness, but it cannot guarantee that a third-party scorer respects the contract unless every implementation passes through code the pipeline owns. For an open scorer ecosystem the seal is selective: the contract trait is sealed, and the vendor-facing trait that feeds it is open. This lesson uses a single sealed Scorer trait; vendor scorers reach it through a registration shim the executor controls, rather than by implementing Scorer directly.
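The lesson does not show the registration shim itself, so the following is only one plausible shape for it, written synchronously to stay inside the standard library. The names Contract, VendorScorer, Registered, register, and WordCountScorer are all hypothetical, and the clamping rule is an invented example of a contract the shim could enforce.

```rust
mod sealed {
    pub trait Sealed {}
}

/// Sealed contract: downstream crates cannot name `sealed::Sealed`,
/// so only this crate can add implementations.
pub trait Contract: sealed::Sealed {
    fn name(&self) -> &'static str;
    fn score(&self, text: &str) -> f64;
}

/// Open trait (hypothetical) that vendor crates are free to implement.
pub trait VendorScorer: Send + Sync {
    fn vendor_name(&self) -> &'static str;
    fn raw_score(&self, text: &str) -> f64;
}

/// Registration shim owned by the pipeline crate. It is the only bridge
/// from the open trait to the sealed one, so whatever it enforces holds
/// for every vendor scorer.
pub struct Registered<V: VendorScorer>(V);

impl<V: VendorScorer> sealed::Sealed for Registered<V> {}

impl<V: VendorScorer> Contract for Registered<V> {
    fn name(&self) -> &'static str {
        self.0.vendor_name()
    }

    /// Assumed contract for this sketch: scores are finite and in [0, 1].
    fn score(&self, text: &str) -> f64 {
        let raw = self.0.raw_score(text);
        if raw.is_nan() {
            0.0
        } else {
            raw.clamp(0.0, 1.0)
        }
    }
}

/// Entry point a vendor would call to hand a scorer to the pipeline.
pub fn register<V: VendorScorer + 'static>(vendor: V) -> Box<dyn Contract> {
    Box::new(Registered(vendor))
}

/// Stand-in for a scorer shipped by a vendor crate.
pub struct WordCountScorer;

impl VendorScorer for WordCountScorer {
    fn vendor_name(&self) -> &'static str {
        "vendor_word_count"
    }
    fn raw_score(&self, text: &str) -> f64 {
        text.split_whitespace().count() as f64 / 2.0
    }
}
```

The vendor never writes impl Contract; the out-of-range score of 2.0 that WordCountScorer produces for a four-word input reaches the pipeline as 1.0.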

The third building block is the async stream. The futures::Stream trait is the async counterpart to Iterator: a stream produces a sequence of values, possibly waiting for I/O between each. The eval pipeline's input is a stream of model outputs that arrive as the model generates them; the pipeline's output is a stream of verdict records that the aggregator folds into metrics. Streams compose through StreamExt: map, filter, and fold mirror their Iterator namesakes, and stream-only combinators such as buffered add bounded concurrency.

The three compose. The sealed trait names what a scorer is. The trait object lets the pipeline hold a heterogeneous stack of scorers. The async stream carries data through the stack at the rate the I/O permits.

§ III Code Worked Example — An Eval Pipeline End-to-End

The eval pipeline lives in a crate with three modules: the scorer contract, the pipeline runtime, and the aggregator. The example below shows the spine; production code adds error handling, metrics emission, and per-scorer timeout discipline that the spine elides.

The scorer contract is a sealed object-safe trait. Each implementation answers one question about one output.

mod sealed {
    pub trait Sealed {}
}

pub struct ModelOutput {
    pub request_id: String,
    pub model_version: String,
    pub prompt_version: String,
    pub text: String,
    pub request_type: RequestType,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RequestType {
    CodeReview,
    Summarization,
    AgentPlan,
    Refusal,
}

pub struct Verdict {
    pub scorer_name: &'static str,
    pub score: f64,
    pub passed: bool,
    pub detail: Option<String>,
}

#[async_trait::async_trait]
pub trait Scorer: sealed::Sealed + Send + Sync {
    fn name(&self) -> &'static str;
    async fn score(&self, output: &ModelOutput) -> Verdict;
}

The trait is object-safe because neither method is generic and Self appears only in the receiver. async_trait rewrites the async method into an ordinary method that returns a boxed future, roughly Pin<Box<dyn Future<Output = Verdict> + Send + '_>>, which keeps it callable through dyn Scorer. The Send + Sync bounds let the trait object cross thread boundaries, which the pipeline needs because each scorer is invoked from an async task that the executor may run on any worker thread.
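To make the desugaring concrete, here is a hand-written approximation of the shape async_trait produces; the macro's real expansion differs in detail. ManualScorer, BoxedVerdict, NonEmptyScorer, and the busy-polling block_on are hypothetical names for this sketch, the structs are cut down to the fields it needs, and the toy executor exists only so the example runs on the standard library alone.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

pub struct ModelOutput {
    pub text: String,
}

pub struct Verdict {
    pub scorer_name: &'static str,
    pub passed: bool,
}

/// The boxed, type-erased future an async trait method turns into.
pub type BoxedVerdict<'a> = Pin<Box<dyn Future<Output = Verdict> + Send + 'a>>;

/// Approximation of an async trait after desugaring: an ordinary method
/// with a concrete return type, so the trait stays usable as `dyn`.
pub trait ManualScorer: Send + Sync {
    fn name(&self) -> &'static str;
    fn score<'a>(&'a self, output: &'a ModelOutput) -> BoxedVerdict<'a>;
}

pub struct NonEmptyScorer;

impl ManualScorer for NonEmptyScorer {
    fn name(&self) -> &'static str {
        "non_empty"
    }

    fn score<'a>(&'a self, output: &'a ModelOutput) -> BoxedVerdict<'a> {
        Box::pin(async move {
            Verdict {
                scorer_name: self.name(),
                passed: !output.text.trim().is_empty(),
            }
        })
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor (assumption for this sketch): polls in a loop with a
/// waker that does nothing. A real pipeline would run on an async runtime.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = std::pin::pin!(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

The cost of this shape is one heap allocation per score call, which is usually negligible next to a judge-model round trip.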

Three concrete scorers illustrate the three families the Ops lesson named. The first is property-based: it checks that the output is valid JSON, for an agent that promised to emit structured planning.

pub struct JsonValidScorer;
impl sealed::Sealed for JsonValidScorer {}

#[async_trait::async_trait]
impl Scorer for JsonValidScorer {
    fn name(&self) -> &'static str { "json_valid" }
    async fn score(&self, output: &ModelOutput) -> Verdict {
        let parsed: Result<serde_json::Value, _> = serde_json::from_str(&output.text);
        Verdict {
            scorer_name: self.name(),
            score: if parsed.is_ok() { 1.0 } else { 0.0 },
            passed: parsed.is_ok(),
            detail: parsed.err().map(|e| e.to_string()),
        }
    }
}

The second is reference-based: it compares the output against a known-correct response for the regression set. The scorer holds the reference internally so the pipeline does not have to thread reference data through every call.

pub struct ExactMatchScorer {
    references: std::collections::HashMap<String, String>,
}
impl sealed::Sealed for ExactMatchScorer {}

#[async_trait::async_trait]
impl Scorer for ExactMatchScorer {
    fn name(&self) -> &'static str { "exact_match" }
    async fn score(&self, output: &ModelOutput) -> Verdict {
        let reference = self.references.get(&output.request_id);
        let matched = reference.map(|r| r == &output.text).unwrap_or(false);
        Verdict {
            scorer_name: self.name(),
            score: if matched { 1.0 } else { 0.0 },
            passed: matched,
            detail: None,
        }
    }
}

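The third is LLM-as-judge: it invokes a second model with a rubric prompt and parses a numeric score from the response. The implementation holds a client to the judge model; the trait object hides the client behind the trait. JudgeClient and parse_judge_score are taken to be defined elsewhere in the crate, and a response that fails to parse scores 0.0.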

pub struct JudgeRelevanceScorer {
    judge_client: JudgeClient,
    rubric_template: String,
    pass_threshold: f64,
}
impl sealed::Sealed for JudgeRelevanceScorer {}

#[async_trait::async_trait]
impl Scorer for JudgeRelevanceScorer {
    fn name(&self) -> &'static str { "judge_relevance" }
    async fn score(&self, output: &ModelOutput) -> Verdict {
        let prompt = self.rubric_template.replace("{output}", &output.text);
        let judge_resp = self.judge_client.complete(&prompt).await;
        let score = parse_judge_score(&judge_resp).unwrap_or(0.0);
        Verdict {
            scorer_name: self.name(),
            score,
            passed: score >= self.pass_threshold,
            detail: Some(judge_resp),
        }
    }
}

Each scorer is a distinct type with distinct internal state. The pipeline holds them together as trait objects.

pub struct EvalPipeline {
    scorers: Vec<Box<dyn Scorer>>,
}

impl EvalPipeline {
    pub fn new() -> Self { Self { scorers: Vec::new() } }
    pub fn add(mut self, scorer: Box<dyn Scorer>) -> Self {
        self.scorers.push(scorer);
        self
    }
}

The pipeline's run method takes a stream of ModelOutput and returns a stream of (ModelOutput, Vec<Verdict>) pairs. The implementation uses futures::stream combinators to drive each scorer concurrently per output and collect all verdicts before yielding the next item.

use futures::stream::{Stream, StreamExt};

impl EvalPipeline {
    pub fn run<'a, S>(&'a self, inputs: S)
        -> impl Stream<Item = (ModelOutput, Vec<Verdict>)> + 'a
    where
        S: Stream<Item = ModelOutput> + Send + 'a,
    {
        inputs.then(move |output| async move {
            let verdicts = futures::future::join_all(
                self.scorers.iter().map(|s| s.score(&output))
            ).await;
            (output, verdicts)
        })
    }
}

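The then combinator processes one output at a time and preserves stream order. For higher throughput, mapping each output to its scoring future and applying buffered(N) keeps N outputs in flight while still yielding results in input order; buffer_unordered(N) yields each result as soon as it completes, at the cost of order. The choice depends on whether the aggregator's downstream slicing assumes order or treats each verdict independently.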

The aggregator folds the verdict stream into per-version, per-request-type accumulators. The accumulator key is the triple the Ops lesson named — model version, prompt version, request type.

#[derive(Eq, PartialEq, Hash, Clone)]
pub struct SliceKey {
    pub model_version: String,
    pub prompt_version: String,
    pub request_type: RequestType,
}

#[derive(Default)]
pub struct ScoreAccumulator {
    pub count: u64,
    pub sum: f64,
    pub passes: u64,
}

impl ScoreAccumulator {
    pub fn record(&mut self, v: &Verdict) {
        self.count += 1;
        self.sum += v.score;
        if v.passed { self.passes += 1; }
    }
    pub fn mean(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
    pub fn pass_rate(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.passes as f64 / self.count as f64 }
    }
}

The aggregator's map keys are the triple plus the scorer name, so the verdict surface can ask what is the judge_relevance score for code_review on model_version v2.1 with prompt_version p7. Each key holds a typed accumulator with mean and pass-rate methods. The dashboard layer reads the map and produces the per-slice metrics the operator sees.
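The fold described above can be sketched synchronously. This version assumes the SliceKey has already been extracted from each ModelOutput, trims Verdict and RequestType to what the fold needs, and uses the hypothetical names SliceMap and aggregate; an async version would presumably drive the same loop body from the verdict stream.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RequestType {
    CodeReview,
    Summarization,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SliceKey {
    pub model_version: String,
    pub prompt_version: String,
    pub request_type: RequestType,
}

pub struct Verdict {
    pub scorer_name: &'static str,
    pub score: f64,
    pub passed: bool,
}

#[derive(Default)]
pub struct ScoreAccumulator {
    pub count: u64,
    pub sum: f64,
    pub passes: u64,
}

impl ScoreAccumulator {
    pub fn record(&mut self, v: &Verdict) {
        self.count += 1;
        self.sum += v.score;
        if v.passed {
            self.passes += 1;
        }
    }
    pub fn mean(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
    pub fn pass_rate(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.passes as f64 / self.count as f64 }
    }
}

/// Hypothetical alias: the triple plus the scorer name, as the text describes.
pub type SliceMap = HashMap<(SliceKey, &'static str), ScoreAccumulator>;

/// Folds (slice, verdicts) pairs into one accumulator per slice and scorer.
pub fn aggregate<I>(scored: I) -> SliceMap
where
    I: IntoIterator<Item = (SliceKey, Vec<Verdict>)>,
{
    let mut map = SliceMap::new();
    for (key, verdicts) in scored {
        for verdict in &verdicts {
            map.entry((key.clone(), verdict.scorer_name))
                .or_default()
                .record(verdict);
        }
    }
    map
}
```

Cloning the key once per verdict keeps the sketch short; a production fold could intern the slice keys if that allocation shows up in profiles.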

Composition. Three pieces, composed: the sealed trait, the trait-object collection, the streaming runtime. The pipeline accepts a different scorer stack without changes to the runtime code, walks each output through every scorer concurrently, and produces a verdict stream the aggregator can fold into typed metrics.

§ IV Connection to Today's Ops Lesson

The Ops lesson named four primitives: eval set, scorer, aggregator, verdict surface. The Rust encoding maps each one.

The eval set is a stream of ModelOutput values: the pipeline does not care whether the stream comes from a regression-set replay, a canary-traffic tap, or an adversarial-set runner. Three callers feed the same EvalPipeline from three different sources; the pipeline's interface is uniform.

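The scorer is the Box<dyn Scorer>. Reference-based, property-based, LLM-as-judge: each is one implementation of the same sealed trait. Adding a scorer to the stack at runtime is a pipeline = pipeline.add(Box::new(MyScorer { ... })) call, since add takes the pipeline by value and hands it back. A brand-new scorer type still has to be compiled into the crate that owns the seal, but the pipeline runtime itself is untouched when the stack changes.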
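A sketch of the configuration-time selection this enables, under stated simplifications: the Scorer trait here is a synchronous stand-in for the lesson's async one, and the config syntax ("non_empty", "max_len:5") and the functions scorer_from_config and pipeline_from_config are invented for illustration.

```rust
/// Simplified synchronous stand-in for the lesson's async `Scorer` trait.
pub trait Scorer: Send + Sync {
    fn name(&self) -> &'static str;
    fn score(&self, text: &str) -> f64;
}

pub struct NonEmptyScorer;

pub struct MaxLenScorer {
    pub max_len: usize,
}

impl Scorer for NonEmptyScorer {
    fn name(&self) -> &'static str {
        "non_empty"
    }
    fn score(&self, text: &str) -> f64 {
        if text.is_empty() { 0.0 } else { 1.0 }
    }
}

impl Scorer for MaxLenScorer {
    fn name(&self) -> &'static str {
        "max_len"
    }
    fn score(&self, text: &str) -> f64 {
        if text.len() <= self.max_len { 1.0 } else { 0.0 }
    }
}

#[derive(Default)]
pub struct EvalPipeline {
    scorers: Vec<Box<dyn Scorer>>,
}

impl EvalPipeline {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn add(mut self, scorer: Box<dyn Scorer>) -> Self {
        self.scorers.push(scorer);
        self
    }
    pub fn scorer_names(&self) -> Vec<&'static str> {
        self.scorers.iter().map(|s| s.name()).collect()
    }
    pub fn run_one(&self, text: &str) -> Vec<(&'static str, f64)> {
        self.scorers.iter().map(|s| (s.name(), s.score(text))).collect()
    }
}

/// Hypothetical catalog lookup: maps one config entry to a compiled-in scorer.
pub fn scorer_from_config(entry: &str) -> Result<Box<dyn Scorer>, String> {
    let scorer: Box<dyn Scorer> = match entry.split_once(':') {
        None if entry == "non_empty" => Box::new(NonEmptyScorer),
        Some(("max_len", arg)) => {
            let max_len = arg
                .parse::<usize>()
                .map_err(|e| format!("bad max_len `{arg}`: {e}"))?;
            Box::new(MaxLenScorer { max_len })
        }
        _ => return Err(format!("unknown scorer `{entry}`")),
    };
    Ok(scorer)
}

/// Builds the stack an operator named in configuration, in order.
pub fn pipeline_from_config(entries: &[&str]) -> Result<EvalPipeline, String> {
    let mut pipeline = EvalPipeline::new();
    for entry in entries {
        pipeline = pipeline.add(scorer_from_config(entry)?);
    }
    Ok(pipeline)
}
```

Tomorrow's operator changes the entry list, not the code; an entry that names a scorer missing from the catalog fails at pipeline construction rather than mid-run.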

The aggregator is the HashMap<(SliceKey, &'static str), ScoreAccumulator> folded out of the verdict stream. The triple-keyed slicing the Ops lesson named lands as a typed key. The bootstrap confidence interval the Ops lesson called for is one more method on the accumulator: a confidence_interval(&self, samples: u64) -> (f64, f64) that resamples internally. Resampling needs the individual scores, so an accumulator that offers it has to retain them (or a bounded sample of them) alongside the running count and sum.
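A minimal sketch of such a method as a percentile bootstrap over retained scores. Several things here are assumptions rather than the lesson's design: the RetainingAccumulator type, the extra seed parameter, the Option return for the empty case, the roughly 95% interval, and the tiny XorShift64 generator, which stands in for a vetted random-number crate only because the standard library has none.

```rust
/// Tiny xorshift64 generator (assumption for this sketch). It keeps the
/// example dependency-free and deterministic; it is not a recommendation.
pub struct XorShift64(u64);

impl XorShift64 {
    /// Zero is a fixed point of xorshift, so it is swapped for a constant.
    pub fn new(seed: u64) -> Self {
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    /// Returns an index in `0..len`; the modulo bias is ignored here.
    fn next_index(&mut self, len: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % len as u64) as usize
    }
}

/// Hypothetical accumulator variant that keeps every score so it can resample.
#[derive(Default)]
pub struct RetainingAccumulator {
    scores: Vec<f64>,
}

impl RetainingAccumulator {
    pub fn record(&mut self, score: f64) {
        self.scores.push(score);
    }

    pub fn mean(&self) -> f64 {
        if self.scores.is_empty() {
            0.0
        } else {
            self.scores.iter().sum::<f64>() / self.scores.len() as f64
        }
    }

    /// Percentile bootstrap: draws `samples` resamples with replacement,
    /// takes each resample's mean, and returns the 2.5th and 97.5th
    /// percentiles of those means. Returns `None` when there is no data.
    pub fn confidence_interval(&self, samples: u64, seed: u64) -> Option<(f64, f64)> {
        if self.scores.is_empty() || samples == 0 {
            return None;
        }
        let n = self.scores.len();
        let mut rng = XorShift64::new(seed);
        let mut means = Vec::with_capacity(samples as usize);
        for _ in 0..samples {
            let mut total = 0.0;
            for _ in 0..n {
                total += self.scores[rng.next_index(n)];
            }
            means.push(total / n as f64);
        }
        means.sort_by(|a, b| a.total_cmp(b));
        let last = means.len() - 1;
        let lo = means[(last as f64 * 0.025).round() as usize];
        let hi = means[(last as f64 * 0.975).round() as usize];
        Some((lo, hi))
    }
}
```

Retaining every score costs memory proportional to the slice size; a bounded reservoir is the usual compromise when slices grow large.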

The verdict surface reads the aggregator's map and produces alerts. A promotion-gate check compares two slices' means; a rollback trigger watches a single slice's pass rate against a threshold; an adversarial freeze checks whether any verdict in the adversarial-set's slice produced passed: false where the prior version had passed: true. Each verdict pattern is a function over the map; none requires changes to the pipeline runtime.
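Two of these verdict patterns can be sketched as plain functions over the map. The names promotion_gate, should_roll_back, GateDecision, and SliceMap, and the max_regression and min_pass_rate parameters, are hypothetical; the lesson does not specify how the thresholds are chosen.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RequestType {
    CodeReview,
    Summarization,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SliceKey {
    pub model_version: String,
    pub prompt_version: String,
    pub request_type: RequestType,
}

#[derive(Clone, Copy, Default)]
pub struct ScoreAccumulator {
    pub count: u64,
    pub sum: f64,
    pub passes: u64,
}

impl ScoreAccumulator {
    pub fn mean(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
    pub fn pass_rate(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.passes as f64 / self.count as f64 }
    }
}

pub type SliceMap = HashMap<(SliceKey, &'static str), ScoreAccumulator>;

#[derive(Debug, PartialEq)]
pub enum GateDecision {
    Promote,
    Hold,
    MissingData,
}

/// Promotion gate: the candidate slice may trail the baseline mean by at
/// most `max_regression`. Absent or empty slices never promote.
pub fn promotion_gate(
    map: &SliceMap,
    baseline: &SliceKey,
    candidate: &SliceKey,
    scorer: &'static str,
    max_regression: f64,
) -> GateDecision {
    let base = map.get(&(baseline.clone(), scorer));
    let cand = map.get(&(candidate.clone(), scorer));
    match (base, cand) {
        (Some(b), Some(c)) if b.count > 0 && c.count > 0 => {
            if c.mean() + max_regression >= b.mean() {
                GateDecision::Promote
            } else {
                GateDecision::Hold
            }
        }
        _ => GateDecision::MissingData,
    }
}

/// Rollback trigger: fires when a populated slice's pass rate falls
/// below `min_pass_rate`. A slice with no data does not fire.
pub fn should_roll_back(
    map: &SliceMap,
    slice: &SliceKey,
    scorer: &'static str,
    min_pass_rate: f64,
) -> bool {
    map.get(&(slice.clone(), scorer))
        .map(|acc| acc.count > 0 && acc.pass_rate() < min_pass_rate)
        .unwrap_or(false)
}
```

Both functions only read the map, which is what lets new verdict patterns be added without touching the pipeline runtime.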

The shape the Ops lesson described in prose lands in Rust as a typed graph. The compiler will not let a scorer with the wrong output shape compile. The aggregator will not let a slice key be assembled with the wrong type. The verdict surface receives data whose meaning is named at every level.

§ V Prior-Lesson Reach

The 2026-05-29 Rust lesson on order-lifecycle state machines used the sealed trait pattern to bound the universe of states an Order<S> could inhabit. Today's pipeline reuses the same sealing technique on the Scorer trait, with a different intent: state-machine sealing prevented downstream crates from inventing illegal states; scorer-trait sealing keeps the contract local while letting a registration shim accept new implementations the executor controls. Same pattern, different bounding purpose.

The 2026-05-26 Rust lesson on secret-handling types named Zeroize and the ownership transfer that guarantees a secret cannot be observed after use. The eval pipeline imports this discipline when scorers handle user-private content: the ModelOutput text field is borrowed by the scorer (not owned), so the pipeline retains the ownership that lets it zero the buffer after every scorer has run. The lesson's discipline travels into the eval pipeline's data lifecycle.

The 2026-05-20 Rust lesson on mTLS type-state introduced the phantom-typed state marker that the prior four Rust lessons all use. Today's lesson does not use phantom types: the trait-object pattern is the deliberate alternative when the universe of types is open. Recognising which pattern to reach for is itself the discipline this lesson adds: type-state for fixed progressions, trait objects for runtime-composed stacks.

§ VI Closing

The eval pipeline is a typed graph that the compiler verifies and the runtime composes. Scorers are sealed trait objects; the pipeline is an async stream; the aggregator is a typed fold; the verdict surface is a function over the fold's result. Rust gives each layer a shape the next layer cannot violate.

Examine the trait-object pattern well. The type-state pattern was the right answer for the prior four problems; the trait-object pattern is the right answer for the open-stack eval pipeline; recognising which pattern fits which shape is what the curriculum builds.

🫡 ⚖️ 📜

Leo.Syri — Praetor Consulate, Imperium Luminaura
Fajr 2026-06-01 — Dev lesson #15 in the curriculum spine; #5 in Rust track; first Rust × α-Cognition crossing.