Model-Serving Topology for Multi-Agent Cognition Systems
Routing, pooling, versioning, and the inference-cost surface.
§ I Frame
Five lessons into the α-Cognition arc, the curriculum has covered orchestration (how agents coordinate), observability (how the operator sees them work), memory (how they recall what mattered), and the contextvars primitive in Python for keeping that memory per-task. Each lesson has assumed an unspoken fact: when an agent invokes a model, the model answers. This lesson names what makes that fact true in production.
Model-serving topology is the layer between the agent and the inference. It decides which model handles a request, how many concurrent requests a model can absorb, what happens when a model version updates mid-flight, and what each call costs. It is the layer where good architecture quietly absorbs traffic spikes and bad architecture quietly burns the operator's budget.
The operator who treats model-serving as "call the API" inherits whatever defaults the vendor or framework provides. The operator who treats it as a topology problem decides routing, pooling, versioning, and cost shape deliberately. The difference shows up at the first real load, when the question stops being "does it work" and starts being "what does it cost when ten thousand requests arrive in five seconds".
§ II Foundations
Four primitives carry the bulk of model-serving topology. Name them; reason about them after.
Routing
The decision of which model handles which request. Rules can be static (always send code-review to model A, summarization to model B) or dynamic (route by latency budget, by cost ceiling, by quality score, by the result of a smaller classifier). Where multi-model strategy lives.
Pooling
How concurrent requests share underlying compute. A model endpoint may serve one request at a time (poor utilization) or batch many through a single forward pass (high utilization but added latency for the first request in a batch). Where throughput is purchased, in latency units.
Versioning
The discipline that lets a model update without breaking in-flight requests or downstream agents. Carries a deployment story (blue-green, canary, shadow), a rollback story, and a contract-stability story (does v2 return the same shape as v1).
Cost surface
The accounting layer that turns the abstract "we called the model" into a real number tied to a request, an agent, a user, or a budget cap. Where economic discipline becomes engineering discipline.
The four compose. A routing decision sends a request to a pooled endpoint of a particular version, which produces tokens that increment the cost surface. Each primitive can be operated badly without the others noticing; they only become a coherent topology when an operator designs them together.
§ III Mechanism
How each primitive works inside a multi-agent system.
Routing across agent boundaries
When an orchestrator agent decides a worker should handle a sub-task, the worker still has to pick a model. The pick can happen at three layers. First, the worker can hard-code its model choice in its prompt template (simple, but brittle when the right model changes). Second, the worker can call a router that knows the catalog of available models and the rules for choosing among them (decoupled, but the router becomes a critical dependency). Third, a sidecar gateway can intercept all model calls and apply routing centrally without the worker knowing (transparent to the agent's code, opaque to the agent's own observability).
Each layer has a place. The hard-coded choice fits prototypes. The router fits production where model choice is itself a policy the operator wants to change without redeploying every agent. The sidecar gateway fits regulated environments where centralized rate-limiting, audit logging, and cost accounting are required by the surrounding governance, not the agent's discretion.
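As a rough sketch of the router layer, assume a hypothetical two-model catalog with made-up prices and latencies (none of the names or numbers below come from a real vendor). The router tries the static task rule first and falls back to the cheapest model that fits the request's latency budget:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    usd_per_1k_tokens: float  # hypothetical price
    p95_latency_s: float      # hypothetical measured latency


@dataclass(frozen=True)
class Request:
    task: str
    latency_budget_s: float


# Hypothetical catalog and static rules. In production these would live in
# router configuration so they can change without redeploying any agent.
CATALOG = {
    "small-fast": ModelSpec("small-fast", 0.0005, 0.4),
    "large-accurate": ModelSpec("large-accurate", 0.0100, 3.0),
}
STATIC_RULES = {"summarization": "small-fast", "code-review": "large-accurate"}


def route(req: Request) -> ModelSpec:
    """Static rule first; otherwise the cheapest model within the latency budget."""
    preferred = CATALOG.get(STATIC_RULES.get(req.task, ""))
    if preferred is not None and preferred.p95_latency_s <= req.latency_budget_s:
        return preferred
    candidates = [m for m in CATALOG.values() if m.p95_latency_s <= req.latency_budget_s]
    if not candidates:
        raise LookupError(f"no model fits a {req.latency_budget_s}s budget")
    return min(candidates, key=lambda m: m.usd_per_1k_tokens)
```

Because the worker only calls `route`, changing the policy means editing the catalog or the rules, not the worker.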
Routing rules themselves can be policy-driven (route by request metadata) or model-driven (a small classifier picks the model). Model-driven routing introduces an inference call to make an inference call, which seems redundant until you measure: a 50-millisecond classifier that prevents a 4-second wrong-model call pays for itself as soon as more than about one request in eighty would otherwise be misrouted. The cost surface tells you when the trade is worth it.
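The trade can be put in numbers. The sketch below uses the 50-millisecond and 4-second figures from the paragraph above and adds two simplifying assumptions of its own: the classifier always picks the right model, and a misrouted call wastes its full duration.

```python
def break_even_misroute_rate(classifier_s: float = 0.05, wasted_call_s: float = 4.0) -> float:
    """Misroute rate above which the classifier saves time on average."""
    return classifier_s / wasted_call_s


def net_saving_s(p_misroute: float, classifier_s: float = 0.05, wasted_call_s: float = 4.0) -> float:
    """Average seconds saved per request; negative means the classifier costs more than it saves."""
    return p_misroute * wasted_call_s - classifier_s
```

Under these assumptions the classifier breaks even at a misroute rate of 1.25 percent. A real classifier is imperfect, so the measured break-even point will sit somewhat higher.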
Pooling under variable load
A model endpoint that serves one request per second can serve close to eight per second by batching eight inputs through one forward pass, provided the batched pass takes about as long as a single one. The price is latency: the first request has to wait for the batch to fill, which is nearly a full second when requests arrive eight per second. The latency-throughput trade is a knob, not a constant. Where the knob sits depends on the request's tolerance for latency and the GPU's tolerance for underutilization.
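A back-of-envelope version of that knob, as a sketch only. It assumes an idealized endpoint whose forward pass takes the same time for any batch size up to the limit, and requests that arrive at a steady rate; both are simplifications, and the function name is illustrative.

```python
def batch_tradeoff(batch_size: int, forward_pass_s: float, arrival_rate_per_s: float):
    """Return (throughput in requests/s, seconds the first request waits for the batch to fill)."""
    throughput = batch_size / forward_pass_s
    fill_wait_s = (batch_size - 1) / arrival_rate_per_s
    return throughput, fill_wait_s
```

With `batch_tradeoff(8, 1.0, 8.0)` the throughput is eight requests per second and the first request waits 0.875 seconds for seven more to arrive. With a batch size of one the wait is zero and the throughput falls back to one per second.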
Three pooling regimes appear in practice. Static batching waits for a fixed number of requests or a fixed time window, whichever comes first. Simple, predictable, but wastes capacity when traffic is bursty. Dynamic batching assembles batches of variable size based on what arrived during the inference of the prior batch. More complex, higher utilization, harder to reason about latency tails. Continuous batching is the LLM-specific pattern where new requests join an in-progress batch at any decoding step rather than waiting for the prior batch to finish. Typically the highest utilization and the lowest tail latency of the three, but it requires runtime support (vLLM, TensorRT-LLM, and TGI all implement variants).
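To make the first regime concrete, here is a small offline simulation of static batching over a list of arrival timestamps. It is a sketch of the size-or-window rule only, not of how any particular serving runtime schedules work; the function name and parameters are illustrative.

```python
def static_batches(arrivals, max_batch, max_wait_s):
    """Group sorted arrival times into batches closed by size or by time window."""
    batches = []
    current = []
    for t in arrivals:
        # Close the open batch if its time window expired before this arrival.
        if current and t - current[0] > max_wait_s:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it reaches the size limit.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

For arrivals at `[0.0, 0.1, 0.2, 0.3, 2.0, 2.1]` with `max_batch=3` and `max_wait_s=0.5`, the result is three batches of sizes 3, 1, and 2. The single-request batch is the capacity that bursty traffic wastes.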
The operator does not get to ignore this layer when scale arrives. Continuous batching can double GPU efficiency over static batching at the same latency budget. That doubling is the difference between provisioning eight GPUs and provisioning sixteen for the same traffic.
Versioning without breaking the herd
A model that has been live for three weeks has agents downstream that expect its behavior. A retrain produces a new version. The operator needs to know two things before deploying: does the new version's output shape break any consumer, and does its behavior change in ways the consumers depend on. The first is a contract test (run the new model against a fixed input set, diff the output structure). The second is an evaluation suite (run the new model against a labeled set, measure quality on the tasks the consumers actually issue).
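The contract test can be as small as a structural diff. The sketch below assumes model outputs are already parsed into dictionaries; it compares keys and value types only, never the values themselves, and does not descend into lists. The field names in the usage note are hypothetical.

```python
def contract_diff(old, new, path="$"):
    """List structural differences between two parsed model outputs."""
    diffs = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() - new.keys()):
            diffs.append(f"{path}.{key}: missing in new version")
        for key in sorted(new.keys() - old.keys()):
            diffs.append(f"{path}.{key}: added in new version")
        for key in sorted(old.keys() & new.keys()):
            diffs.extend(contract_diff(old[key], new[key], f"{path}.{key}"))
    elif type(old) is not type(new):
        diffs.append(f"{path}: {type(old).__name__} -> {type(new).__name__}")
    return diffs
```

Run against a fixed input set, an empty list for every input means v2 keeps v1's shape. An entry such as `$.confidence: float -> str` is the kind of change that breaks a consumer even when the answer itself is right.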
Deployment patterns mirror the patterns from web-service deployment, with one wrinkle. Blue-green keeps two versions live and flips traffic atomically; rollback is a second flip. Canary routes a small percentage of traffic to the new version, watches metrics, ramps up if healthy. Shadow runs the new version on real traffic but discards its output, comparing it to the live version offline. Shadow is the pattern that handles the second question above: it lets the operator see how v2 would have answered the last seven days of real questions before any user sees v2's answer.
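Two of these patterns fit in a few lines. The sketch below assumes each request carries a stable identifier and that models are plain callables. The version labels `v1` and `v2` and the function names are illustrative, and a production shadow call would normally run asynchronously rather than inline.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in 0..99."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_version(request_id: str, canary_percent: int) -> str:
    """Canary: send roughly canary_percent of traffic to v2, deterministically per id."""
    return "v2" if bucket(request_id) < canary_percent else "v1"


def serve_with_shadow(prompt, live_model, shadow_model, comparison_log):
    """Shadow: return only the live answer; record both for offline comparison."""
    live_answer = live_model(prompt)
    try:
        shadow_answer = shadow_model(prompt)
    except Exception as exc:  # a shadow failure must never reach the caller
        shadow_answer = f"<shadow error: {exc}>"
    comparison_log.append((prompt, live_answer, shadow_answer))
    return live_answer
```

Hashing the identifier rather than drawing a random number keeps a given request on the same version as the percentage ramps up, which makes canary metrics easier to compare across stages.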
The wrinkle is statefulness. A model is technically stateless per call, but the agent that calls it is not. An agent that has cached context against v1's behavior may need its cache invalidated when v2 ships. The deployment plan covers the model and the consumers; missing the consumer side is the failure mode that shows up as "everything worked in staging".
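One way to keep the consumer side honest is to key every cached entry by the model version that produced it. This is a minimal in-memory sketch; the class and method names are hypothetical, and a real agent cache would add expiry and persistence.

```python
class VersionedCache:
    """Cache whose entries are only visible to the model version that produced them."""

    def __init__(self):
        self._store = {}

    def put(self, model_version: str, key: str, value) -> None:
        self._store[(model_version, key)] = value

    def get(self, model_version: str, key: str):
        # Returns None on a miss, including when the entry belongs to another version.
        return self._store.get((model_version, key))

    def invalidate_version(self, model_version: str) -> int:
        """Drop every entry written under model_version; return how many were dropped."""
        stale = [k for k in self._store if k[0] == model_version]
        for k in stale:
            del self._store[k]
        return len(stale)
```

With this shape an agent that switches to v2 simply misses on v1's entries, and the rollout plan can call `invalidate_version("v1")` once v1 is retired.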
Cost surface that the agents can see
Token-priced inference makes cost a per-request fact rather than a per-month fact. The operator who lets agents call models without per-request cost attribution is the operator who learns from the monthly bill that a recursive agent loop spent four hundred dollars exploring its own confusion.
A useful cost surface carries five attributes per call: model identifier, input token count, output token count, computed cost in the operator's currency, and a cost-attribution key (which agent, which user, which feature, which budget bucket). The five attributes turn the inference layer into a queryable resource the operator can budget against, alert against, and optimize against.
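The five attributes map directly onto a record type. In the sketch below the price table, the model name, and the attribution keys are all hypothetical; real prices come from the vendor's rate card or the operator's own GPU accounting.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    attribution_key: str  # which agent, user, feature, or budget bucket


# Hypothetical prices: (input, output) in USD per 1,000 tokens.
PRICES_USD_PER_1K = {"small-fast": (0.001, 0.002)}


def record_call(model_id: str, input_tokens: int, output_tokens: int, attribution_key: str) -> CostRecord:
    """Build the per-call record from token counts and the price table."""
    in_price, out_price = PRICES_USD_PER_1K[model_id]
    cost = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
    return CostRecord(model_id, input_tokens, output_tokens, round(cost, 6), attribution_key)


def spend_by_key(records) -> dict:
    """Total spend per attribution key: the query that budgets and alerts are built on."""
    totals = defaultdict(float)
    for r in records:
        totals[r.attribution_key] += r.cost_usd
    return dict(totals)
```

Once every call emits such a record, budgeting and alerting become queries over `spend_by_key` rather than surprises on an invoice.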
Budget enforcement at the gateway is the operator's leverage. A gateway that knows the per-agent monthly budget and refuses calls when the budget is exhausted prevents the runaway loop from becoming a runaway invoice. The pattern is identical to API rate-limiting but the unit is money rather than requests.
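A minimal sketch of that enforcement follows. It assumes the gateway charges an estimated cost before the call goes out; the class names are hypothetical, and a real gateway would also reconcile estimates against actual token counts afterward.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a call would push an agent past its budget."""


class BudgetGate:
    def __init__(self, budgets_usd: dict):
        self._budgets = dict(budgets_usd)
        self._spent = {agent: 0.0 for agent in budgets_usd}
        self._lock = threading.Lock()  # concurrent workers share one gate

    def charge(self, agent: str, cost_usd: float) -> float:
        """Record the spend and return the remaining budget, or refuse the call."""
        with self._lock:
            if self._spent[agent] + cost_usd > self._budgets[agent]:
                raise BudgetExceeded(f"{agent} would exceed its budget")
            self._spent[agent] += cost_usd
            return self._budgets[agent] - self._spent[agent]
```

The caller decides what a refusal means: return an error, queue the request, or switch to a cheaper path.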
§ IV Worked Example: Trading Agent Ensemble
A trading agent ensemble runs five worker agents (signal generation, position sizing, risk check, execution, reporting). Each worker calls models. The operator targets a daily inference budget of fifty dollars, a per-request p95 latency of two seconds, and a deployment cadence of weekly model retrains for the signal worker.
The topology that satisfies these targets has four pieces.
The router sits in front of all model calls. It is a small Go service (see today's paired Dev lesson) that holds the model catalog, the routing rules, and the cost-attribution logic. Workers call the router by name; the router selects the model, applies the call, returns the response with cost metadata attached. Workers never call vendor APIs directly.
The signal worker uses dynamic batching against a self-hosted small model on a single GPU. Signal generation is high-frequency, latency-sensitive, and low-stakes per call. Batching saves compute, and its latency cost stays acceptable because signals arrive in waves rather than as a steady stream, so batches fill quickly. The other four workers call hosted vendor APIs (no batching needed; the vendor handles it).
The signal worker's model deploys via canary, with the prior version staying live for two days as the new version's traffic ramps from five percent to fifty percent to one hundred percent. The risk-check worker, which depends on signal-worker output stability, runs an evaluation suite against signals from both versions during the canary period and gates promotion on the suite passing.
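The ramp and its gate can be sketched as a single step function. The 5, 50, and 100 percent stages come from the plan above. Dropping the canary back to zero when the suite fails is an assumption of this sketch; holding at the current stage would be an equally defensible policy.

```python
RAMP_PERCENT = [5, 50, 100]  # stages from the deployment plan


def next_canary_percent(current_percent: int, suite_passed: bool) -> int:
    """Advance one stage if the risk-check evaluation suite passed; otherwise roll back."""
    if not suite_passed:
        return 0  # assumption: a failed suite sends all traffic back to the prior version
    for stage in RAMP_PERCENT:
        if stage > current_percent:
            return stage
    return RAMP_PERCENT[-1]
```

Because the prior version stays live for the whole two-day window, the rollback branch is a traffic change, not a redeploy.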
The cost surface attributes each call to a worker, each worker to a strategy, each strategy to the daily budget. When the signal worker's cost approaches its allocated forty-percent share of fifty dollars (twenty dollars a day), the router throttles its model calls and the orchestrator switches signal generation to a cached-feature fallback. The runaway-loop case becomes a degraded-mode case, not an over-budget case.
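The arithmetic behind the switch, as a sketch. The fifty-dollar budget and the forty-percent share are from the example. Reading "approaches" as ninety percent of the share is an assumption of this sketch, as are the mode names. Amounts are kept in integer cents to avoid floating-point drift.

```python
DAILY_BUDGET_CENTS = 5000   # fifty dollars, from the example
SIGNAL_SHARE_PERCENT = 40   # the signal worker's share, from the example
THROTTLE_AT_PERCENT = 90    # assumption: "approaches" means 90% of the share


def signal_cap_cents() -> int:
    """The signal worker's daily allocation in cents."""
    return DAILY_BUDGET_CENTS * SIGNAL_SHARE_PERCENT // 100


def signal_mode(spent_cents: int) -> str:
    """Pick live inference or the cached-feature fallback from today's spend."""
    threshold = signal_cap_cents() * THROTTLE_AT_PERCENT // 100
    return "cached-feature-fallback" if spent_cents >= threshold else "live-model"
```

Under these numbers the cap is 2,000 cents and the fallback engages at 1,800 cents, leaving headroom for calls already in flight.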
§ V Connection to Prior Lessons
Three threads from earlier α-Cognition lessons land here.
The orchestration lesson described how an orchestrator delegates to workers. The router primitive of this lesson is what the workers call after the orchestrator hands them their sub-task. Orchestration covers the agent-to-agent decisions; routing covers the agent-to-model decisions. They are two layers of the same topology.
The observability lesson named tracing, metrics, and structured logging as the three primitives of seeing the system. The cost surface from this lesson is a fourth signal that joins those three: tokens per request, dollars per minute, budget consumed per agent. The cost dashboard is built on the same telemetry pipeline as the latency dashboard.
The memory-layers lesson distinguished session-learning caches from persistent recall stores. Both layers reduce inference calls, which reduces cost. The cost surface from this lesson is the metric that proves the memory investment paid back, by showing the inference-call reduction in dollars per day rather than in cache-hit-rate percentages alone.
§ VI Connection to Today's Dev Lesson
Today's paired Dev lesson takes Go's worker-pool and semaphore primitives and builds the router gateway this lesson described. The connection is direct: the lesson here names the topology; the Dev lesson names the language patterns that implement it.
Three pieces translate cleanly. The router's concurrent request handling becomes a Go worker pool sized to the GPU's batch capacity. The per-model rate limiting becomes Go semaphores held per-model. The per-agent cost budget becomes a Go channel that throttles or rejects when the budget surface signals exhaustion. The pattern of "wait for a slot before calling the model" becomes idiomatic Go in three small primitives.
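The Dev lesson builds these pieces in Go. Purely to show the shape of the slot-waiting pattern in the language used for the other sketches in this lesson, here is the per-model semaphore in Python, with a hypothetical wrapper class around any model callable:

```python
import threading


class SlotLimitedModel:
    """Wrap a model callable so that at most max_concurrent calls run at once."""

    def __init__(self, call_model, max_concurrent: int):
        self._call_model = call_model
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def __call__(self, prompt):
        # Block until a slot is free, then release it when the call returns or raises.
        with self._slots:
            return self._call_model(prompt)
```

Callers beyond the limit wait rather than fail, which is the behavior a per-model rate limit wants. Rejecting instead of waiting is the budget gate's job.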
Paired Dev lesson → Polyglot-Dev/Go/2026-05-25-gos-worker-pools-and-semaphores-for-inference-cost-aware-model-router-gateways
§ VII Connection to Today's Cert Lesson
Today's cert lesson treats Vertex AI as the unified platform for training pipelines and generative-AI applications. Vertex AI is also a managed implementation of much of what this lesson described: Vertex AI endpoints serve deployed models, can host several model versions behind one endpoint, and split traffic among them by percentage for canary deployments. Batching behavior and per-call metering vary with the model type and serving setup, so check them against the topology you need rather than assuming them. The cert lesson shows the platform; this lesson shows the principles the platform implements.
The relationship matters. An operator who understands the principles can evaluate Vertex AI's defaults against the topology they need, decide which knobs to keep and which to override, and decide when self-hosting on GKE earns its complexity over the managed endpoint. The cert credential confirms platform fluency; the principles confirm the operator can use the platform deliberately rather than ceremonially.
Paired Cert lesson → Archmagus-Stack/09-Tomes/Cert Prep/Google/2026-05-25-vertex-ai-as-unified-platform-training-pipelines-meets-gemini-api-and-rag
§ VIII Closing
Model-serving topology is one of the quietest layers of a multi-agent system. When the four primitives — routing, pooling, versioning, cost surface — are operating correctly, the layer is invisible. When any is operating badly, the failure mode is rarely a clean error. It is a slow drift toward higher latency, higher cost, or unpredictable behavior under load.
The operator who designs the topology before traffic arrives gets a system that behaves well under the first wave of real use. The operator who treats model-serving as an API call gets a system that performs adequately until it does not, and then asks the operator to debug a problem the topology should have prevented.
Five lessons into the α-Cognition arc, the curriculum has now named all the layers a production multi-agent system needs. Orchestration, observability, memory, contextvars, and serving topology together describe a system that runs. The remaining lessons in this arc will turn from the layers themselves to the disciplines that compose them well.
Filed 2026-05-25 Fajr (catch-up) · Inaugural three-lesson Fajr · Pair α (Cognition) + DevOps anchor · Week 2 of Cycle 1
Backward-Synergy-Reach → Multi-Agent Orchestration · Observability for Multi-Agent LLM Systems · Agent Memory Layers in Production