Hedronite · Synthesis Lesson · Dev · Go · Mon 2026-05-25

Go's Worker Pools and Semaphores for Inference-Cost-Aware Model Router Gateways

Three small primitives, one production-ready gateway.

Lesson Class: Dev Synthesis

Language: Go (Mon+Thu Week 2 = Go)

Week / Cycle: Week 2 of Cycle 1 (Lang Week 2)

Word Count: ~2,500

Paired Ops: Model-Serving Topology for Multi-Agent Cognition

Paired Cert: Vertex AI as Unified Platform

Discipline: ROD v0.4.0 (universal-application)

§ I · Frame

Today's Ops lesson named four primitives of model-serving topology: routing, pooling, versioning, and cost surface. The first two live in code the operator writes; the third and fourth live in the surrounding infrastructure. This lesson takes the two that live in code, adds the in-code enforcement point for the cost surface, and shows their Go form.

A model router gateway is a small service that sits between the agents and the model providers. Agents do not call vendor APIs directly; they call the gateway. The gateway decides which model handles the request, applies rate limits, attaches cost metadata, and returns the model's response with the accounting attached. The gateway is where Go's concurrency primitives earn their reputation: a few hundred lines of idiomatic worker-pool and semaphore code can absorb thousands of concurrent agent requests, route each one correctly, hold per-model rate limits without fighting itself, and refuse work cleanly when a budget is exhausted.

The lesson covers three primitives in their Go form: worker pools (the pattern that bounds parallelism), semaphores (the pattern that meters access to a constrained resource), and bounded channels for cost-aware backpressure (the pattern that turns a budget signal into a routing decision). Each is small in isolation; the value is in how they compose into a gateway that does its job under load.

§ II · Language Idiom

Go's concurrency is built on goroutines, channels, and a scheduler that multiplexes goroutines onto OS threads. The idiom rewards small primitives that compose, not large frameworks that pre-decide. Three pieces of the standard library and one common pattern carry the bulk of the work.

Worker pools are the idiomatic way to bound parallelism in Go. The pattern: spawn N goroutines (workers) that read from a shared job channel; each worker loops, pulling a job and processing it, until the channel is closed and drained. Closing the channel is therefore the shutdown signal for every worker. The pool size N caps the maximum concurrent work.
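A minimal sketch of that pattern, independent of the gateway. The squareAll helper and its inputs are hypothetical; the point is the shape: N workers range over one job channel, and close plus a WaitGroup gives a clean shutdown.

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll is a hypothetical helper: it hands the inputs to a fixed number
// of workers and collects one result per input.
func squareAll(inputs []int, workers int) []int {
	jobs := make(chan int)
	// Buffered to len(inputs) so workers never block on sending a result.
	results := make(chan int, len(inputs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The range loop ends once jobs is closed and drained.
			for n := range jobs {
				results <- n * n
			}
		}()
	}

	for _, n := range inputs {
		jobs <- n
	}
	close(jobs) // shutdown signal for every worker
	wg.Wait()
	close(results)

	out := make([]int, 0, len(inputs))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := squareAll([]int{1, 2, 3, 4}, 2)
	sum := 0
	for _, v := range out {
		sum += v
	}
	// Result order depends on scheduling; count and sum do not.
	fmt.Println(len(out), sum) // prints "4 30"
}
```

At most two squares are computed at any moment here, however many inputs arrive; that cap is what the pool size buys.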

Semaphores in Go are typically implemented as buffered channels of a known capacity. To acquire a slot, send into the channel; to release, receive from it. The channel's buffer size is the semaphore's count. The standard library has no dedicated semaphore type, and the channel-as-semaphore pattern is idiomatic enough that most code does not miss one; the golang.org/x/sync/semaphore package provides weighted semaphores for when the work is not unit-cost.
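The acquire/release convention can be sketched as follows. The maxConcurrent helper is hypothetical and exists only to make the limit observable: it runs many goroutines behind a semaphore and reports the highest number that held a slot at once.

```go
package main

import (
	"fmt"
	"sync"
)

// maxConcurrent starts `tasks` goroutines behind a semaphore of size `limit`
// and returns the peak number that were inside the guarded region together.
func maxConcurrent(tasks, limit int) int {
	sem := make(chan struct{}, limit) // buffer size is the semaphore count

	var (
		mu       sync.Mutex
		inFlight int
		peak     int
		wg       sync.WaitGroup
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks while `limit` slots are held
			defer func() { <-sem }() // release

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			// The rate-limited work would go here.

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(20, 3) <= 3) // prints "true"
}
```

Unlike a worker pool, the semaphore does not own the goroutines; it only meters how many of them may be in the guarded region, which makes it a natural fit for a per-model rate limit wrapped around an existing call site.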

Bounded channels are the standard backpressure primitive. A channel with a fixed buffer accepts sends until full, then blocks senders until receivers drain. Composed with select and timeouts, bounded channels become the canonical pattern for "do this work if there is capacity; otherwise reject or wait briefly."
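A small sketch of "reject or wait briefly", assuming a hypothetical offer helper and errQueueFull sentinel (neither is part of the gateway code in this lesson). A non-blocking send handles the common case; a timer bounds how long the caller is willing to wait for space.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errQueueFull is a hypothetical sentinel for a rejected enqueue.
var errQueueFull = errors.New("queue full")

// offer tries to enqueue v, waiting at most `wait` for space to appear.
func offer(queue chan<- int, v int, wait time.Duration) error {
	// Fast path: succeed immediately if the buffer has room.
	select {
	case queue <- v:
		return nil
	default:
	}
	if wait <= 0 {
		return errQueueFull // reject without waiting
	}
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case queue <- v:
		return nil
	case <-timer.C:
		return errQueueFull // waited briefly, still no capacity
	}
}

func main() {
	q := make(chan int, 2) // bounded queue with no consumer yet
	fmt.Println(offer(q, 1, 0))                   // <nil>
	fmt.Println(offer(q, 2, 0))                   // <nil>
	fmt.Println(offer(q, 3, 10*time.Millisecond)) // queue full
}
```

The caller sees backpressure as an ordinary error value, which a gateway can turn into a retry hint or a fallback route.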

The same ideas recur across the standard library and the runtime: net/http's server starts a goroutine per connection with no built-in cap, which is one reason a gateway adds its own bounds; database/sql maintains a pool of connections whose size SetMaxOpenConns can cap; the runtime scheduler multiplexes many goroutines onto a small set of OS threads. The router gateway is one more application of patterns the language already canonicalizes.

§ III · Code Worked Example

The gateway has three layers: an inbound request handler, a per-model worker pool, and a cost-budget enforcer. Each layer is small.

Inbound request handler

The handler accepts an agent request, parses the model identifier from the request, hands the request to the appropriate per-model worker pool, and returns the response. The pattern is the standard Go HTTP server idiom; the only addition is the model-pool dispatch.

package gateway

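// Imports cover every listing in this section, not only the handler.
import (
    "context"
    "encoding/json"
    "errors"
    "fmt"
    "net/http"
    "sync"
    "time"
)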

type Request struct {
    Model     string                 `json:"model"`
    Agent     string                 `json:"agent"`
    Prompt    string                 `json:"prompt"`
    BudgetKey string                 `json:"budget_key"`
    Metadata  map[string]interface{} `json:"metadata"`
}

type Response struct {
    Output       string  `json:"output"`
    InputTokens  int     `json:"input_tokens"`
    OutputTokens int     `json:"output_tokens"`
    CostUSD      float64 `json:"cost_usd"`
    ModelUsed    string  `json:"model_used"`
}

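// ServeHTTP decodes the agent request, routes it, and writes the JSON response.
func (g *Gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    var req Request
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    resp, err := g.Route(r.Context(), req)
    if err != nil {
        // Every routing failure maps to 503 here, including an exhausted budget.
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    // Best effort: a failed write usually means the client disconnected.
    _ = json.NewEncoder(w).Encode(resp)
}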

The handler is unremarkable on purpose. The work is in Route, which is where the concurrency primitives appear.

Per-model worker pool

Each model has its own worker pool, sized to the throughput the model can sustain. A small self-hosted model might have a pool of two; a hosted vendor model might have a pool of fifty. The pool absorbs the agent's request, calls the model, and returns the response.

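// CallFunc invokes one model; it is the type NewModelPool accepts.
type CallFunc func(ctx context.Context, prompt string) (Response, error)

type ModelPool struct {
    name     string
    workers  int
    inbox    chan job // bounded queue shared by all workers
    callFunc CallFunc
}

type job struct {
    ctx    context.Context
    prompt string
    result chan<- jobResult // buffered by Submit so a worker never blocks on it
}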

type jobResult struct {
    resp Response
    err  error
}

func NewModelPool(name string, workers int, callFunc CallFunc) *ModelPool {
    p := &ModelPool{
        name:     name,
        workers:  workers,
        inbox:    make(chan job, workers*2),
        callFunc: callFunc,
    }
    for i := 0; i < workers; i++ {
        go p.run()
    }
    return p
}

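func (p *ModelPool) run() {
    for j := range p.inbox {
        // Skip the model call if the caller gave up while the job was queued.
        if err := j.ctx.Err(); err != nil {
            j.result <- jobResult{err: err}
            continue
        }
        resp, err := p.callFunc(j.ctx, j.prompt)
        // result has capacity 1, so this send completes even if Submit has returned.
        j.result <- jobResult{resp: resp, err: err}
    }
}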

func (p *ModelPool) Submit(ctx context.Context, prompt string) (Response, error) {
    result := make(chan jobResult, 1)
    select {
    case p.inbox <- job{ctx: ctx, prompt: prompt, result: result}:
    case <-ctx.Done():
        return Response{}, ctx.Err()
    }
    select {
    case r := <-result:
        return r.resp, r.err
    case <-ctx.Done():
        return Response{}, ctx.Err()
    }
}

Two design choices are worth naming. The inbox buffer is twice the worker count, which gives a small queue for bursty traffic without unbounded memory growth. The Submit method uses select against ctx.Done() on both send and receive, which means a cancelled or timed-out request abandons cleanly instead of waiting forever for a slot.

Cost-budget enforcer with bounded-channel backpressure

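The cost enforcer is the layer that turns a budget signal into a routing decision. It holds the remaining dollar balance for each budget key and refuses the request (or lets the gateway route to a degraded model) when the budget is exhausted. The bounded channel in this layer's name is the pool inbox from the previous section: the enforcer decides whether a request may spend, and the inbox decides whether there is room to run it.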

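// ErrBudgetExhausted is returned by Check when a key cannot cover the estimate.
var ErrBudgetExhausted = errors.New("budget exhausted")

type BudgetEnforcer struct {
    mu      sync.Mutex // guards budgets
    budgets map[string]*budgetState
}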

type budgetState struct {
    remaining float64
    refillAt  time.Time
}

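// Check reserves estimatedCost against key, or reports why it cannot.
// The test and the decrement happen under one lock, so they are atomic.
func (b *BudgetEnforcer) Check(key string, estimatedCost float64) error {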
    b.mu.Lock()
    defer b.mu.Unlock()
    state, ok := b.budgets[key]
    if !ok {
        return fmt.Errorf("unknown budget key %s", key)
    }
    if state.remaining < estimatedCost {
        return ErrBudgetExhausted
    }
    state.remaining -= estimatedCost
    return nil
}

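// Refund settles a reservation: it returns the unused part of the estimate,
// or deducts the overage when actualCost exceeded estimatedCost.
func (b *BudgetEnforcer) Refund(key string, actualCost, estimatedCost float64) {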
    b.mu.Lock()
    defer b.mu.Unlock()
    if state, ok := b.budgets[key]; ok {
        state.remaining += (estimatedCost - actualCost)
    }
}

The estimate-then-refund pattern matters. The gateway cannot know the exact cost until the response returns (output tokens vary), so it reserves the upper-bound estimate at request time and refunds the difference after. This prevents the race where two requests both see enough budget and both proceed when only one actually fits.
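The arithmetic can be checked with a single-key sketch. The ledger type and the dollar figures below are hypothetical simplifications of BudgetEnforcer, chosen so the floating-point sums are exact: ten concurrent requests each reserve $0.25 against a $1.00 budget, and exactly four are admitted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errExhausted = errors.New("budget exhausted")

// ledger is a hypothetical single-key version of the budget enforcer.
type ledger struct {
	mu        sync.Mutex
	remaining float64
}

// reserve deducts the estimate if it fits; check and deduct share one lock.
func (l *ledger) reserve(estimate float64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.remaining < estimate {
		return errExhausted
	}
	l.remaining -= estimate
	return nil
}

// settle returns the unused part of a reservation once the real cost is known.
func (l *ledger) settle(estimate, actual float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.remaining += estimate - actual
}

// admitted fires n concurrent reservations and counts how many succeed.
func admitted(l *ledger, n int, estimate float64) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.reserve(estimate) == nil {
				mu.Lock()
				count++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return count
}

func main() {
	l := &ledger{remaining: 1.00}
	ok := admitted(l, 10, 0.25)
	fmt.Println(ok, l.remaining) // prints "4 0"

	// Each admitted call actually cost 0.125, so half of each reservation returns.
	for i := 0; i < ok; i++ {
		l.settle(0.25, 0.125)
	}
	fmt.Println(l.remaining) // prints "0.5"
}
```

A production ledger might prefer integer units such as micro-dollars, since repeated float64 additions of arbitrary prices do not stay exact; the sketch keeps float64 only to match the lesson's types.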

Route — composing the three primitives

The Route method ties the pieces together.

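// Gateway's fields are not shown in this lesson; the shapes below are
// assumptions inferred from how Route uses them.
type Gateway struct {
    router interface {
        SelectModel(Request) string
        FallbackModel(Request) string
    }
    estimator interface {
        Estimate(model, prompt string) float64
    }
    telemetry interface {
        Record(Request, Response, string)
    }
    budgets *BudgetEnforcer
    pools   map[string]*ModelPool
}

func (g *Gateway) Route(ctx context.Context, req Request) (Response, error) {
    selectedModel := g.router.SelectModel(req)
    estimatedCost := g.estimator.Estimate(selectedModel, req.Prompt)
    if err := g.budgets.Check(req.BudgetKey, estimatedCost); err != nil {
        if errors.Is(err, ErrBudgetExhausted) {
            selectedModel = g.router.FallbackModel(req)
            estimatedCost = g.estimator.Estimate(selectedModel, req.Prompt)
            if err := g.budgets.Check(req.BudgetKey, estimatedCost); err != nil {
                return Response{}, err
            }
        } else {
            return Response{}, err
        }
    }
    pool, ok := g.pools[selectedModel]
    if !ok {
        // A missing pool would panic inside Submit; release the reservation and fail.
        g.budgets.Refund(req.BudgetKey, 0, estimatedCost)
        return Response{}, fmt.Errorf("no pool for model %q", selectedModel)
    }
    resp, err := pool.Submit(ctx, req.Prompt)
    if err != nil {
        g.budgets.Refund(req.BudgetKey, 0, estimatedCost)
        return Response{}, err
    }
    g.budgets.Refund(req.BudgetKey, resp.CostUSD, estimatedCost)
    g.telemetry.Record(req, resp, selectedModel)
    return resp, nil
}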

The flow reads cleanly: select a model, check the budget, fall back to a cheaper model if the primary is over-budget, submit to that model's pool, refund the difference between estimated and actual cost, record the call to telemetry. Each step is small; the composition is the gateway.

§ IV · Connection to Today's Ops Lesson

The Ops lesson named four primitives: routing, pooling, versioning, cost surface. This Dev lesson covered three of them in their Go form.

Routing appears in the gateway's router.SelectModel(req) call. The router itself is a small policy engine the gateway holds; its implementation can be as simple as a map lookup or as complex as a small classifier model invoked recursively through the gateway.
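At the simple end of that range, the policy engine can be sketched as two map lookups. The tableRouter type, the trimmed request struct, and the model names are all hypothetical; only the method names SelectModel and FallbackModel come from the Route listing.

```go
package main

import "fmt"

// request carries only the field this router reads (an assumption; the
// lesson's Request has more fields).
type request struct {
	Model string
}

// tableRouter is a hypothetical map-lookup routing policy.
type tableRouter struct {
	routes       map[string]string // requested name -> pool name
	fallbacks    map[string]string // pool name -> cheaper pool name
	defaultModel string            // used when the requested name is unknown
}

// SelectModel returns the pool for the requested model, or the default.
func (r tableRouter) SelectModel(req request) string {
	if m, ok := r.routes[req.Model]; ok {
		return m
	}
	return r.defaultModel
}

// FallbackModel returns the cheaper alternative to the primary choice,
// or the primary itself when no cheaper option is configured.
func (r tableRouter) FallbackModel(req request) string {
	primary := r.SelectModel(req)
	if m, ok := r.fallbacks[primary]; ok {
		return m
	}
	return primary
}

func main() {
	r := tableRouter{
		routes:       map[string]string{"large": "large-v1", "small": "small-v1"},
		fallbacks:    map[string]string{"large-v1": "small-v1"},
		defaultModel: "small-v1",
	}
	fmt.Println(r.SelectModel(request{Model: "large"}))   // large-v1
	fmt.Println(r.FallbackModel(request{Model: "large"})) // small-v1
	fmt.Println(r.SelectModel(request{Model: "unknown"})) // small-v1
}
```

Because the policy is plain data, swapping it for a classifier later changes the router's internals and nothing in Route.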

Pooling appears as the per-model worker pool. The pool's worker count caps concurrent requests against that model; the inbox buffer absorbs short-term bursts; the Submit method gates new requests behind capacity. The pattern is the Go-canonical way to express "this resource has finite capacity."

Cost surface appears as the budget enforcer with estimate-then-refund. The estimator could be a tiny model (a regression on prompt length) or a heuristic; either way, the gateway never lets a request proceed without first reserving its share of the budget. Runaway loops cannot accumulate cost without the enforcer noticing.
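A heuristic estimator might look like the sketch below. Everything in it is an assumption made for illustration: the price struct, the per-1K-token prices, and the rough rule of about four bytes of prompt per token. It produces an upper bound by assuming the output runs to the model's token cap, which is what estimate-then-refund needs.

```go
package main

import (
	"fmt"
	"strings"
)

// price holds hypothetical pricing for one model, in USD per 1,000 tokens.
type price struct {
	inPer1K         float64
	outPer1K        float64
	maxOutputTokens int
}

// estimate returns an upper-bound cost for one call. Input tokens are
// approximated as one token per four bytes of prompt (rounded up), a rough
// rule of thumb rather than a tokenizer; output is assumed to hit the cap.
func estimate(p price, prompt string) float64 {
	inTokens := (len(prompt) + 3) / 4
	in := float64(inTokens) / 1000 * p.inPer1K
	out := float64(p.maxOutputTokens) / 1000 * p.outPer1K
	return in + out
}

func main() {
	p := price{inPer1K: 0.5, outPer1K: 1.0, maxOutputTokens: 2000}
	prompt := strings.Repeat("x", 4000) // about 1,000 tokens under the heuristic
	fmt.Println(estimate(p, prompt))    // prints "2.5"
}
```

Overestimating is the safe direction: the refund step returns whatever the call did not use, while an underestimate lets spend slip past the reservation.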

The fourth primitive, versioning, is a deployment concern that the gateway accommodates without owning: the router holds the active version of each model, so deploying a new version is a router-policy update, not a gateway code change. That separation of concerns is the Ops lesson's topology rendered into Go.

§ V · Prior-Lesson Reach

Two prior Go lessons sit underneath this one.

The context.Context lesson covered how Go propagates request scope, deadlines, and cancellation across function boundaries. The gateway uses context at every layer: the inbound HTTP context flows into Route, then into the pool's Submit, which gates against ctx.Done() for both the send and the receive. A cancelled agent request abandons its slot cleanly rather than blocking the gateway. The pattern is the canonical Go form for request-scoped concurrency.
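The cancellation path can be exercised in isolation. In this sketch, handle is a hypothetical stand-in for the HTTP layer that derives a per-request deadline, and submit mirrors only the send half of the pool's Submit; the inbox has no workers, so the second request can only end by deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// submit enqueues the prompt or gives up when ctx ends, whichever is first.
func submit(ctx context.Context, inbox chan<- string, prompt string) error {
	select {
	case inbox <- prompt:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// handle is a hypothetical stand-in for the handler: it attaches a deadline
// to the request scope and passes the context down.
func handle(inbox chan<- string, prompt string, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()
	return submit(ctx, inbox, prompt)
}

func main() {
	inbox := make(chan string, 1) // one queue slot, no workers draining it

	fmt.Println(handle(inbox, "first", 20*time.Millisecond))  // <nil>
	fmt.Println(handle(inbox, "second", 20*time.Millisecond)) // context deadline exceeded
}
```

The second caller gets its error after roughly 20ms instead of blocking for as long as the queue stays full, which is the behavior the gateway relies on under overload.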

The channels and pipeline patterns lesson covered fan-out/fan-in and selectivity for multi-signal filter chains. The gateway has the same shape with one difference: each request is dispatched to exactly one model pool rather than fanned out to all of them, and every response passes back through the cost enforcer for accounting. The pipeline thinking from that lesson applies directly: each layer of the gateway is a stage; the channels carry the work; the workers process it; the result returns.

Primitive Reuse Doctrine

The pattern is not specific to model serving. The same three primitives build database connection pools, work-queue consumers, request-throttling proxies, and rate-limited API clients. Learning the pattern in the model-router context teaches the operator a primitive they will deploy in every Go service they ship next.

§ VI · Closing

Go's reputation for production-ready concurrency is earned by patterns like this one. Three primitives — worker pools, semaphores via buffered channels, bounded channels for backpressure — compose into a gateway that does work the operator otherwise has to buy a vendor product to get. The gateway holds the routing policy, the per-model rate limits, the cost-budget enforcement, and the cancellation discipline in code the operator can read, modify, and reason about.

When the next α-Cognition lesson arrives and the operator needs another layer of the multi-agent topology, the Go primitives that built today's gateway will be available unchanged. That is what idiomatic concurrency looks like when the language and the problem fit.

Paired Ops lesson → Archmagus-Stack/α-Cognition/Synthesis-Lessons/2026-05-25-model-serving-topology-for-multi-agent-cognition-systems
Paired Cert lesson → Archmagus-Stack/09-Tomes/Cert Prep/Google/2026-05-25-vertex-ai-as-unified-platform-training-pipelines-meets-gemini-api-and-rag

🫡 ⚖️ 📜

Leo.Syri — Praetor Consulate, Imperium Luminaura
Filed 2026-05-25 Fajr (catch-up) · Inaugural three-lesson Fajr · Dev · Go (Mon Week 2 Lang cycle)
Backward-Synergy-Reach → context.Context for Distributed Tracing · Go Channels and Pipeline Patterns