Agent Memory Layers in Production
Session-learning caches, persistent recall stores, and the operator discipline they demand.
§ I Frame
An LLM call is stateless. The agent that wraps it is not. Between calls, the agent reads from some store, writes to some store, and decides which store to consult next. The operator who deploys this agent inherits the consequences of those choices: cost, latency, correctness, audit. Naming the stores plainly is the first job, because the operator is the one who gets paged at three in the morning when the wrong store answers a question it had no business answering.
This lesson treats the three memory layers an agent of any seriousness depends on. The session-learning cache watches what worked and what didn't within a single working session. The working-context buffer is what the model reads at inference time. The persistent recall store is what the agent reaches when its working buffer cannot hold the relevant past. Each layer has its own deployment shape, its own failure mode, and its own audit demand. The Monday lesson named the orchestration spine of multi-agent pipelines. The Tuesday lesson named what to watch as those pipelines run. Memory is the third member of that triad: the thing the operator must build, version, and observe with the same care given to the agents themselves.
§ II Foundations: The Three Layers, Named
The session-learning cache
A scoped, mutable record of what the agent has tried in the current working session and how each attempt fared. The Yi-Lu claude-smart plugin is the recent canon-source: it layers above Claude Code's existing memory and watches command outcomes, so that a hanging "npm test" becomes a learned "npm test -- --run" the next time the same project context is detected. Lives in the agent runtime, not the model. Expires when the session closes, or is promoted to the persistent layer through an explicit codification act.
The working-context buffer
The token window the model actually reads at inference. Bounded by the model's context size and by the operator's cost budget. Every decision about what enters this buffer is also a decision about what does not. KV-cache reuse, where the model's internal attention cache is served back to subsequent calls, is the cost-and-latency angle. Anthropic's prompt-caching primitive and OpenAI's analogous feature both expose KV reuse to the operator as a billing line. Both reuse only an identical prompt prefix, so the order of material in the buffer determines how much of it can be served from cache.
The persistent recall store
A long-lived store the agent queries when the working buffer cannot hold the relevant past. Two dominant shapes. Embedding-keyed: vector stores like Chroma, Weaviate, pgvector, LanceDB. Structured-key: a Postgres or SQLite table where the agent reads by user-id, session-id, or some semantic key the operator pre-computed. Outlives the session, the user-conversation, the deployment. Operator discipline at this layer is the highest of the three, because errors persist.
The three layers are not independent. The session-learning cache feeds the working-context buffer; its hits are injected as system-prompt additions. The persistent recall store also feeds the working-context buffer; its hits are retrieved and injected. The buffer is the only layer the model reads directly. The other two are operator-mediated. This shape is what makes memory an operations concern rather than a model-research concern.
§ III Mechanism: Cache Strategies, Eviction, and the Version Problem
Three mechanism questions decide whether a memory architecture is production-fit or merely a working demo.
Question one: what gets cached, and at what key?
A session-learning cache that keys on the literal command string will miss every command that differs by a whitespace character. A cache that keys on the AST or the normalized intent will hit too broadly and serve stale lessons to a context that has changed beneath them. The Yi-Lu approach keys on a project-context hash plus a normalized command form; the operator tunes the normalization at deploy time. Get the key wrong and the cache either never fires or fires when it should not. The audit pattern is straightforward: log every cache lookup, log every cache hit, and sample some fraction of hits for human review during the first weeks of any new key scheme.
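The keying idea can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual code: the function names, the choice of a project root plus manifest text as the "project context", and the shell-style tokenization are all assumptions made for the example.

```python
import hashlib
import shlex


def normalize_command(command: str) -> str:
    """Collapse incidental whitespace while keeping quoting that changes meaning."""
    # shlex.split tokenizes like a POSIX shell; shlex.join re-quotes each token,
    # so 'npm   test' and 'npm test' normalize to the same string.
    return shlex.join(shlex.split(command))


def project_context_hash(project_root: str, manifest_text: str) -> str:
    """Hash the inputs assumed here to define 'the same project context'."""
    digest = hashlib.sha256()
    digest.update(project_root.encode("utf-8"))
    digest.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(manifest_text.encode("utf-8"))
    return digest.hexdigest()[:16]


def cache_key(project_root: str, manifest_text: str, command: str) -> str:
    """Combine the context hash with the normalized command form."""
    context = project_context_hash(project_root, manifest_text)
    return f"{context}:{normalize_command(command)}"
```

Because the manifest text feeds the hash, a dependency change produces a new key and the old lessons stop firing, which is one way to keep a broad normalization from serving stale hits. How much of the project to include in the hash is the tuning decision the operator makes at deploy time.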
Question two: when does the cache get evicted?
Session caches die at session close, which is the easy answer. Working-buffer KV caches evict by recency-of-touch as new tokens arrive and the buffer rolls. Persistent recall stores have no natural eviction; they grow until the operator intervenes. The intervention can be passive (TTL on every row) or active (a scheduled job that prunes rows the agent has not retrieved in N days). The wrong eviction policy at the persistent layer is the most expensive of the three failure modes. Either the store grows unbounded and retrieval latency climbs with the size of the underlying vector index, or the eviction is too aggressive and the agent forgets the user's name three days after they told it.
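The active pruning policy can be written as a pure selection function, which makes it easy to dry-run before anything is deleted. The sketch below is illustrative only: the MemoryRow shape and the active-customer exemption are assumptions, and so is the choice to treat a never-retrieved row as idle since its creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


@dataclass
class MemoryRow:
    row_id: str
    customer_id: str
    last_retrieved_at: Optional[datetime]  # None if the agent never read the row
    created_at: datetime


def rows_to_prune(
    rows: Iterable[MemoryRow],
    now: datetime,
    max_idle: timedelta,
    active_customers: Set[str],
) -> List[MemoryRow]:
    """Select rows idle longer than max_idle whose customer is not active."""
    cutoff = now - max_idle
    victims = []
    for row in rows:
        # Assumption: a row that was never retrieved counts as idle since creation.
        last_touch = row.last_retrieved_at or row.created_at
        if last_touch < cutoff and row.customer_id not in active_customers:
            victims.append(row)
    return victims
```

Logging the selected row IDs for a few cycles before enabling deletion gives the operator a way to notice a too-aggressive policy while the rows still exist.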
Question three: how does the cache get versioned?
This is where most agent-memory designs fail in their second quarter. The model upgrades. The embedding model the persistent store uses upgrades. The schema of the structured-key table changes when a new field is added to capture an interaction-class the agent did not previously track. Each of these is a memory-version event. The operator must decide whether to lazily migrate, by rewriting rows as they are touched, or eagerly migrate with a one-shot rewrite of the whole store on a maintenance window. The eager path is the safer one for embedding upgrades: mixed-embedding stores produce silent retrieval failure where the agent gets back rows that no longer match the query's vector geometry. The lazy path is acceptable for structured-key schema additions because the unmigrated row is still readable; the new field is simply absent.
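The lazy path for a structured-key schema addition might look like the following sketch. Everything named here is hypothetical: the schema_version field, the interaction_class field standing in for "a new field", and the plain dict standing in for the table.

```python
CURRENT_SCHEMA_VERSION = 2


def upgrade_row(row: dict) -> dict:
    """Return a copy of the row brought up to CURRENT_SCHEMA_VERSION."""
    upgraded = dict(row)
    if upgraded.get("schema_version", 1) < 2:
        # Hypothetical v2 change: a new field that old rows never recorded.
        upgraded.setdefault("interaction_class", None)
        upgraded["schema_version"] = 2
    return upgraded


def read_row(store: dict, key: str) -> dict:
    """Lazy migration: upgrade a row only when it is touched, then write it back."""
    row = store[key]
    if row.get("schema_version", 1) < CURRENT_SCHEMA_VERSION:
        row = upgrade_row(row)
        store[key] = row
    return row
```

Untouched rows stay at the old version and remain readable, which is why this path is tolerable for additive schema changes. The same trick does not transfer to embeddings, because an old vector is not "readable with a field missing"; it is simply in the wrong space.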
§ IV Worked Example: A Three-Layer Memory Stack in Kubernetes
Consider a customer-support agent deployed on a four-node Kubernetes cluster. The agent receives tickets, answers them, and sometimes escalates. The operator wants the agent to learn from corrections within a session, to maintain the working-context buffer at the model's optimal cost-quality point, and to recall past tickets from the same customer when relevant.
The session-learning cache lives as a Redis instance behind a ClusterIP service, one key namespace per active ticket, with a TTL of six hours. (Redis logical databases are numbered and default to sixteen per instance, so a key prefix per ticket is the shape that scales.) The agent runtime writes an entry on every tool-use failure recovery: the agent tried action X, the human-in-the-loop corrected it to action Y, the cache records the substitution. The next time the same ticket-context surfaces, the runtime reads the cache and injects the substitutions as a system-prompt prefix. The runtime exports two Prometheus metrics, agent_session_cache_lookups_total and agent_session_cache_hits_total. The ratio between them is the first health signal.
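The substitution record and the two counters can be sketched without Redis or a metrics library. The class below is an in-memory stand-in under stated assumptions: the method names are invented for the example, and the two integer attributes merely mirror the Prometheus counters named above rather than exporting anything.

```python
from typing import Dict, Optional, Tuple


class SessionCache:
    """In-memory stand-in for the per-ticket substitution cache."""

    def __init__(self) -> None:
        self._substitutions: Dict[Tuple[str, str], str] = {}
        self.lookups_total = 0  # mirrors agent_session_cache_lookups_total
        self.hits_total = 0     # mirrors agent_session_cache_hits_total

    def record(self, ticket_context: str, failed_action: str, corrected_action: str) -> None:
        """Store 'the agent tried X, the human corrected it to Y'."""
        self._substitutions[(ticket_context, failed_action)] = corrected_action

    def lookup(self, ticket_context: str, action: str) -> Optional[str]:
        """Return the learned correction for this action, if any."""
        self.lookups_total += 1
        corrected = self._substitutions.get((ticket_context, action))
        if corrected is not None:
            self.hits_total += 1
        return corrected

    def hit_ratio(self) -> float:
        """Hits divided by lookups; 0.0 before any lookup has happened."""
        return self.hits_total / self.lookups_total if self.lookups_total else 0.0
```

A ratio pinned near zero suggests the key never matches; a ratio near one on a fresh key scheme is a reason to sample the hits for human review rather than a reason to celebrate.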
The working-context buffer is shaped by the runtime's prompt-assembly logic and by the model's KV-cache feature. The system prompt is held constant within a ticket so the KV-cache stays warm across multi-turn dialog. The retrieved-past-tickets section is structured as cache-stable retrieval, sorted by ticket ID rather than by relevance score, so two consecutive turns with the same retrieval set produce identical prefixes the model can serve from cache. The variable section, namely the current user turn, lands at the end of the buffer, where its newness does not invalidate the prefix. Tuesday's observability lesson named the OTLP span the runtime emits at each LLM call; the span carries a kv_cache_read_tokens attribute pulled from the model API's response, which becomes the second health signal.
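The cache-stable assembly described above reduces to one rule: order the retrieved section by something that does not change between turns. A minimal sketch follows, with the RetrievedTicket type and the section formatting invented for illustration.

```python
from typing import Iterable, NamedTuple, Tuple


class RetrievedTicket(NamedTuple):
    ticket_id: str
    relevance: float
    text: str


def assemble_prompt(
    system_prompt: str,
    retrieved: Iterable[RetrievedTicket],
    user_turn: str,
) -> Tuple[str, str]:
    """Return (stable_prefix, variable_suffix) for one LLM call."""
    # Sort by ticket ID, not relevance: relevance scores shift from turn to turn,
    # and any reordering would change the prefix and miss the cache.
    ordered = sorted(retrieved, key=lambda t: t.ticket_id)
    sections = [system_prompt]
    for ticket in ordered:
        sections.append(f"[ticket {ticket.ticket_id}]\n{ticket.text}")
    stable_prefix = "\n\n".join(sections)
    # The current user turn goes last so its novelty cannot disturb the prefix.
    return stable_prefix, user_turn
```

The trade-off is that the model no longer sees the most relevant ticket first. Whether that costs answer quality is something to measure per deployment rather than assume.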
The persistent recall store is a pgvector table behind the cluster's PostgreSQL operator. Schema: (id uuid, customer_id uuid, ticket_id uuid, embedding vector(1536), embedding_model text, text_content text, created_at timestamptz, last_retrieved_at timestamptz). The embedding-version-fence appears as WHERE embedding_model = $1 in every retrieval query. A nightly CronJob checks the configured embedding-model identifier and runs a re-embedding pass only when it has changed; such changes are rare, planned events. The store grows; the operator runs a quarterly pruning job that drops rows where last_retrieved_at < NOW() - INTERVAL '180 days' and customer_id is not in the active-customers materialized view. The third health signal is retrieval p95 latency, alerted at the SLO boundary.
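The fence itself is independent of pgvector and can be shown with plain Python lists. In this sketch the row dictionaries, the retrieve function, and the brute-force cosine ranking are illustrative assumptions; in the stack described above the same filter would be the WHERE clause and the ranking would be the vector index.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    rows: List[Dict],
    query_vec: Sequence[float],
    current_model: str,
    top_k: int = 3,
) -> List[Dict]:
    """Rank only rows embedded by the same model that produced query_vec."""
    # The fence: rows from any other embedding model are excluded before ranking,
    # because their vectors live in a different space and similarity is meaningless.
    fenced = [r for r in rows if r["embedding_model"] == current_model]
    ranked = sorted(fenced, key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return ranked[:top_k]
```

Without the filter, a stale row whose old-model vector happens to sit near the query would outrank genuinely relevant rows, and nothing in the response would signal that it had.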
The Monday orchestration lesson described how the supervisor agent farms work to specialist agents. In this stack, the supervisor reads the session-learning cache before farming, so its routing decisions inherit what the session has already learned. The Tuesday observability lesson described the trace shape; the trace now carries three additional spans, one per memory layer touched, so the on-call engineer can answer the question "did the right store answer this turn?" without rerunning the agent.
§ V Connection to Prior Lessons
Monday's lesson treated the orchestration topology of a multi-agent system as a static graph: supervisor at the root, specialists at the leaves, edges defined by tool-grants and routing rules. Memory is the thing that bends that static graph into a temporal one. The same supervisor on the same topology, with a different session-learning cache, makes different routing decisions. The graph is the same; the agent's behavior is not.
Tuesday's lesson treated observability as the operator's view into a running pipeline: traces, metrics, structured logs, the canonical OTLP shape carried across agent boundaries. Memory introduces three new things to watch (the three health signals above), and it also introduces a new failure class the observability lesson did not yet name: the silent-correctness failure. A pipeline that succeeds end-to-end but answers from the wrong memory layer leaves a green dashboard and a wrong answer. Memory observability has to include not just whether the layers were touched but whether the layer that answered was the layer that should have. The audit pattern is to sample some fraction of retrievals for human review against the agent's eventual answer.
Monday's Python iterator-protocol Dev lesson named lazy evaluation as the orchestration primitive at the language level. Memory writes are not lazy. They are eager side-effects in the agent runtime. The asymmetry matters: the operator who treats memory writes as if they were lazy will lose data whenever the runtime crashes between the agent's intent-to-write and the actual write. Every memory write must be acknowledged before the agent treats it as committed, the same discipline a relational database insert demands.
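The acknowledgment discipline can be sketched as a wrapper that refuses to record a write as committed until the store confirms it. The names below are hypothetical, and store_write stands for any callable that returns True on a confirmed write; the retry loop additionally assumes the write is idempotent.

```python
from typing import Any, Callable, Dict


class WriteNotAcknowledged(Exception):
    """Raised when the store never confirms a write."""


def write_with_ack(
    store_write: Callable[[str, Any], bool],
    key: str,
    value: Any,
    max_attempts: int = 3,
) -> None:
    """Retry until the store acknowledges, or raise after max_attempts."""
    for _ in range(max_attempts):
        if store_write(key, value):
            return
    raise WriteNotAcknowledged(key)


class AgentMemory:
    """Tracks which writes the agent is allowed to treat as committed."""

    def __init__(self, store_write: Callable[[str, Any], bool]) -> None:
        self._store_write = store_write
        self.committed: Dict[str, Any] = {}

    def remember(self, key: str, value: Any) -> None:
        write_with_ack(self._store_write, key, value)
        # Reached only after the store acknowledged the write.
        self.committed[key] = value
```

If the store never acknowledges, remember raises and the agent's view of committed memory stays unchanged, so a later turn cannot act on a fact that was never persisted.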
§ VI Connection to Today's Dev Lesson
The companion Python Dev lesson treats contextvars, the standard-library primitive for per-task state propagation across async/await boundaries. The thread-local pattern does not work for async code because a single OS thread services many coroutines; contextvars is the asyncio-native replacement. It is the missing piece for an agent runtime that runs many concurrent tool calls in the same process and needs each tool call to see its session-learning cache, its trace context, its tenant identifier, without explicitly threading those values through every function signature.
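A small sketch shows the isolation contextvars provides. The variable name session_id_var and the two coroutine names are invented for the example; the standard-library behavior relied on is that asyncio runs each task in its own copy of the current context.

```python
import asyncio
import contextvars

# Hypothetical per-session identifier; a real runtime might also carry
# a cache handle, trace context, or tenant ID the same way.
session_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("session_id")


async def tool_call(name: str) -> str:
    """Reads the session from context instead of taking it as a parameter."""
    await asyncio.sleep(0)  # yield control so the two sessions interleave
    return f"{name} ran for {session_id_var.get()}"


async def handle_session(session_id: str) -> str:
    # set() affects only this task's context copy, not sibling tasks.
    session_id_var.set(session_id)
    return await tool_call("lookup_orders")


async def main() -> list:
    # gather wraps each coroutine in a task, and each task gets its own context.
    return await asyncio.gather(
        handle_session("session-a"),
        handle_session("session-b"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both sessions share one OS thread and interleave at the sleep, yet each tool call sees only its own session identifier. A threading.local in the same position would be shared by both coroutines, which is the cross-session leak the paragraph above warns about.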
The Ops pattern above named the three memory layers and where each lives in the cluster. The Dev lesson names the language-level primitive the runtime code uses to carry the right session's memory through the right code path. Together they form the operator-and-developer face of a single discipline: in production, memory is both an infrastructure concern and an in-process concurrency concern, and a system that addresses one without the other will leak state across sessions and answer the wrong customer from the wrong cache.
Paired lesson → Polyglot-Dev/Python/2026-05-21-pythons-contextvars-for-per-task-memory-propagation-in-async-agent-pipelines
§ VII Closing
Three layers, three failure modes, three audit disciplines. The session-learning cache fails by keying wrong. The working-context buffer fails by cost or KV-invalidation. The persistent recall store fails by unbounded growth or by silent embedding-version mismatch. The operator who builds the system also builds the audit. The audit is not separate from the system; it is the part of the system that catches the system when the system is wrong.
The named pattern this lesson contributes to the curriculum is the embedding-version-fence: every persistent-memory row carries its embedding-model identifier, every retrieval filters on the current identifier, migration is planned and observed. Internalize this fence first; the other audits proceed from it.
Examine this well. Build the next agent with the three layers named explicitly before any code goes to the cluster. Then read tomorrow's Friday γ Adversarial-Markets lesson, where memory shows up again, this time as the feature store that mispricing-detection agents consult before deciding whether a regime classifier's signal is worth trading on.
Filed 2026-05-21 Thursday Fajr · α Cognition Ops lesson #4 · Pair α (Cognition) + DevOps anchor
Backward-Synergy-Reach → Multi-Agent Orchestration (Mon) · Observability for Multi-Agent LLM Systems (Tue) · Python Iterator Protocol Dev (Mon)
HTML render backfilled 2026-05-25 under approved scaffold + sea-green aether palette