asyncio TaskGroups and Cancellation Scopes
Structured concurrency for live risk monitors — the watchdog and kill-switch disciplines.
§ I Frame
Today's Ops lesson named a machine that runs several timers and several feed-readers at once and must be able to bring the whole set down the instant any one of them trips. That shape has a name in Python, and the name is structured concurrency. A flat pile of tasks launched with asyncio.create_task and never joined is the opposite shape: each task lives and dies on its own, a failure in one leaves the rest running with no one watching them, and the kill-switch the operator needs has no single handle to pull. The runtime watch wants a supervisor that owns its children as a group, fails as a group, and cancels as a group. Python 3.11 added that supervisor as asyncio.TaskGroup, and the watch is one of the cleanest things to build with it.
The watch also wants a timer that does not fire a callback but instead raises inside the work it was timing. The staleness watchdog is precisely a deadline: if no message arrives before the deadline, the read was too slow, and the slowness must propagate as cancellation rather than as a flag someone has to remember to check. Python expresses that deadline as asyncio.timeout, a context manager that cancels the block it wraps when the clock runs out. And the watch wants one more thing the other two would otherwise destroy: a way to run the risk-off cascade to completion even though the cancellation that triggered it is, by construction, trying to cancel everything in sight. That is asyncio.shield, and getting it right is the difference between a cascade that flattens the book and a cascade that gets cancelled halfway and leaves it half-open.
§ II Language Idiom: Three Primitives of Structured Concurrency
The first primitive is asyncio.TaskGroup. A task group is an async context manager. Tasks created through the group's create_task method are owned by the group, and the group's __aexit__ waits for all of them. The property that matters for the watch is the failure rule: if any task in the group fails with an exception other than asyncio.CancelledError, the group cancels every other task in the group, waits for them to finish, and then raises the failures bundled in an ExceptionGroup. This is the supervisor collapse the Ops lesson asked for. The watchdog task, the breaker task, and the feed-reader tasks all live in one group; when the watchdog raises its trip, the group cancels the breaker and the readers, and the trip propagates out to the cascade handler. No task leaks. The collapse is automatic and total.
The second primitive is asyncio.timeout, also new in Python 3.11: the deadline expressed as a context manager. The block inside async with asyncio.timeout(bound) is cancelled if it does not finish before bound seconds elapse, and the context manager converts that cancellation into a TimeoutError as the block exits. For the watchdog the bound is the input's staleness limit, and the block is a single read of the input. A read that returns before the bound resets the loop and starts a fresh deadline; a read that does not return in time is cancelled, and the resulting TimeoutError is the watchdog's trip. The watchdog does not poll a clock and compare timestamps. The deadline is the clock, and missing it is the signal.
The third primitive is asyncio.shield. A shielded awaitable keeps running even when the task awaiting it is cancelled: the awaiter still sees CancelledError, but the cancellation is not delivered to the shielded work. This is the one the watch cannot do without and the one most easily gotten wrong. The risk-off cascade is started by a trip, and it runs in a watch that is being torn down around it. Without a shield, any cancellation that reaches the task running the cascade flows straight into the cascade's own awaits and can cancel the flatten between the cancel-orders step and the send-flatten step. The shield holds the cascade's critical section outside the cancellation's reach, so the cascade completes on its own terms. The shield protects the work, not the awaiter: code that must not move on until the cascade is done has to wait for the shielded task after catching the cancellation.
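The split between the awaiter and the shielded work can be seen in a few lines. This is a sketch with illustrative names (critical, caller, demo); it shows only the primitive, not the lesson's cascade.

```python
import asyncio


async def critical(steps):
    # Stand-in for work that must not be interrupted halfway.
    steps.append("start")
    await asyncio.sleep(0.05)
    steps.append("finish")


async def demo():
    steps = []
    # Keep our own reference to the inner task, as the asyncio docs advise.
    inner = asyncio.create_task(critical(steps))

    async def caller():
        await asyncio.shield(inner)

    outer = asyncio.create_task(caller())
    await asyncio.sleep(0.01)   # let critical() get past its first step
    outer.cancel()
    try:
        await outer
    except asyncio.CancelledError:
        # The awaiter is cancelled even though the shielded work is not.
        steps.append("caller cancelled")

    await inner                 # the shielded task runs to completion
    return steps


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ['start', 'caller cancelled', 'finish']
```

The caller is gone before the critical work finishes, which is why anything that must run after the shielded work has to wait for the inner task explicitly.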
§ III Code Worked Example
Begin with the watchdog. The watchdog wraps a single read of one input in a deadline and loops. Each successful read resets the deadline by starting the next iteration; a read that exceeds the bound raises TimeoutError, which the watch treats as the staleness trip for that input. The bound and the input name travel together so the trip can say which feed died.
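    import asyncio


    class StalenessError(Exception):
        """Raised when an input delivers nothing within its staleness bound."""

        def __init__(self, source):
            super().__init__(f"input went stale: {source}")
            self.source = source


    async def watchdog(source, read_one, bound_seconds):
        while True:
            try:
                # Each iteration arms a fresh deadline around one read.
                async with asyncio.timeout(bound_seconds):
                    await read_one()
            except TimeoutError:
                # Name the feed that went quiet; the TimeoutError stays chained.
                raise StalenessError(source)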
The read_one callable awaits the next accepted message from one feed and returns when it arrives. The watchdog never inspects a timestamp. The deadline does the comparison, and a feed that holds its socket open while delivering nothing trips the deadline by failing to return. When the deadline fires, the watchdog converts the anonymous TimeoutError into a named StalenessError carrying the source, so the runtime-event log can record which input went blind.
The breaker is the second child. It reads the live mark, recomputes realized-plus-unrealized loss against the session ceiling, and raises when the integrated drawdown crosses the limit. The breaker reads the same Decimal quantities the 2026-06-02 lesson disciplined, so the loss comparison is exact to the tick and never a float that drifts past the ceiling by a rounding error.
    from decimal import Decimal


    class DrawdownBreached(Exception):
        """Raised when marked loss reaches the session ceiling."""

        def __init__(self, loss, ceiling):
            super().__init__(f"drawdown {loss} crossed ceiling {ceiling}")
            self.loss = loss
            self.ceiling = ceiling


    async def breaker(book, ceiling: Decimal):
        while True:
            # marked_loss() is expected to return a Decimal, so the compare is exact.
            loss = await book.marked_loss()
            if loss >= ceiling:
                raise DrawdownBreached(loss, ceiling)
            await asyncio.sleep(book.tick_interval)
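The breaker can be exercised without a live book by replaying a scripted sequence of losses. In this sketch ScriptedBook is a hypothetical stand-in that provides only the two members the breaker touches, marked_loss and tick_interval; the loss values are made up.

```python
import asyncio
from decimal import Decimal


class DrawdownBreached(Exception):
    def __init__(self, loss, ceiling):
        super().__init__(f"drawdown {loss} crossed ceiling {ceiling}")
        self.loss = loss
        self.ceiling = ceiling


async def breaker(book, ceiling: Decimal):
    while True:
        loss = await book.marked_loss()
        if loss >= ceiling:
            raise DrawdownBreached(loss, ceiling)
        await asyncio.sleep(book.tick_interval)


class ScriptedBook:
    """Hypothetical book that replays a fixed sequence of marked losses."""

    tick_interval = 0  # no delay between marks in the demo

    def __init__(self, losses):
        self._losses = iter(losses)

    async def marked_loss(self):
        return next(self._losses)


async def demo():
    losses = [Decimal("120.50"), Decimal("480.25"), Decimal("500.00"), Decimal("900.00")]
    try:
        await breaker(ScriptedBook(losses), Decimal("500.00"))
    except DrawdownBreached as err:
        # The trip happens at the first mark that reaches the ceiling.
        return err.loss, err.ceiling


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the comparison is >=, the breaker trips on the mark that equals the ceiling and never reads the 900.00 mark.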
Both children raise rather than return. That is the supervisor contract: a child that detects a trip condition raises, and the task group turns one child's raise into the cancellation of every sibling. The watch's main routine launches the children in a single group and lets the first raise collapse the rest.
    async def run_watch(book, inputs, ceiling):
        try:
            async with asyncio.TaskGroup() as tg:
                tg.create_task(breaker(book, ceiling))
                for source, read_one, bound in inputs:
                    tg.create_task(watchdog(source, read_one, bound))
        except* (StalenessError, DrawdownBreached) as eg:
            # One arm for both trips, so the cascade runs once even if
            # several children tripped in the same pass.
            await risk_off(book, cause=eg.exceptions[0])

The except* syntax is the handler for the exception group the task group raises. When a watchdog trips, the group cancels the breaker and the other watchdogs, bundles the StalenessError into an exception group, and the except* arm catches it. When the breaker trips first, the same arm catches its DrawdownBreached. Both types share one arm on purpose: except* runs every arm that matches, so two separate arms would run the cascade twice if a watchdog and the breaker tripped in the same pass of the event loop. The arm calls risk_off with the first exception as the cause, and the cause carries straight into the runtime-event log so the report names whether the trip was infrastructure or strategy.
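Now the part that must not be cancelled. By the time an except* arm runs, the group has already cancelled and awaited its children, so the group's own cancellation is spent. What can still arrive is a cancellation aimed at the task running run_watch: an enclosing timeout, a parent supervisor collapsing, an operator shutting the process down. Any of those lands on whatever await the cascade happens to be sitting in. Wrap the cascade's critical section in asyncio.shield so the cancel-orders-then-flatten sequence completes as a unit.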
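    async def risk_off(book, cause):
        # Keep a reference to the cascade task; the loop holds tasks only weakly.
        cascade = asyncio.create_task(_cascade(book, cause))
        try:
            await asyncio.shield(cascade)
        except asyncio.CancelledError:
            # We were cancelled, the cascade was not: let it finish, then re-raise.
            await asyncio.wait([cascade])
            raise


    async def _cascade(book, cause):
        await book.cancel_all_resting()
        await book.confirm_cancels()
        await book.flatten(exclude=book.protective_hedges())
        await book.log_event("risk_off", cause=cause)

The shield keeps _cascade out of reach of any cancellation delivered to risk_off. The order inside the cascade is the Ops lesson's order exactly: cancel resting orders, confirm the cancels, then flatten everything except the hedges marked protective. A cancellation arriving mid-flatten is raised in risk_off, not in _cascade, so it cannot split the cascade between the cancel step and the flatten step. risk_off catches that CancelledError, waits for the cascade task with asyncio.wait (which does not cancel the tasks it waits on), and only then re-raises. The flatten finishes, the event logs, and only then does the cancellation continue outward. Holding the task in a local variable matters as well: the event loop keeps only weak references to tasks, and the asyncio documentation advises saving a reference to any task passed to shield.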
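A sketch of the interruption test: cancel risk_off while the cascade is between steps and check that every step still ran. FakeBook is a hypothetical in-memory stand-in whose methods only record their names. The risk_off shown here keeps a reference to the cascade task and waits for it after a cancellation, which is an assumption of this sketch about how the caller should behave.

```python
import asyncio


class FakeBook:
    """Hypothetical book: each step takes a moment and records its name."""

    def __init__(self):
        self.events = []

    async def _step(self, name):
        await asyncio.sleep(0.01)
        self.events.append(name)

    async def cancel_all_resting(self):
        await self._step("cancel_all_resting")

    async def confirm_cancels(self):
        await self._step("confirm_cancels")

    def protective_hedges(self):
        return ()

    async def flatten(self, exclude):
        await self._step("flatten")

    async def log_event(self, kind, cause):
        await self._step(f"log:{kind}")


async def _cascade(book, cause):
    await book.cancel_all_resting()
    await book.confirm_cancels()
    await book.flatten(exclude=book.protective_hedges())
    await book.log_event("risk_off", cause=cause)


async def risk_off(book, cause):
    cascade = asyncio.create_task(_cascade(book, cause))  # keep a reference
    try:
        await asyncio.shield(cascade)
    except asyncio.CancelledError:
        await asyncio.wait([cascade])  # wait() does not cancel the task
        raise


async def demo():
    book = FakeBook()
    task = asyncio.create_task(risk_off(book, cause="test"))
    await asyncio.sleep(0.015)  # the cascade is now partway through
    task.cancel()
    try:
        await task
        outcome = "completed"
    except asyncio.CancelledError:
        outcome = "cancelled"
    return outcome, book.events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller ends up cancelled, yet the event list contains all four steps in order. Removing the shield, or awaiting the cascade task directly instead of through asyncio.wait, is a quick way to watch the same test fail.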
§ IV Connection to Today's Ops Lesson
The Ops lesson named three primitives and one ordering rule; the Python here is each of them made concrete. The drawdown circuit breaker is the breaker coroutine that raises DrawdownBreached. The staleness watchdog is the watchdog coroutine whose deadline trip raises StalenessError. The risk-off cascade is _cascade, and its cancel-before-flatten order is four awaits in sequence with the protective hedge excluded by name. The standing watch that runs all of them beside the strategy is the TaskGroup, whose collapse rule turns any one child's raise into the cancellation of the rest.
The single most important line in the file is the asyncio.shield in risk_off. The Ops lesson warned that a watch built on cancellation will, by its nature, try to cancel the cascade the instant the cascade starts. The shield is the answer to that exact hazard, and it is the one piece of the design that has no analog in the Ops prose because it is purely a property of how the language propagates cancellation. An operator who read only the Ops lesson would write a correct-looking cascade with no shield and discover under load that the flatten gets cancelled between the cancel and the send. The language detail is not decoration on the Ops design; it is the part of the design the Ops view cannot see.
§ V Prior-Lesson Reach
The 2026-05-30 validator-telemetry lesson built async RPC polling with gather over a fan-out of read tasks. That lesson's shape was cooperative and steady-state: many reads, all expected to succeed, results collected together. Today's watch is the adversarial counterpart of the same async toolkit. Where the telemetry lesson collected successes, the watch propagates the first failure and collapses on it. TaskGroup is the right tool here precisely where gather was the right tool there: gather collects every result and, when one awaitable fails, reports the failure but leaves the others running, while the group cancels everyone the moment one trips. The two lessons together map the toolkit's two postures: gather to harvest, group to supervise.
The 2026-06-02 decimal lesson supplies the breaker's arithmetic. The drawdown comparison loss >= ceiling is only as trustworthy as the type of loss. A float loss accumulated tick by tick drifts, and a breaker that compares a drifted float against an exact ceiling either trips a hair early or, worse, fails to trip when it should. The breaker reads Decimal quantities constructed and quantized under the 06-02 discipline, so the ceiling comparison is exact at the tick the operator set it to. The 2026-05-21 contextvars lesson supplies the cross-cutting thread: each task runs in its own copy of the context that was current when the task was created, so the runtime-event log written from inside the cascade can read the strategy-id and session-id from context variables without those values being threaded through every function call.
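The "fails to trip when it should" case is easy to reproduce. The sketch below accumulates ten losses of 0.1 each against a ceiling of 1.0, once as float and once as Decimal; the amounts are made up for the illustration.

```python
from decimal import Decimal


def accumulate_float(tick_loss, ticks):
    total = 0.0
    for _ in range(ticks):
        total += tick_loss
    return total


def accumulate_decimal(tick_loss, ticks):
    total = Decimal("0")
    for _ in range(ticks):
        total += tick_loss
    return total


float_loss = accumulate_float(0.1, 10)                  # 0.9999999999999999
decimal_loss = accumulate_decimal(Decimal("0.1"), 10)   # Decimal('1.0')

# The same ten ticks, compared against the same ceiling.
float_trips = float_loss >= 1.0                         # False: the breaker stays silent
decimal_trips = decimal_loss >= Decimal("1.0")          # True: the breaker trips

if __name__ == "__main__":
    print(float_loss, float_trips)
    print(decimal_loss, decimal_trips)
```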
§ VI Closing
Structured concurrency is the language admitting that tasks have lifetimes and that lifetimes have owners. A pile of fire-and-forget tasks is a watch with no kill-switch, because there is no single handle that brings the set down. The task group is the handle. The timeout is the trip. The shield is the one part of the machine that runs against the grain of everything else, holding the flatten together while the collapse it belongs to tries to take it down with the rest.
Write the watch as a group, trip it with a deadline, and shield the flatten. Then test the shield by cancelling the watch mid-cascade and proving the book still ends flat. A risk-off cascade that has never been interrupted in a test is a cascade whose worst case has not yet been observed.
Examine well. Reflect on this.