Hedronite · Synthesis Lesson · Data Science Tier · R + Python · Thu 2026-06-11

R and Python for On-Chain Event Analysis

Tidy data frames, distributional sanity checks, and the two-language discipline.

Lesson Class: Dev Synthesis

Tier / Language: Data Science — R + Python synthesis (first R lesson; first Thursday under tier rotation)

Week / Cycle: Week 4 of Cycle 1

Word Count: ~2,480

Paired Ops: Chain Indexing and On-Chain Data Pipelines (δ-Chain)

Paired Cert: Grounding Agents in Enterprise Knowledge — Bedrock Knowledge Bases, RAG, Copilot Context

Discipline: ROD v3 (universal-application) · clean code blocks, prose-located explanation

§ I Frame

Today the Dev track gains a second working language. The tier rotation makes Thursday the Data Science day, R anchors it, and Python rides alongside because the two languages split one job between them. This first lesson opens the R track where today's Ops lesson left off: the indexer has landed an exchange-transfer table, the heights are all present, and the question that remains is the one presence cannot answer. Does the data look like the chain?

That question is statistical, and statistical questions are the natural habitat of R. The language was built by statisticians for exactly this motion: load a table, interrogate its shape, plot its distributions, and notice what a row count never shows. Python then carries whatever the interrogation finds into the production pipeline, because the indexer, the schedulers, and the signal code from the prior γ and δ lessons already live there. Hold the division of labor as a coined rule: explore in R, enforce in Python. R is where an analyst discovers which checks matter; Python is where those checks run unattended at three in the morning.

Wickham and Grolemund open their canon with the claim that data science is the discipline of turning raw data into understanding, and they spend the entire Wrangle part of the book on one prerequisite: getting data into a shape where questions are cheap (R for Data Science, Preface p. 18; Tidy Data p. 176). The indexer did the engineering half. This lesson does the understanding half, twice, in two languages, on the same table.

Tome Grounding: R for Data Science — Wickham & Grolemund · Tidy Data p. 176 · The Tidyverse p. 18 · grounded-in (domain-canonical)
Python for Algorithmic Trading Cookbook · pandas chapter pp. 60–96 · referenced (domain-canonical)

§ II Language Idiom — Tidy Data and the Two Grammars

R's tidyverse rests on one structural rule, and it is the same rule the Ops lesson called the grain. Tidy data means every variable is a column, every observation is a row, and every type of observational unit is its own table (R for Data Science, p. 176). The indexer's transfer table is already tidy because Kimball's grain declaration and Wickham's tidy form are one idea wearing two vocabularies: a row is one transfer, full stop. This is why a well-engineered pipeline output drops straight into dplyr with no reshaping.

The working idiom in R is the pipe chain over a tibble. Five verbs do most of the work: filter picks rows, select picks columns, mutate derives new columns, group_by plus summarise collapse groups to statistics, and arrange orders the result. The pipe (|>) threads a data frame through the verbs so the code reads as the analysis plan. The same plan in pandas reads as method chaining: .query, .assign, .groupby, .agg. The Cookbook's pandas chapter works financial market data through exactly these motions (pp. 60–96), and the correspondence is close enough that an analyst fluent in one grammar reads the other within a week. Two dialects, one logic.
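The verb-for-verb correspondence can be sketched on the pandas side. This is a minimal illustration, not code from the Cookbook: the toy rows, the denom names `tokA` and `tokB`, and the derived column `is_large` are all hypothetical, chosen only so each dplyr verb has a visible pandas counterpart.

```python
import pandas as pd

# Hypothetical toy rows in the lesson's grain: one row per transfer.
transfers = pd.DataFrame({
    "height": [100, 100, 101, 102, 102, 102],
    "sender": ["a", "b", "a", "c", "b", "a"],
    "denom":  ["tokA", "tokB", "tokA", "tokA", "tokB", "tokA"],
    "amount": [5.0, 20.0, 7.0, 1.0, 30.0, 9.0],
})

summary = (
    transfers
    .query("amount > 2")                            # filter: pick rows
    .loc[:, ["denom", "amount"]]                    # select: pick columns
    .assign(is_large=lambda d: d["amount"] >= 10)   # mutate: derive a column
    .groupby("denom", as_index=False)               # group_by
    .agg(                                           # summarise
        n=("amount", "size"),
        median_amount=("amount", "median"),
        share_large=("is_large", "mean"),
    )
    .sort_values("median_amount", ascending=False)  # arrange
    .reset_index(drop=True)
)

print(summary)
```

Read top to bottom, the chain is the analysis plan in the same order the R pipe would state it; only the spelling of the verbs changes.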

Where the languages genuinely differ is posture. R sessions are conversational: every intermediate result prints, plots are one verb away, and the cost of one more question approaches zero. Python in production is contractual: typed boundaries per Monday's Protocol lesson, explicit failure modes per the kill-switch lesson, and Decimal arithmetic for anything denominated in money per the 06-02 quantization lesson. The amounts column in this table is token quanta; R may treat it as floating point for shape-finding, and the moment a number becomes a position or a payment, the Python side owns it under the decimal discipline.
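A small sketch of why the boundary sits where it does. It assumes amounts arrive as integer quanta and that the denom uses six decimal places; the helper name `quanta_to_units` and the six-place default are hypothetical, not taken from the quantization lesson.

```python
from decimal import Decimal


def quanta_to_units(quanta: int, decimals: int = 6) -> Decimal:
    """Convert integer base units to whole-token units without rounding.

    Assumption: `decimals` is the denom's precision (six is illustrative).
    """
    return Decimal(quanta).scaleb(-decimals)


# A double has a 53-bit significand, so integers above 2**53 can collide.
big = 2**53 + 1                      # 9007199254740993 quanta
as_double = float(big)               # what a floating-point column would hold
round_trip = int(as_double)          # no longer the value that was stored

exact = quanta_to_units(big)         # Decimal keeps every digit

print(big, round_trip, exact)
```

For shape-finding the lost quantum is invisible on a log-scale plot, which is why the doubles are tolerable there. In a ledger it is a discrepancy, which is why the Python side converts before any arithmetic that someone is owed.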

§ III Code Worked Example — Interrogation in R, Enforcement in Python

The table is the Ops lesson's settled-zone output: one row per transfer, with height, ts, tx_hash, sender, receiver, denom, and amount. First, the R interrogation. Load and confirm the grain holds. Duplicate natural keys would mean the idempotent upsert failed somewhere:

library(tidyverse)

transfers <- read_csv("transfers_settled.csv",
  col_types = cols(height = col_integer(), ts = col_datetime(),
                   amount = col_double(), .default = col_character()))

transfers |>
  count(tx_hash, height, sender, receiver, denom, amount) |>
  filter(n > 1)

An empty result is the pass condition. Next, the gap arithmetic the Ops lesson named as the missing_heights check. Set difference between the expected height range and the observed heights is one line in either language:

expected <- seq(min(transfers$height), max(transfers$height))
missing  <- setdiff(expected, unique(transfers$height))
length(missing)

A zero here proves every height landed. It proves nothing about whether the blocks were served whole. For that, look at the shape of activity. Transfers per height should be a noisy but stationary band; a decoder bug or a half-served block shows up as a regime change in that band. The plot takes five lines, and this cheapness is the entire argument for the R side of the discipline:

transfers |>
  count(height, name = "transfers_per_block") |>
  ggplot(aes(height, transfers_per_block)) +
  geom_line(linewidth = 0.3) +
  geom_smooth(method = "loess", span = 0.1)

The analyst reads this plot the way Tuesday's operator read node lag: not for any single value but for the discontinuity. A six-hour outage backfilled correctly shows nothing, since the heights were repaired at full density. The same outage backfilled against a pruned node that silently served empty results shows a flat shelf at zero density across the gap, with every height technically present. The completeness check passed; only the distribution catches the lie. One more interrogation pays for itself. Plot the empirical cumulative distribution of transfer sizes, because decoder drift that drops one event type shifts the size distribution before any count looks wrong:

transfers |>
  mutate(log_amount = log10(amount)) |>
  ggplot(aes(log_amount)) +
  stat_ecdf() +
  facet_wrap(vars(denom), scales = "free_x")
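If the ECDF shift ever needs to become a number rather than a picture, one candidate (an assumption here, not a check the lesson adopts) is the largest vertical gap between a trusted baseline ECDF and the current one, which is the two-sample Kolmogorov-Smirnov statistic. The sketch below uses only the standard library; the function names and the toy log-amounts are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F where F(x) is the fraction of the sample that is <= x."""
    xs = sorted(sample)
    if not xs:
        raise ValueError("ecdf needs at least one observation")
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def max_ecdf_gap(baseline, current):
    """Largest vertical distance between the two empirical CDFs.

    Both ECDFs are step functions that change only at observed values,
    so checking the pooled observations is enough to find the maximum.
    """
    f_base, f_cur = ecdf(baseline), ecdf(current)
    points = sorted(set(baseline) | set(current))
    return max(abs(f_base(x) - f_cur(x)) for x in points)


# Hypothetical log10 amounts: the baseline has a small and a large mode,
# the current sample has lost the large mode, as if one event type dropped.
baseline = [1, 1, 2, 2, 5, 5, 6, 6]
current = [1, 1, 2, 2]

gap_when_mode_missing = max_ecdf_gap(baseline, current)
gap_when_identical = max_ecdf_gap(baseline, baseline)
print(gap_when_mode_missing, gap_when_identical)
```

The row count of `current` looks unremarkable in isolation; the gap is what registers that half the distribution is gone. Where to set a threshold on that gap is exactly the kind of edge the R exploration would have to choose.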

Now the Python enforcement. The exploration found three checks worth running forever: key uniqueness, height completeness, and per-block density inside a tolerated band. They compile into a small gate the pipeline runs after every backfill, in the dialect the rest of the estate already speaks:

Python

import pandas as pd

def verify_settled(df: pd.DataFrame, density_floor: float, density_ceiling: float) -> list[str]:
    failures = []
    key = ["tx_hash", "height", "sender", "receiver", "denom", "amount"]
    if df.duplicated(subset=key).any():
        failures.append("duplicate natural keys")
    expected = set(range(df["height"].min(), df["height"].max() + 1))
    if missing := expected - set(df["height"].unique()):
        failures.append(f"{len(missing)} missing heights")
    density = df.groupby("height").size().rolling(600, min_periods=600).median().dropna()
    if ((density < density_floor) | (density > density_ceiling)).any():
        failures.append("per-block density left tolerated band")
    return failures

The function returns a list of failures rather than raising on the first, the same accumulate-then-verdict shape the eval-pipeline and pre-trade-gate lessons used: downstream wants the whole indictment, not the first count. The rolling median over six hundred blocks is the production translation of the loess curve the analyst eyeballed; where the human read a smoothed line for shelves, the gate reads a rolling statistic against a band whose edges the R exploration chose. That choice of edges is the handoff, and it is the whole point of the two-language loop. R discovered what normal looks like; Python now refuses anything that doesn't.
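The density check is worth isolating, with one caveat made explicit. In a table whose grain is one row per transfer, a height with no transfers contributes no rows, so a group-by on height alone cannot report a zero. The sketch below is a variant, not the lesson's `verify_settled`: it adds a zero-fill step over the full height range before taking the rolling median. The function name, the three-block window, and the band edges are hypothetical and sized for the toy data; whether the zero-fill is needed depends on whether an empty block can occur in the settled table.

```python
import pandas as pd


def density_band_breaches(heights: pd.Series, window: int,
                          floor: float, ceiling: float) -> list[int]:
    """Heights where the rolling median of transfers per block leaves the band."""
    counts = heights.value_counts().sort_index()
    # Make empty blocks explicit zeros instead of absent rows.
    full_range = pd.RangeIndex(int(counts.index.min()),
                               int(counts.index.max()) + 1)
    counts = counts.reindex(full_range, fill_value=0)
    # min_periods=window: no verdict until a full window is available.
    rolling_median = counts.rolling(window, min_periods=window).median().dropna()
    outside = (rolling_median < floor) | (rolling_median > ceiling)
    return rolling_median[outside].index.tolist()


# Toy table: four transfers in each of blocks 1-3 and 7-9, none in 4-6.
heights = pd.Series([h for h in (1, 2, 3, 7, 8, 9) for _ in range(4)])

naive_min = int(heights.value_counts().min())   # without zero-fill: 4
breaches = density_band_breaches(heights, window=3, floor=2, ceiling=10)
print(naive_min, breaches)
```

Two properties follow from `min_periods` equal to the window. A table shorter than the window yields an empty series and therefore no density verdict at all, so a caller may want to treat "too short to judge" as its own failure. And a breach is reported at the height where the window's median crosses the band, which trails the first bad block by roughly half a window.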

§ IV Connection to Today's Ops Lesson

The Ops lesson closed §IV with the observation that presence is a weaker claim than plausibility, and this lesson is that sentence unpacked into working code. The indexer's own checks are structural: cursor committed, parent hashes consistent, heights complete. Every one of them can pass while the table is wrong, because they audit the pipeline's motion rather than the data's shape. The distributional checks audit the shape. The two layers fail independently, which is exactly why both exist; it is the same redundancy argument the Ops lesson made when the indexer re-verified parent hashes behind an already-health-gated RPC pool. The reader gates the source, the pipeline gates the rows, and the analyst gates the distribution.

Explore in R, Enforce in Python: The Ops lesson's finality gate said a row earns trust by depth. The density band says a table earns trust by shape. Both are refusals to grant belief at arrival, applied at different grains. R is where the shape of normal is discovered; Python is where that shape becomes a standing refusal.

§ V Prior-Lesson Reach

The decimal quantization lesson (Python, 2026-06-02) governs the boundary this lesson deliberately walks up to and stops at: amounts may live as doubles while the question is distributional shape, and must become Decimal the moment any number is owed to a ledger, a position, or a counterparty. The ECDF plot tolerates floating point; a cost-basis calculation does not.

The kill-switch lesson (Python, 2026-06-05) built the runtime monitors that watch live systems; today's gate is their batch sibling. Same verdict-producing posture, different cadence: the watchdog runs on the clock, the verification gate runs on the event, and both exist so that no human has to remember to check.

The Protocols and stage-composition lesson (Python, 2026-06-08) supplies the production frame verify_settled slots into: a typed stage in a typed pipeline, composable behind a Protocol, so the retraining and signal pipelines can demand a verification stage without caring which checks it carries that week.
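What "composable behind a Protocol" could look like, as a minimal sketch. It assumes rows are plain dictionaries rather than a DataFrame so the example stays in the standard library; `VerificationStage`, `run_gate`, and the two check functions are hypothetical names, not the interfaces from the 06-08 lesson.

```python
from typing import Any, Protocol, Sequence

Row = dict[str, Any]
KEY = ("tx_hash", "height", "sender", "receiver", "denom", "amount")


class VerificationStage(Protocol):
    """Any callable that inspects the rows and returns failure messages."""

    def __call__(self, rows: Sequence[Row]) -> list[str]: ...


def check_unique_keys(rows: Sequence[Row]) -> list[str]:
    keys = [tuple(row[k] for k in KEY) for row in rows]
    return ["duplicate natural keys"] if len(set(keys)) < len(keys) else []


def check_heights_complete(rows: Sequence[Row]) -> list[str]:
    heights = {row["height"] for row in rows}
    if not heights:
        return []
    missing = set(range(min(heights), max(heights) + 1)) - heights
    return [f"{len(missing)} missing heights"] if missing else []


def run_gate(rows: Sequence[Row],
             stages: Sequence[VerificationStage]) -> list[str]:
    """Run every stage and accumulate failures; never stop at the first."""
    failures: list[str] = []
    for stage in stages:
        failures.extend(stage(rows))
    return failures


def make_row(tx_hash: str, height: int) -> Row:
    return {"tx_hash": tx_hash, "height": height, "sender": "a",
            "receiver": "b", "denom": "tokA", "amount": 1}


# Height 12 is absent and the first row appears twice.
rows = [make_row("h1", 10), make_row("h1", 10),
        make_row("h2", 11), make_row("h3", 13)]
failures = run_gate(rows, [check_unique_keys, check_heights_complete])
print(failures)
```

The caller depends only on the shape of a stage, so swapping in a density check or retiring one is a change to the list passed to `run_gate`, not to the pipeline that calls it.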

Paired Ops → Archmagus-Stack/δ-Chain/Synthesis-Lessons/2026-06-11-chain-indexing-and-on-chain-data-pipelines-event-extraction-reorg-safe-ingestion-and-the-backfill-discipline
Paired Cert → Cert-Prep/AWS/2026-06-11-grounding-agents-in-enterprise-knowledge-amazon-bedrock-knowledge-bases-the-rag-pipeline-and-github-copilot-context

§ VI Closing

A new language entered the curriculum today, and it entered through a working seam rather than a syntax tour. R's claim on the Thursday slot is the conversational interrogation of data: tidy verbs, cheap plots, and the statistician's reflex of looking at shape before trusting summary. Python's claim is permanence: whatever the interrogation finds worth keeping becomes a typed, decimal-disciplined, verdict-shaped gate in the production line. Explore in R, enforce in Python. Two grammars, one discipline, and a transfer table that now has to satisfy both before anyone trades on it.

Run the four R checks against any table you already trust. If the plots show you nothing you did not know, the trust was earned. If they show you a shelf, a spike, or a missing mode, the trust was a habit.

🫡 ⚖️ 📜

Leo.Syri — Praetor Consulate, Imperium Luminaura
Filed 2026-06-11 Thursday Fajr · Data Science tier (R + Python synthesis) · first R lesson in the curriculum
Backward-Synergy-Reach → decimal quantization (Py 06-02) · kill-switch monitors (Py 06-05) · Protocols & stage composition (Py 06-08)
HEDRONITE-AETHER-THEME v2.1 applied · wood-accent meta-card per Data Science tier · per-language code borders R=wood / Python=water · tome-grounded per LEO-AMEND-2026-06-10-001