Hedronite · Cert-Prep · Google Cloud (GenAI Eng + Pro ML) · Mon 2026-06-01 · W3 / C1

Vertex AI Model Monitoring and Generative-AI Evaluation

Skew, drift, and the cross-cert eval discipline.

Lesson Class: Cert-Prep · Google (combined)

Cert Track: GenAI Engineer + Pro ML Engineer

Vendor: Google Cloud

Track Position: 2nd Google lesson; W3 / C1

Word Count: ~2,500

Paired Ops: Evaluation Pipelines for Multi-Agent Cognition

Paired Dev: Rust Trait Objects + Async Streams for Eval Pipelines

Discipline: ROD v3 (universal-application)

§ I Frame

The inaugural Google cert lesson, delivered the prior Monday, presented Vertex AI as a unified platform: Workbench, Pipelines, Training, Model Registry, Endpoints. That lesson established the platform fluency both Google credentials require. This lesson takes up the layer that sits on top of every endpoint: the discipline that decides whether the model the platform serves should stay served.

Vertex AI offers two distinct eval surfaces. The first is Vertex AI Model Monitoring, the post-deployment monitoring service for classical ML models and the Pro ML credential's center of gravity. The second is the Vertex AI Gen AI Evaluation Service (and its predecessor AutoSxS), the eval framework for generative models and the GenAI Engineer credential's center of gravity. Different shapes of problem, same operational instinct: the operator watches the quality surface continuously and acts before users notice a regression.

Today's Ops lesson named four primitives — eval set, scorer, aggregator, verdict surface. Both Google services map onto those primitives, with vendor-specific names. This lesson treats the mapping once, then pivots per credential.

§ II Domain Foundations (shared ground across both certs)

Three concepts unify how both Google eval services think.

The baseline. Every Google eval service compares production against something stable. For Model Monitoring the baseline is the training dataset (for skew detection) or a recent production window (for drift detection). For Gen AI Evaluation the baseline is a labeled eval dataset, a previous candidate, or a known-good reference model. The baseline is the operator's anchor; choosing it deliberately is the first eval discipline.

The metric. Each service computes a quantitative signal that lets the operator say "yes, the quality moved" or "no, the quality is stable". Model Monitoring computes statistical distance metrics (L-infinity for categorical, Jensen-Shannon Divergence for numerical) between distributions. Gen AI Evaluation computes scores from autoraters, pointwise metrics (fluency, coherence, safety), and pairwise comparisons against baselines. The metric is the operator's eye; choosing the right one for the question is the second eval discipline.
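The two distance metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration of the math, not the code Model Monitoring runs; the function names and the category shares are hypothetical, and the choice of log base 2 (which bounds the divergence between 0 and 1) is an assumption.

```python
import math

def l_infinity(p, q):
    """Largest absolute gap in probability mass over any single category."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (log base 2), so the result is in [0, 1]."""
    keys = set(p) | set(q)
    mid = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mid(dist):
        # Terms with zero mass contribute nothing to the sum.
        return sum(v * math.log2(v / mid[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mid(p) + 0.5 * kl_to_mid(q)

# Hypothetical category shares: baseline versus production.
baseline = {"shoes": 0.5, "hats": 0.3, "bags": 0.2}
production = {"shoes": 0.2, "hats": 0.3, "bags": 0.5}

print(round(l_infinity(baseline, production), 3))
print(round(jensen_shannon(baseline, production), 3))
```

Both functions take distributions as dictionaries of bucket or category to probability mass, so a numerical feature has to be bucketed before it can be compared.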

The threshold. A metric without a threshold is data without a decision. Both services let the operator set thresholds that trigger alerts when crossed. Model Monitoring lets the operator pick a numeric threshold per feature; Gen AI Evaluation lets the operator gate promotion on per-metric thresholds in the Vertex AI Pipelines orchestration layer. The threshold is the operator's hand; setting it where action is warranted is the third eval discipline.

Baseline, metric, threshold. Every Google eval workflow assembles these three. The credentials differ in which workflows they emphasize.

§ III Cert-A Flavor: Professional Machine Learning Engineer (Vertex AI Model Monitoring)

The Pro ML exam expects fluency with Model Monitoring's two detection patterns and the operational discipline around each.

Training-serving skew detection. A model trained on a dataset whose statistical distribution differs from production traffic produces predictions whose quality decays from day one. Model Monitoring compares the feature distribution of production prediction inputs against the feature distribution of the training dataset, computing distance metrics per feature. When a feature's distance exceeds the operator's threshold, Model Monitoring emits an alert with the offending feature and the magnitude.

The exam tests two practical patterns. First, the operator must point Model Monitoring at the training dataset (typically by reference to its BigQuery table or Cloud Storage path) at the time monitoring is configured — production inference inputs flow in automatically. Second, the exam expects the operator to know which distance metric Model Monitoring uses for which feature type: L-infinity distance for categorical features, Jensen-Shannon Divergence for numerical features. Both come from the TensorFlow Data Validation library Model Monitoring uses internally.
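The skew check described above can be sketched as a per-feature comparison between training rows and serving rows. This is an illustrative sketch for a categorical feature only; `skew_alerts`, the row layout, and the sample data are hypothetical and do not reflect the Model Monitoring API.

```python
from collections import Counter

def distribution(values):
    """Normalize raw categorical values into probability mass per category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def l_infinity(p, q):
    """Largest absolute gap in probability mass over any single category."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def skew_alerts(training_rows, serving_rows, thresholds):
    """Return {feature: distance} for every feature whose distance exceeds its threshold."""
    alerts = {}
    for feature, limit in thresholds.items():
        train_dist = distribution(row[feature] for row in training_rows)
        serve_dist = distribution(row[feature] for row in serving_rows)
        distance = l_infinity(train_dist, serve_dist)
        if distance > limit:
            alerts[feature] = distance
    return alerts

# Hypothetical data: training is 60/40 shoes/hats, serving is 20/80.
training_rows = [{"category": "shoes"}] * 6 + [{"category": "hats"}] * 4
serving_rows = [{"category": "shoes"}] * 2 + [{"category": "hats"}] * 8

alerts = skew_alerts(training_rows, serving_rows, {"category": 0.3})
print(alerts)
```

The alert carries the offending feature and the magnitude, which mirrors the shape of the alert the lesson describes.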

Prediction drift detection. Where skew compares production against training, drift compares production against itself over time — yesterday's input distribution against today's, or last week's against this week's. Drift detection runs without requiring the training dataset (useful when the training data has been deleted for retention reasons) and catches the case where the world shifts under a model that was originally well-fit.

The exam expects the operator to distinguish skew from drift in scenario questions: when training data is available and the operator suspects the original fit is bad, skew detection answers the question; when training data is unavailable and the operator suspects the world has shifted under a model that was originally well-fit, drift detection answers it.
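Drift differs from skew only in the baseline: each production window is compared against the window before it. A minimal sketch under stated assumptions follows; the bucket edges, the window contents, and the helper names are hypothetical, and the divergence uses log base 2.

```python
import math

def bucketize(values, edges):
    """Histogram as probabilities; bucket i covers [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fell inside the bucket range")
    return [c / total for c in counts]

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits between two aligned histograms."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_to_mid(dist):
        return sum(x * math.log2(x / m) for x, m in zip(dist, mid) if x > 0)

    return 0.5 * kl_to_mid(p) + 0.5 * kl_to_mid(q)

def drift_scores(windows, edges):
    """Distance between each production window and the one before it."""
    dists = [bucketize(w, edges) for w in windows]
    return [jensen_shannon(prev, cur) for prev, cur in zip(dists, dists[1:])]

# Hypothetical daily windows of one numerical feature; the third day shifts upward.
windows = [[10, 12, 30, 35], [11, 14, 31, 36], [60, 65, 70, 90]]
scores = drift_scores(windows, edges=[0, 25, 50, 75, 100])
print(scores)
```

No training data appears anywhere in the sketch, which is why drift detection still works after the training set has been deleted.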

Five operational patterns the Pro ML exam covers. First, the monitoring job runs on a sampled fraction of production traffic, not every input — the operator picks the sampling rate to balance signal quality against cost. Second, alerts route via Cloud Monitoring; integrating with Cloud Pub/Sub for downstream automation is a common scenario. Third, Model Monitoring works only on tabular models deployed to Vertex AI Endpoints — it does not monitor batch prediction jobs or custom serving outside Vertex AI Endpoints. Fourth, the monitoring job must be configured before the model starts receiving traffic the operator wants monitored — retroactive monitoring of past production windows is not available. Fifth, drift detection on numerical features is sensitive to the bucketing scheme; the operator picks bucket boundaries that match the feature's expected production range.
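The first pattern, monitoring a sampled fraction of traffic, can be illustrated with a simple per-request coin flip. This is a sketch of the idea only; `sample_requests`, the seed, and the request shape are hypothetical, and the managed service's actual sampling mechanism is not described in this lesson.

```python
import random

def sample_requests(requests, rate, seed=None):
    """Keep each request independently with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < rate]

# Hypothetical traffic: 10,000 prediction requests, monitored at 10 percent.
requests = [{"request_id": i} for i in range(10_000)]
monitored = sample_requests(requests, rate=0.10, seed=7)
print(len(monitored))
```

A lower rate cuts monitoring cost but leaves fewer examples per window, so the per-feature distributions become noisier and thresholds are harder to set.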

§ IV Cert-B Flavor: Generative AI Engineer (Gen AI Evaluation Service)

The GenAI Eng exam expects fluency with the Gen AI Evaluation Service's approach to evaluating outputs that resist exact-match scoring.

The Gen AI Evaluation Service replaced AutoSxS in 2024 as Google's primary eval framework for generative models. The service offers pointwise evaluation (score one model's output against a metric) and pairwise evaluation (compare two models' outputs side-by-side). Both flavors are exam-relevant; the pivot between them is the question the operator is asking.

Pointwise evaluation answers the question "is this model good enough?" The operator picks one or more pointwise metrics (fluency, coherence, safety, groundedness for RAG, question-answering quality, summarization quality) and runs the eval against a prepared dataset. Each metric is computed by an autorater (a Gemini model with a specialized rubric prompt). The output is a per-example score and a per-metric aggregate.
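The step from per-example scores to a per-metric aggregate can be sketched as follows. The row layout, the score values, and the use of a plain mean are assumptions for illustration; the service's own output schema and score scales may differ.

```python
from statistics import mean

def aggregate_pointwise(rows):
    """Fold per-example autorater scores into one mean score per metric."""
    by_metric = {}
    for row in rows:
        for metric, score in row["scores"].items():
            by_metric.setdefault(metric, []).append(score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}

# Hypothetical autorater output for three eval examples.
rows = [
    {"example_id": 1, "scores": {"fluency": 4, "coherence": 5}},
    {"example_id": 2, "scores": {"fluency": 5, "coherence": 3}},
    {"example_id": 3, "scores": {"fluency": 3, "coherence": 4}},
]

summary = aggregate_pointwise(rows)
print(summary)
```

Keeping the per-example rows alongside the aggregate matters in practice: the aggregate answers "good enough?", while the rows show which inputs dragged the score down.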

Pairwise evaluation answers the question "is candidate A better than candidate B?" The autorater receives two outputs for the same input and decides which is better against a chosen quality dimension. The output is a win-rate, a tie-rate, and per-example judgments. Pairwise eval is the pattern for promotion gating: a candidate that does not beat its predecessor in pairwise comparison does not promote.
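The win-rate and tie-rate fall out of the per-example judgments by simple counting. The sketch below assumes each judgment is one of the hypothetical labels "candidate", "baseline", or "tie"; the labels and the function name are illustrative, not the service's output format.

```python
from collections import Counter

def pairwise_summary(judgments):
    """Turn per-example autorater verdicts into win and tie rates."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "candidate_win_rate": counts["candidate"] / total,
        "baseline_win_rate": counts["baseline"] / total,
        "tie_rate": counts["tie"] / total,
    }

# Hypothetical verdicts over ten eval inputs.
judgments = ["candidate"] * 6 + ["baseline"] * 3 + ["tie"]
summary = pairwise_summary(judgments)

# The candidate promotes only if it beats its predecessor.
promote = summary["candidate_win_rate"] > summary["baseline_win_rate"]
print(summary, promote)
```

Ties are reported separately rather than split between the two sides, so a high tie-rate stays visible as a sign that the two versions are hard to tell apart on the chosen dimension.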

Six concept clusters the GenAI Eng exam covers. First, the autorater is a Gemini model under the hood; its judgments inherit Gemini's biases and limitations, and the exam expects the operator to know this and to validate the autorater on a human-labeled subset before trusting it at scale. Second, the eval dataset format is a list of records with input, prediction, and optional reference fields; some metrics require extra ground-truth material (a reference answer, or for groundedness the context document the answer must be supported by) and others do not. Third, the service computes metrics by calling the autorater per record, which has cost and latency implications; large eval datasets are budgeted accordingly. Fourth, custom metrics can be defined by writing a custom rubric prompt; this is the extension point when the built-in metrics do not cover the operator's question. Fifth, the Gen AI Evaluation Service integrates with Vertex AI Pipelines for automated eval gating: a pipeline step runs the eval, and a subsequent step gates promotion on the score crossing a threshold. Sixth, RAG-specific metrics (context precision, context recall, groundedness, answer relevance) form a named cluster the exam treats as a unit; the operator should be ready to pick which RAG metric answers which production question.
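Validating the autorater on a human-labeled subset, the first cluster above, can be as simple as measuring how often the two agree. The sketch below is an illustration under assumptions: the record fields follow the input, prediction, and reference layout described in this section, while the label values and the `agreement_rate` helper are hypothetical.

```python
def agreement_rate(autorater_labels, human_labels):
    """Fraction of examples where the autorater matches the human label."""
    if len(autorater_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("at least one labeled example is required")
    matches = sum(a == h for a, h in zip(autorater_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical eval records; `reference` is optional and absent on the second one.
records = [
    {"input": "Where is my order?", "prediction": "It ships Friday.", "reference": "Ships Friday."},
    {"input": "Can I return shoes?", "prediction": "Yes, within 30 days."},
]

# Hypothetical pass/fail labels on a five-example human-labeled subset.
autorater = ["pass", "pass", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail"]

rate = agreement_rate(autorater, human)
print(rate)
```

Raw agreement is the simplest possible check; where the labels are heavily imbalanced, a chance-corrected statistic would be a more careful choice.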

§ V Worked Example: A Cross-Cert Eval Pipeline

A retail company has two production models on Vertex AI. The first is a tabular demand-forecasting model trained on three years of sales history — Pro ML territory. The second is a customer-support agent built on Gemini with a RAG retrieval over the product knowledge base — GenAI Eng territory. The operator wants one eval discipline that covers both.

For the forecasting model, Model Monitoring is configured at deployment time. The training dataset reference points at the BigQuery table holding three years of sales history. Per-feature thresholds are set: L-infinity 0.3 for the product-category feature, Jensen-Shannon Divergence 0.2 for the price-tier numerical feature. A 10-percent sample of production prediction requests is monitored. Cloud Monitoring routes any threshold breach to a Pub/Sub topic the on-call rotation reads.
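The forecasting setup above can be written down as a small configuration plus a breach check. The thresholds and sampling rate come from the worked example; the dictionary layout, the feature keys, the observed distances, and the `find_breaches` helper are hypothetical and are not the Model Monitoring configuration schema.

```python
# Values from the worked example; the structure itself is an assumption.
MONITORING_CONFIG = {
    "sample_rate": 0.10,
    "thresholds": {
        "product_category": ("l_infinity", 0.3),
        "price_tier": ("jensen_shannon", 0.2),
    },
}

def find_breaches(observed, config):
    """Return one alert record per feature whose distance exceeds its threshold."""
    alerts = []
    for feature, (metric, limit) in config["thresholds"].items():
        value = observed.get(feature)
        if value is not None and value > limit:
            alerts.append(
                {"feature": feature, "metric": metric, "value": value, "threshold": limit}
            )
    return alerts

# Hypothetical distances computed over one monitoring window.
observed = {"product_category": 0.12, "price_tier": 0.27}
alerts = find_breaches(observed, MONITORING_CONFIG)
print(alerts)
```

Each alert record is the kind of payload the on-call rotation would want on the Pub/Sub topic: which feature moved, by how much, and against which limit.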

For the support agent, a Vertex AI Pipelines workflow runs the Gen AI Evaluation Service nightly against a 200-input eval dataset drawn from the prior week's production tickets. The pipeline runs both pointwise eval (groundedness, answer-relevance, safety) and pairwise eval against the prior agent version (win-rate on helpfulness). A pipeline gate promotes the candidate to receive a higher traffic share only if groundedness exceeds 0.85, safety holds at 1.0, and the pairwise win-rate exceeds 55 percent.
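The promotion gate for the support agent reduces to three comparisons. The thresholds below are the ones stated in the worked example; the metric key names and the `promotion_gate` function are hypothetical, and in a real pipeline this logic would live in a step downstream of the eval step.

```python
def promotion_gate(metrics):
    """Return (passed, failed_checks) for the support-agent promotion gate."""
    checks = {
        "groundedness": metrics["groundedness"] > 0.85,
        # "Holds at 1.0" is read here as: no example may fail the safety metric.
        "safety": metrics["safety"] >= 1.0,
        "pairwise_win_rate": metrics["pairwise_win_rate"] > 0.55,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)

# Hypothetical nightly results for two candidates.
good = {"groundedness": 0.91, "safety": 1.0, "pairwise_win_rate": 0.58}
weak = {"groundedness": 0.91, "safety": 1.0, "pairwise_win_rate": 0.52}

print(promotion_gate(good))
print(promotion_gate(weak))
```

Returning the list of failed checks alongside the verdict keeps the gate debuggable: a held candidate comes with the reason it was held.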

Cross-cert pattern. Both eval pipelines pick a baseline (training data for the forecasting model, prior version for the support agent), a metric (statistical distance for the forecasting model, autorater scores for the support agent), and a threshold (numerical breach for the forecasting model, gate-condition for the support agent). The operational shape (baseline, metric, threshold) is identical; the Google service that implements it differs by model class.

§ VI Connection to Today's Ops + Dev Lessons

Today's Ops lesson named four primitives: eval set, scorer, aggregator, verdict surface. The mapping into Google's services lands cleanly.

The Ops lesson's eval set is the training-dataset baseline (for skew), the recent-production-window baseline (for drift), the labeled eval dataset (for Gen AI pointwise), or the candidate-versus-baseline pair (for Gen AI pairwise). The Ops lesson called for three layers — regression, canary, adversarial — and Google's services compose to provide all three: the labeled eval dataset is the regression layer; the production sampling Model Monitoring runs is the canary layer; the adversarial layer is operator-supplied as a curated dataset run through the same Gen AI Evaluation Service workflow.

The Ops lesson's scorer is the per-feature distance computation in Model Monitoring or the per-metric autorater call in the Gen AI Evaluation Service. The three families the Ops lesson named map as follows: reference-based scoring corresponds to the Gen AI Evaluation metrics that require a reference or context (groundedness), property-based scoring to the safety and refusal metrics, and LLM-as-judge to the entire autorater pattern.

The Ops lesson's aggregator is Cloud Monitoring (for Model Monitoring's per-feature metrics) and the Vertex AI Pipelines step that folds autorater outputs into per-candidate aggregate scores (for Gen AI Evaluation). The verdict surface is the alert routing into Pub/Sub (Model Monitoring) or the pipeline gate that promotes or holds a candidate (Gen AI Evaluation).
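The four primitives compose into a single loop regardless of which service implements them. The sketch below is a vendor-neutral illustration; `run_eval`, the exact-match scorer, the 0.7 pass mark, and the sample records are all hypothetical stand-ins for whichever scorer, aggregator, and verdict the operator plugs in.

```python
from statistics import mean

def run_eval(eval_set, scorer, aggregator, verdict):
    """Score every example, fold the scores, then decide."""
    scores = [scorer(example) for example in eval_set]
    summary = aggregator(scores)
    return summary, verdict(summary)

def exact_match(example):
    """Reference-based scorer: 1.0 when prediction equals reference."""
    return 1.0 if example["prediction"] == example["reference"] else 0.0

def passes(summary):
    """Verdict surface: hypothetical pass mark of 0.7."""
    return summary >= 0.7

# Hypothetical eval set of four records.
eval_set = [
    {"prediction": "blue", "reference": "blue"},
    {"prediction": "red", "reference": "green"},
    {"prediction": "tall", "reference": "tall"},
    {"prediction": "42", "reference": "42"},
]

summary, passed = run_eval(eval_set, exact_match, mean, passes)
print(summary, passed)
```

Swapping `exact_match` for a distance computation or an autorater call changes the scorer without touching the loop, which is the sense in which both Google services share one shape.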

Today's Rust Dev lesson encoded the same shape in types. Where the Google services run the eval pipeline as managed infrastructure, the Rust lesson showed how an operator builds the same loop in code — useful when the operator's eval needs do not fit Google's offerings, or when the operator wants the eval loop in a deployment Google does not own. The credential exam expects fluency with Google's managed offering; the Dev lesson keeps the operator's hands on the layer underneath.

§ VII Practice Questions

Q1 · Skew vs Drift

A team has a tabular regression model deployed to a Vertex AI Endpoint. The training dataset has been deleted under a data-retention policy. The team wants to detect quality degradation in production. Which Model Monitoring detection pattern applies?

Answer. Prediction drift detection. Drift compares production against itself over time, requiring no training dataset reference. Skew detection would be the answer if the training dataset were available.

Q2 · Distance metrics by feature type

An operator configures Model Monitoring on a Vertex AI Endpoint serving a model that takes both categorical and numerical features. Which distance metrics does Model Monitoring use for each feature type?

Answer. L-infinity distance for categorical features; Jensen-Shannon Divergence for numerical features. Both come from the TensorFlow Data Validation library Model Monitoring uses internally.

Q3 · Pairwise vs Pointwise

A team uses the Gen AI Evaluation Service to compare a candidate Gemini-tuned model against the production version. The team wants to know if the candidate is meaningfully better at customer-support responses before promoting. Which eval pattern is appropriate?

Answer. Pairwise evaluation against the production version, with helpfulness or task-success as the chosen quality dimension. Pairwise eval reports a win-rate the operator can gate promotion on; pointwise eval would not answer the comparative question directly.

Q4 · RAG groundedness

A RAG application built on Vertex AI Vector Search and Gemini returns answers that occasionally include facts not present in the retrieved context. Which Gen AI Evaluation Service metric measures this failure mode most directly?

Answer. Groundedness. The groundedness metric measures whether the model's output is supported by the provided context. Answer-relevance measures whether the answer addresses the question; context-precision and context-recall measure retrieval quality. Groundedness is the metric for the named failure.

Q5 · Eval gate in Pipelines

A team wants to gate promotion of a fine-tuned Gemini variant on automated eval results inside a Vertex AI Pipelines workflow. Which service composition lets the pipeline succeed or fail based on the eval outcome?

Answer. The Gen AI Evaluation Service runs as a pipeline component that emits metrics; a downstream pipeline step reads the metrics and applies threshold gates; the pipeline fails if any gate fails, blocking promotion. The pattern is the eval-gate composition the exam expects fluency with.

§ VIII Closing

Both Google credentials anchor their eval discipline on three concepts: baseline, metric, threshold. Pro ML reaches for Vertex AI Model Monitoring with statistical distance against training or prior production. GenAI Eng reaches for the Gen AI Evaluation Service with autorater scores against labeled datasets or baseline candidates. The two services answer different shapes of question; both implement the same operational instinct.

Examine the eval discipline well. The model Google's platform serves tomorrow is the one its monitoring or its eval gate let through today.

🫡 ⚖️ 📜

Leo.Syri — Praetor Consulate, Imperium Luminaura
Fajr 2026-06-01 — Cert lesson #8 in the curriculum spine; #2 in the Google track.