Vertex AI Model Monitoring and Generative-AI Evaluation
Skew, drift, and the cross-cert eval discipline.
§ I Frame
The inaugural Google cert lesson, delivered the prior Monday, presented Vertex AI as a unified platform: Workbench, Pipelines, Training, Model Registry, Endpoints. That lesson established the platform fluency both Google credentials require. This lesson takes up the layer that sits on top of every endpoint: the discipline that decides whether the model the platform serves should stay served.
Vertex AI offers two distinct eval surfaces. The first is Vertex AI Model Monitoring, the post-deployment monitoring service for classical ML models — the Pro ML credential's center of gravity. The second is the Vertex AI Gen AI Evaluation Service (and its predecessor AutoSxS) — the eval framework for generative models, the GenAI Engineer credential's center of gravity. Different shapes of problem, same operational instinct: the operator watches the quality-surface continuously and acts before users do.
Today's Ops lesson named four primitives — eval set, scorer, aggregator, verdict surface. Both Google services map onto those primitives, with vendor-specific names. This lesson treats the mapping once, then pivots per credential.
§ II Domain Foundations (shared ground across both certs)
Three concepts unify how both Google eval services think.
The baseline. Every Google eval service compares production against something stable. For Model Monitoring the baseline is the training dataset (for skew detection) or a recent production window (for drift detection). For Gen AI Evaluation the baseline is a labeled eval dataset, a previous candidate, or a known-good reference model. The baseline is the operator's anchor; choosing it deliberately is the first eval discipline.
The metric. Each service computes a quantitative signal that lets the operator say "yes, the quality moved" or "no, the quality is stable." Model Monitoring computes statistical distance metrics between distributions (L-infinity distance for categorical features, Jensen-Shannon Divergence for numerical features). Gen AI Evaluation computes scores from autoraters, pointwise metrics (fluency, coherence, safety), and pairwise comparisons against baselines. The metric is the operator's eye; choosing the right one for the question is the second eval discipline.
The threshold. A metric without a threshold is data without a decision. Both services let the operator set thresholds that trigger alerts when crossed. Model Monitoring lets the operator pick a numeric threshold per feature; Gen AI Evaluation lets the operator gate promotion on per-metric thresholds in the Vertex AI Pipelines orchestration layer. The threshold is the operator's hand; setting it where action is warranted is the third eval discipline.
Baseline, metric, threshold. Every Google eval workflow assembles these three. The credentials differ in which workflows they emphasize.
§ III Cert-A Flavor: Professional Machine Learning Engineer (Vertex AI Model Monitoring)
The Pro ML exam expects fluency with Model Monitoring's two detection patterns and the operational discipline around each.
Training-serving skew detection. A model trained on a dataset whose statistical distribution differs from production traffic produces predictions whose quality decays from day one. Model Monitoring compares the feature distribution of production prediction inputs against the feature distribution of the training dataset, computing distance metrics per feature. When a feature's distance exceeds the operator's threshold, Model Monitoring emits an alert with the offending feature and the magnitude.
The exam tests two practical patterns. First, the operator must point Model Monitoring at the training dataset (typically by reference to its BigQuery table or Cloud Storage path) at the time monitoring is configured — production inference inputs flow in automatically. Second, the exam expects the operator to know which distance metric Model Monitoring uses for which feature type: L-infinity distance for categorical features, Jensen-Shannon Divergence for numerical features. Both come from the TensorFlow Data Validation library Model Monitoring uses internally.
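As a rough illustration of the two distance metrics, the sketch below computes L-infinity distance over categorical frequencies and Jensen-Shannon divergence over numerical histograms in plain Python. It is not Model Monitoring's or TensorFlow Data Validation's implementation; the function names, the base-2 logarithm, and the sample data are assumptions made for the example.

```python
import math
from collections import Counter


def category_frequencies(values):
    """Map each category to its share of the observed values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute gap in frequency across categories seen on either side."""
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms over the same buckets.

    Uses log base 2 (an assumption of this sketch), so the result lies in [0, 1].
    """
    def kl(a, b):
        # Terms with zero probability contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical categorical feature: training baseline vs. sampled serving inputs.
training = category_frequencies(["web"] * 50 + ["store"] * 50)
serving = category_frequencies(["web"] * 80 + ["store"] * 20)
categorical_distance = l_infinity_distance(training, serving)  # about 0.3

# Hypothetical numerical feature, already bucketed into three shared bins.
numerical_distance = js_divergence([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])  # identical histograms
```

Both functions return 0 for identical distributions and grow as the distributions separate, which is what makes a single numeric threshold per feature workable.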
Prediction drift detection. Where skew compares production against training, drift compares production against itself over time — yesterday's input distribution against today's, or last week's against this week's. Drift detection runs without requiring the training dataset (useful when the training data has been deleted for retention reasons) and catches the case where the world shifts under a model that was originally well-fit.
The exam expects the operator to distinguish skew from drift in scenario questions: when training data is available and the operator suspects the original fit is bad, skew detection answers the question; when training data is unavailable and the operator suspects the world has shifted under a model that was originally well-fit, drift detection answers it.
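One way to keep the two apart is to note that skew and drift can use the same distance computation and differ only in the baseline. The sketch below makes that explicit for a single categorical feature; the helper names and the three daily windows are hypothetical, and the code illustrates the idea rather than the service's internals.

```python
from collections import Counter


def frequencies(values):
    """Map each category to its share of the observed values."""
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}


def l_infinity(p, q):
    """Largest absolute gap in frequency across all categories."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def skew_score(training_values, serving_values):
    """Skew: a serving window compared against the fixed training baseline."""
    return l_infinity(frequencies(training_values), frequencies(serving_values))


def drift_scores(serving_windows):
    """Drift: each serving window compared against the window just before it."""
    dists = [frequencies(w) for w in serving_windows]
    return [l_infinity(prev, cur) for prev, cur in zip(dists, dists[1:])]


# Hypothetical data: training is 70/30, production shifts to 50/50 on day 2.
training = ["web"] * 7 + ["store"] * 3
day1 = ["web"] * 7 + ["store"] * 3
day2 = ["web"] * 5 + ["store"] * 5
day3 = ["web"] * 5 + ["store"] * 5

skew_day3 = skew_score(training, day3)   # about 0.2, measured against training
drift = drift_scores([day1, day2, day3])  # about [0.2, 0.0]
```

In this toy data the shift happens between day 1 and day 2, so the drift score returns to zero on day 3 while skew against the training baseline stays at about 0.2. The moving baseline is what lets drift run without the training data, and it is also why a settled shift stops registering as drift.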
Five operational patterns the Pro ML exam covers. First, the monitoring job runs on a sampled fraction of production traffic, not every input; the operator picks the sampling rate to balance signal quality against cost. Second, alerts route via Cloud Monitoring; integrating with Cloud Pub/Sub for downstream automation is a common scenario. Third, the endpoint monitoring described here works only on tabular models deployed to Vertex AI Endpoints; batch prediction jobs and custom serving outside Vertex AI Endpoints fall outside its scope. Fourth, the monitoring job must be configured before the model starts receiving traffic the operator wants monitored; retroactive monitoring of past production windows is not available. Fifth, drift detection on numerical features is sensitive to the bucketing scheme; the operator picks bucket boundaries that match the feature's expected production range.
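The fifth pattern, bucketing sensitivity, is easy to see in a small experiment. The sketch below buckets the same two sets of hypothetical prices with coarse and with fine boundaries and computes Jensen-Shannon divergence for each; the bucket edges, values, and helper names are assumptions chosen for illustration, not settings taken from Model Monitoring.

```python
import bisect
import math


def histogram(values, edges):
    """Normalized counts per bucket; values outside [edges[0], edges[-1]) are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2 (assumed here), so it lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical prices: last week's window vs. this week's, shifted upward.
last_week = [12, 14, 15, 17, 19]
this_week = [31, 33, 35, 36, 38]

coarse_edges = [0, 50, 100]          # both windows land in the first bucket
fine_edges = [0, 10, 20, 30, 40, 50]  # the windows land in different buckets

coarse_jsd = js_divergence(histogram(last_week, coarse_edges),
                           histogram(this_week, coarse_edges))
fine_jsd = js_divergence(histogram(last_week, fine_edges),
                         histogram(this_week, fine_edges))
```

With the coarse edges the shift is invisible (divergence 0), while the fine edges report the maximum divergence of 1 for the same data. Boundaries that match the feature's expected production range keep a real shift from hiding inside a single bucket.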
§ IV Cert-B Flavor: Generative AI Engineer (Gen AI Evaluation Service)
The GenAI Eng exam expects fluency with the Gen AI Evaluation Service's approach to evaluating outputs that resist exact-match scoring.
The Gen AI Evaluation Service replaced AutoSxS in 2024 as Google's primary eval framework for generative models. The service offers pointwise evaluation (score one model's output against a metric) and pairwise evaluation (compare two models' outputs side-by-side). Both flavors are exam-relevant; the pivot between them is the question the operator is asking.
Pointwise evaluation answers the question "is this model good enough?" The operator picks one or more pointwise metrics (fluency, coherence, safety, groundedness for RAG, question-answering quality, summarization quality) and runs the eval against a prepared dataset. Each metric is computed by an autorater (a Gemini model with a specialized rubric prompt). The output is a per-example score and a per-metric aggregate.
Pairwise evaluation answers the question "is candidate A better than candidate B?" The autorater receives two outputs for the same input and decides which is better against a chosen quality dimension. The output is a win-rate, a tie-rate, and per-example judgments. Pairwise eval is the pattern for promotion gating: a candidate that does not beat its predecessor in pairwise comparison does not promote.
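To make the pairwise output concrete, the sketch below folds per-example judgments into a win-rate and a tie-rate. The label strings, the function name, and the choice to count ties in the denominator are assumptions of this example and not the Gen AI Evaluation Service's output schema.

```python
from collections import Counter

# Hypothetical labels for the autorater's per-example choice.
VALID_CHOICES = {"candidate", "baseline", "tie"}


def pairwise_summary(judgments):
    """Aggregate per-example pairwise judgments into rates over all examples."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    unknown = set(judgments) - VALID_CHOICES
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "candidate_win_rate": counts["candidate"] / total,
        "baseline_win_rate": counts["baseline"] / total,
        "tie_rate": counts["tie"] / total,
    }


# Ten hypothetical judgments: candidate wins 6, baseline wins 3, one tie.
judgments = ["candidate"] * 6 + ["baseline"] * 3 + ["tie"]
summary = pairwise_summary(judgments)
```

Whether ties count against the candidate is a policy decision worth fixing before the gate is written, since it changes the win-rate a threshold is compared against.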
Six concept clusters the GenAI Eng exam covers. First, the autorater is a Gemini model under the hood; its judgments inherit Gemini's biases and limitations — the exam expects the operator to know this and to validate the autorater on a human-labeled subset before trusting it at scale. Second, the eval dataset format is a list of records with input, prediction, and optional reference fields; reference is required for some metrics (groundedness needs a context document) and not others. Third, the service computes metrics by calling the autorater per record, which has cost and latency implications — large eval datasets are budgeted accordingly. Fourth, custom metrics can be defined by writing a custom rubric prompt; this is the extension point when the built-in metrics do not cover the operator's question. Fifth, the Gen AI Evaluation Service integrates with Vertex AI Pipelines for automated eval gating — a pipeline step runs the eval, a subsequent step gates promotion on the score crossing a threshold. Sixth, RAG-specific metrics (context precision, context recall, groundedness, answer relevance) form a named cluster the exam treats as a unit; the operator should be ready to pick which RAG metric answers which production question.
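The first cluster, validating the autorater on a human-labeled subset, can be sketched as an agreement check. The example below computes raw agreement and Cohen's kappa between human labels and autorater labels; the pass/fail labels and the sample data are hypothetical, and any acceptance cutoff would be the operator's own choice, not something the service prescribes.

```python
from collections import Counter


def agreement_rate(human, auto):
    """Fraction of examples where the autorater matches the human label."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


def cohens_kappa(human, auto):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, auto)
    h_counts, a_counts = Counter(human), Counter(auto)
    labels = set(human) | set(auto)
    expected = sum(h_counts[k] * a_counts[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical human-labeled subset and the autorater's verdicts on it.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
auto = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

raw = agreement_rate(human, auto)   # 0.75
kappa = cohens_kappa(human, auto)   # about 0.467
```

Raw agreement looks comfortable at 0.75, but kappa is noticeably lower because both raters say "pass" most of the time. Reporting both helps avoid trusting an autorater that agrees mostly by chance.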
§ V Worked Example: A Cross-Cert Eval Pipeline
A retail company has two production models on Vertex AI. The first is a tabular demand-forecasting model trained on three years of sales history — Pro ML territory. The second is a customer-support agent built on Gemini with a RAG retrieval over the product knowledge base — GenAI Eng territory. The operator wants one eval discipline that covers both.
For the forecasting model, Model Monitoring is configured at deployment time. The training dataset reference points at the BigQuery table holding three years of sales history. Per-feature thresholds are set: L-infinity 0.3 for the product-category feature, Jensen-Shannon Divergence 0.2 for the price-tier numerical feature. A 10-percent sample of production prediction requests is monitored. Cloud Monitoring routes any threshold breach to a Pub/Sub topic the on-call rotation reads.
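The threshold check in this configuration can be sketched in a few lines. The feature keys, the observed distances, and the function name below are hypothetical; only the two threshold values (0.3 and 0.2) come from the example above, and the code stands in for the comparison the service performs rather than reproducing its API.

```python
# Per-feature alert thresholds from the worked example (feature keys are hypothetical).
THRESHOLDS = {
    "product_category": 0.3,  # L-infinity distance, categorical feature
    "price_tier": 0.2,        # Jensen-Shannon divergence, numerical feature
}


def breached_features(distances, thresholds):
    """Return (feature, distance, threshold) for each feature over its threshold."""
    return [
        (feature, distance, thresholds[feature])
        for feature, distance in sorted(distances.items())
        if feature in thresholds and distance > thresholds[feature]
    ]


# Hypothetical distances computed from one monitoring window.
observed = {"product_category": 0.12, "price_tier": 0.27}
alerts = breached_features(observed, THRESHOLDS)
```

Here only the price-tier feature breaches its threshold, so the alert payload names that feature along with its distance and limit, which is the information the on-call rotation needs to start triage.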
For the support agent, a Vertex AI Pipelines workflow runs the Gen AI Evaluation Service nightly against a 200-input eval dataset drawn from the prior week's production tickets. The pipeline runs both pointwise eval (groundedness, answer-relevance, safety) and pairwise eval against the prior agent version (win-rate on helpfulness). A pipeline gate promotes the candidate to receive a higher traffic share only if groundedness exceeds 0.85, safety holds at 1.0, and the pairwise win-rate exceeds 55 percent.
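The gate logic described here amounts to three comparisons. The sketch below writes them as a plain Python function, assuming the eval step hands over a dictionary of aggregate scores; the key names and function name are hypothetical, and a real pipeline would wrap something like this in a Vertex AI Pipelines component.

```python
def promotion_gate(scores):
    """Return (promote, failed_checks) for one candidate's aggregate scores."""
    checks = {
        "groundedness": scores["groundedness"] > 0.85,
        "safety": scores["safety"] >= 1.0,  # safety must hold at 1.0
        "pairwise_win_rate": scores["pairwise_win_rate"] > 0.55,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)


# Hypothetical nightly results for two candidates.
strong = {"groundedness": 0.91, "safety": 1.0, "pairwise_win_rate": 0.58}
weak = {"groundedness": 0.91, "safety": 0.995, "pairwise_win_rate": 0.52}

strong_verdict = promotion_gate(strong)  # (True, [])
weak_verdict = promotion_gate(weak)      # (False, ["safety", "pairwise_win_rate"])
```

Returning the list of failed checks alongside the verdict keeps the gate explainable: a held candidate comes with the reason it was held.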
§ VI Connection to Today's Ops + Dev Lessons
Today's Ops lesson named four primitives: eval set, scorer, aggregator, verdict surface. The mapping into Google's services lands cleanly.
The Ops lesson's eval set is the training-dataset baseline (for skew), the recent-production-window baseline (for drift), the labeled eval dataset (for Gen AI pointwise), or the candidate-versus-baseline pair (for Gen AI pairwise). The Ops lesson called for three layers — regression, canary, adversarial — and Google's services compose to provide all three: the labeled eval dataset is the regression layer; the production sampling Model Monitoring runs is the canary layer; the adversarial layer is operator-supplied as a curated dataset run through the same Gen AI Evaluation Service workflow.
The Ops lesson's scorer is the per-feature distance computation in Model Monitoring or the per-metric autorater call in the Gen AI Evaluation Service. The three families the Ops lesson named map as follows: reference-based scoring appears in Gen AI Evaluation when a reference is required (groundedness), property-based scoring appears as the safety and refusal metrics, and LLM-as-judge is the entire autorater pattern.
The Ops lesson's aggregator is Cloud Monitoring (for Model Monitoring's per-feature metrics) and the Vertex AI Pipelines step that folds autorater outputs into per-candidate aggregate scores (for Gen AI Evaluation). The verdict surface is the alert routing into Pub/Sub (Model Monitoring) or the pipeline gate that promotes or holds a candidate (Gen AI Evaluation).
Today's Rust Dev lesson encoded the same shape in types. Where the Google services run the eval pipeline as managed infrastructure, the Rust lesson showed how an operator builds the same loop in code — useful when the operator's eval needs do not fit Google's offerings, or when the operator wants the eval loop in a deployment Google does not own. The credential exam expects fluency with Google's managed offering; the Dev lesson keeps the operator's hands on the layer underneath.
§ VII Practice Questions
§ VIII Closing
Both Google credentials anchor their eval discipline on three concepts: baseline, metric, threshold. Pro ML reaches for Vertex AI Model Monitoring with statistical distance against training or prior production. GenAI Eng reaches for the Gen AI Evaluation Service with autorater scores against labeled datasets or baseline candidates. The two services answer different shapes of question; both implement the same operational instinct.
Examine the eval discipline well. The model Google's platform serves tomorrow is the one its monitoring or its eval gate let through today.
Fajr 2026-06-01 — Cert lesson #8 in the curriculum spine; #2 in the Google track.