post · 2025-09-12 · 9 min read
Building a programmatic data-quality platform that teams actually adopt
Most data-quality tooling fails the same way. A central data-platform team builds a checker, mandates it, and then watches as half the org routes around it because it does not fit their pipeline. The other half adopts it grudgingly and stops adding new checks the moment the maintainer leaves. Six months later, schema drift is again being discovered through “the dashboard looks weird.”
This post is the architectural pattern that avoids that failure mode, written tool-agnostic so the ideas are portable. Wherever I say a validation engine, a relational database, a distributed compute layer, or an artifact registry, you can substitute whatever your team already runs. The shape of the platform is what matters, not the brand of any one component.
The mistake most data-quality tools make
They centralise the data. They want every dataset to flow through some central validation service before it lands anywhere consumable. This is wrong for two reasons.
First, data movement is expensive and political. Pulling tenant data into a central location means a new storage tier, a new compliance review, and a new security boundary. By the time you finish, you have a separate platform team whose job is just to keep the central pipe running.
Second, it does not match how teams already work. Teams already have their pipelines, their workspaces, their service principals, their access patterns. A central service forces a second copy of all of that. Adoption stalls.
The fix: do not centralise data. Centralise contracts. Centralise the rules, the result history, the alerting, and the UI. Let the data stay where it lives.
The pattern, three planes
```
┌─ data plane ──────────────────────────────────┐
│  team A workspace         team B workspace    │
│  ┌────────────┐           ┌────────────┐      │
│  │ pipeline   │           │ pipeline   │      │
│  │  + SDK     │           │  + SDK     │      │
│  └─────┬──────┘           └─────┬──────┘      │
└────────┼────────────────────────┼─────────────┘
         │ POST results           │ POST results
         ▼                        ▼
┌─ control plane ───────────────────────────────┐
│  REST API + UI + fact/dim store + alerting    │
└───────────────────────────────────────────────┘
         ▲
         │ install SDK
┌─ distribution plane ──────────────────────────┐
│  versioned SDK packages, dev + release        │
└───────────────────────────────────────────────┘
```

Three independent concerns:
- Data plane: validation runs inside each team’s existing infrastructure. Their compute, their access, their data.
- Control plane: the central platform with the rules, history, UI, and alerting.
- Distribution plane: how the SDK gets to teams.
The thing that makes this pattern work is that each plane has clear ownership and a clear API to the next. Data plane only emits result records. Control plane only stores them and renders them. Distribution plane only ships versioned artefacts.
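To make that boundary concrete, here is a minimal sketch of the result record a data-plane run might emit. The field names mirror the fact/dim schema later in this post and are illustrative, not a fixed contract.

```python
# Minimal sketch of the data-plane → control-plane contract.
# Field names mirror the fact-table columns discussed later in the post;
# treat them as illustrative, not a fixed schema.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ValidationResult:
    check_id: str
    batch_id: str
    asset_id: str
    team_id: str
    run_at: str            # ISO-8601 timestamp
    passed: bool
    observed_value: float
    threshold: float
    severity: str          # e.g. "P1" .. "P3"

    def to_payload(self) -> dict:
        """Shape the record for a POST to the control plane."""
        return asdict(self)

result = ValidationResult(
    check_id="null_rate_customer_email",
    batch_id="2025-09-12-0300",
    asset_id="crm.customers",
    team_id="team-a",
    run_at=datetime.now(timezone.utc).isoformat(),
    passed=False,
    observed_value=0.08,
    threshold=0.05,
    severity="P2",
)
```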
The SDK pipeline
The SDK is the load-bearing piece for adoption. Make it small, stage-explicit, and easy to test. Four stages, each replaceable:
extractor → validator → results storer → alert builder

Extractor turns a config (a connection string, a path, a catalog reference) into an in-memory dataset (a DataFrame, a stream, whatever your compute layer speaks). Adding a new data source means adding a new extractor, nothing else. Hide the connection details behind a factory.
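A factory sketch of that idea, with hypothetical class names and config keys; the point is that a new source is one class plus one registry entry, and pandas stands in for whatever your compute layer actually is:

```python
from abc import ABC, abstractmethod
import pandas as pd

class Extractor(ABC):
    @abstractmethod
    def extract(self, config: dict) -> pd.DataFrame:
        """Turn a config dict into an in-memory dataset."""

class ParquetExtractor(Extractor):
    def extract(self, config: dict) -> pd.DataFrame:
        return pd.read_parquet(config["path"])

class SqlExtractor(Extractor):
    def extract(self, config: dict) -> pd.DataFrame:
        return pd.read_sql(config["query"], config["connection_string"])

# Registry: adding a new data source means one new class and one entry here.
_EXTRACTORS = {"parquet": ParquetExtractor, "sql": SqlExtractor}

def build_extractor(config: dict) -> Extractor:
    # The caller never sees connection details, only the config dict.
    return _EXTRACTORS[config["source_type"]]()
```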
Validator runs validation against the dataset and returns structured results. Pick a validation engine that has a healthy catalog of built-in checks and a clean extension point for custom rules; do not write your own validation primitive language unless you are sure you cannot buy one off the shelf.
Results Storer POSTs results to the control plane via a REST API. Make this asynchronous-friendly and retry-on-failure, because batch jobs run on flaky networks and a one-second blip should not lose a result.
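A minimal retry-friendly storer might look like this. The endpoint path and batch shape are assumptions; the retry policy is the standard requests/urllib3 one (urllib3 ≥ 1.26 for `allowed_methods`):

```python
# Sketch of a results storer that survives flaky networks.
# The /results endpoint is an assumption; swap in your control plane's API.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=5,
        backoff_factor=1.0,                       # 1s, 2s, 4s, ...
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"POST"}),
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def store_results(results: list[dict], api_url: str) -> None:
    # One POST per batch, not per row: keeps the control plane cheap
    # and the retry semantics simple.
    resp = make_session().post(f"{api_url}/results", json=results, timeout=10)
    resp.raise_for_status()
```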
Alert Builder filters results to threshold breaches and posts to a configured webhook. The reason this is its own stage, not bundled into the validator, is because alerting policy changes more often than validation logic. Splitting them lets you tweak message format without touching the validation core.
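A sketch of that split, assuming the result dicts from the storer above and a generic chat webhook that accepts a JSON `text` field:

```python
# Alert builder sketch: filter to breaches, post one message per breach.
# The message shape is an assumption; adapt to your chat tool's webhook format.
import requests

def build_alerts(results: list[dict], webhook_url: str, dashboard_url: str) -> None:
    breaches = [r for r in results if not r["passed"]]
    for r in breaches:
        text = (
            f"[{r['severity']}] {r['check_id']} on {r['asset_id']}: "
            f"observed {r['observed_value']} vs threshold {r['threshold']} "
            f"({dashboard_url}/assets/{r['asset_id']})"
        )
        requests.post(webhook_url, json={"text": text}, timeout=5)
```

Because this stage only consumes result dicts, the message format can change weekly without anyone touching the validator.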
Picking a validation engine
You have a few categories to choose between, and the right pick depends mostly on the shape of your data and your team’s habits:
| Style | Pick when |
|---|---|
| Mature batch-validation engine | You want a large catalog of pre-built checks, a clean subclass-based extension API for custom rules, and integrations with whatever distributed compute you run. Best for a multi-tenant platform validating large tables across teams. |
| Schema-first type-safe library | Your validation is mostly about declaring what columns and types should exist, not custom statistical checks. Best when teams are comfortable expressing rules as static type declarations. |
| Custom schema-and-rules format | Your data is event-shaped (one schema per event type), you want a single declarative file per event, and you do not need cross-row checks. |
The contributor model is what determines whether an engine scales: a new check should be a small, local change (subclass / function / config block) that drops into a registry directory and becomes usable by every team after the next release. If adding a new rule type means editing a core engine file, you have picked the wrong engine.
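One possible shape for that contributor model, using a hypothetical decorator-based registry rather than any specific engine's API; the names here are illustrative:

```python
# A rule is one small function that registers itself by name.
# Teams reference the name from their configs; nothing central changes.
from typing import Callable

RULE_REGISTRY: dict[str, Callable] = {}

def rule(name: str):
    def register(fn):
        RULE_REGISTRY[name] = fn
        return fn
    return register

@rule("null_rate_below")
def null_rate_below(df, column: str, threshold: float) -> dict:
    observed = float(df[column].isna().mean())
    return {"passed": observed <= threshold,
            "observed_value": observed,
            "threshold": threshold}

# A team's YAML config can now say:
#   checks:
#     - rule: null_rate_below
#       column: email
#       threshold: 0.05
```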
The fact/dim store is the platform
This is the architectural decision most teams skip and regret. Validation results in flat logs answer no useful question. They cannot tell you:
- “Null-rate trend on column X over the last 30 days.”
- “Which checks have failed more than 5% of runs this week.”
- “Per-team SLA compliance over the quarter.”
- “Schema-drift incidents per month, ranked by data asset.”
A dimensional schema in a relational database turns all of these into normal SQL:
```
fact_validation_results (check_id, batch_id, asset_id, team_id, run_at, passed, observed_value, threshold, severity)
dim_validation_checks   (check_id, name, type, parameters, version, owner_team)
dim_data_assets         (asset_id, source, location, owner_team, description)
dim_teams               (team_id, name, alert_webhook, sla_target)
```

Every dashboard, every alert, every SLA report rolls up from this fact table. The dim tables encode the rules and the taxonomy; the fact table encodes the history.
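The 30-day null-rate question, for example, becomes one aggregate over the fact table. A sketch, assuming the schema above; the date function is dialect-dependent (SQLite-style shown):

```python
# "Null-rate trend on column X over the last 30 days" as plain SQL.
# The check name is a placeholder; run the query with whatever client
# your relational database ships with, e.g. cursor.execute(NULL_RATE_TREND).
NULL_RATE_TREND = """
SELECT date(f.run_at)        AS run_date,
       AVG(f.observed_value) AS avg_null_rate
FROM fact_validation_results f
JOIN dim_validation_checks c USING (check_id)
WHERE c.name = 'null_rate_customer_email'
  AND f.run_at >= date('now', '-30 days')
GROUP BY date(f.run_at)
ORDER BY run_date;
"""
```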
A few rules of thumb for the fact table:
- Insert-only, never update. If a check fails twice today, store both runs. Never overwrite.
- Upsert the dimensions when configs change. A renamed check should keep the same `check_id` so historical results stay queryable.
- Index `run_at` and `team_id`, because the most common queries are time-windowed and team-scoped.
- Denormalise sparingly. Storing the threshold inside the fact row is worth the duplication, because thresholds change over time and you want the historical comparison to use the threshold that was active at run time.
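For the dimension upsert, the common `ON CONFLICT` form is enough to keep `check_id` stable across renames. A sketch, assuming a database that supports that syntax:

```python
# Upsert a check definition without ever changing its check_id, so
# historical fact rows keep joining to the same dimension row.
UPSERT_CHECK = """
INSERT INTO dim_validation_checks (check_id, name, type, parameters, version, owner_team)
VALUES (:check_id, :name, :type, :parameters, :version, :owner_team)
ON CONFLICT (check_id) DO UPDATE
SET name       = EXCLUDED.name,
    parameters = EXCLUDED.parameters,
    version    = EXCLUDED.version;
"""
```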
Two paths into the platform
Adoption doubles when you ship two interfaces over the same backend.
Path A, no-code UI. Team leads onboard a team, point at their workspace, pick rules from the catalog, set thresholds, configure an alert webhook. They never see code.
Path B, developer SDK. Engineers write a config in their notebook (YAML, TOML, whatever), instantiate the validator, run it against an in-memory dataset, and let the SDK push results to the same backend. They never see the UI unless they want to.
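Putting the stages together, a Path B notebook cell might look roughly like this, reusing the hypothetical sketches above (`build_extractor`, `RULE_REGISTRY`, `store_results`, `build_alerts`); the `checks.yaml` layout and config keys are assumptions:

```python
from datetime import datetime, timezone
import yaml

with open("checks.yaml") as f:
    config = yaml.safe_load(f)

df = build_extractor(config["source"]).extract(config["source"])     # extractor

results = []
for check in config["checks"]:                                        # validator
    params = {k: v for k, v in check.items() if k not in ("rule", "severity")}
    outcome = RULE_REGISTRY[check["rule"]](df, **params)
    results.append({
        **outcome,
        "check_id": check["rule"],
        "asset_id": config["asset_id"],
        "team_id": config["team_id"],
        "batch_id": config["batch_id"],
        "run_at": datetime.now(timezone.utc).isoformat(),
        "severity": check.get("severity", "P3"),
    })

store_results(results, config["api_url"])                             # results storer
build_alerts(results, config["alert_webhook"], config["dashboard_url"])  # alert builder
```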
Both paths land in the same fact table. Both alert through the same channels in the same format. The platform owner sees one dashboard with results from both, indistinguishable. This dual interface is the difference between a tool the data team uses and a tool the organisation uses.
Versioned distribution, no surprises
A platform SDK that consumers install needs a release process that nobody dreads. Two practices that pay back instantly:
Auto-version by commit prefix:
| Commit prefix | Version bump |
|---|---|
| `feat:` or `feature:` | minor |
| `BREAKING CHANGE:`, `feat!:`, `fix!:` | major |
| any other | patch |
A small CI step parses the commit, computes the next version, tags the release. Engineers never argue about version numbers; the commit message is the contract.
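That CI step can be a dozen lines. A sketch, assuming `vMAJOR.MINOR.PATCH` tags and the prefix rules in the table above:

```python
# Compute the next version from the last release tag and the latest
# commit message. Tag format and prefix rules are assumptions; match
# them to your own convention.
import re

def next_version(last_tag: str, commit_message: str) -> str:
    major, minor, patch = map(int, last_tag.lstrip("v").split("."))
    first_line = commit_message.splitlines()[0]
    if "BREAKING CHANGE:" in commit_message or re.match(r"^\w+!:", first_line):
        return f"{major + 1}.0.0"
    if re.match(r"^(feat|feature):", first_line):
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

assert next_version("v1.4.2", "feat: add quantile drift check") == "1.5.0"
assert next_version("v1.4.2", "fix!: drop legacy results endpoint") == "2.0.0"
assert next_version("v1.4.2", "docs: clarify webhook config") == "1.4.3"
```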
Dual-channel publishing: a dev build on every main merge, a release build on tagged releases. Teams pin to the release channel for production jobs and use the dev channel to test new checks. No surprise upgrades.
Custom rules as a first-class contribution
If the built-in checks are all teams ever use, your platform is too restrictive and they will route around you. Make the contribution path obvious. The mental model:
- A new rule is a small file (a class, a function, a config block) that drops into a registry directory.
- CI picks it up automatically and publishes the next dev build.
- The rule is now usable in any team’s config by name.
- A test suite mocks the underlying compute layer so the CI matrix does not need a full distributed-compute installation.
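On that last point, a rule test can run against a tiny in-memory frame instead of a live cluster. This sketch reuses the hypothetical `null_rate_below` rule from earlier, with pandas standing in for the real compute layer:

```python
# Unit test for a contributed rule: no cluster, no network, just a frame.
import pandas as pd

def test_null_rate_below_flags_breach():
    df = pd.DataFrame({"email": ["a@x.com", None, None, "b@x.com"]})
    outcome = null_rate_below(df, column="email", threshold=0.25)
    assert outcome["passed"] is False
    assert outcome["observed_value"] == 0.5
```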
If a rule contribution takes a junior engineer more than half a day end-to-end (write rule → write test → open PR → see it land in dev), you have made it too hard. Aim for a 30-minute contribution path.
Alerting that does not get muted
Alerts that fire on every check failure get muted. Once muted, they are dead. Some rules:
- Threshold-based, not boolean. “More than 5% of rows failed this check” beats “any row failed”.
- One alert per breach, not one per row. Aggregate inside the alert builder.
- Severity tiering. P1 (data is unusable) goes to a separate channel from P3 (small drift, worth a look). Otherwise everything blends.
- Rich format, not plain text. A message with check name, asset, observed-vs-threshold, and a link back to the dashboard saves the receiver three clicks of context-finding.
- Per-team destinations. Configure alerting at the team dimension, not globally. Teams own their own noise.
What to skip
A few things that look like they would help but cost more than they save:
- Do not build your own validation primitive language. Whatever validation engine you pick is your DSL. Wrapping it adds a layer to maintain forever.
- Do not centralise the data. Already covered, still worth repeating.
- Do not build a workflow scheduler. Use whatever your org already has for batch jobs. The platform triggers; it does not orchestrate.
- Do not make rules editable only by data engineers. If a non-engineer cannot add a null check through the UI, the platform has failed at its main job.
What good looks like
Six months in, you should be able to point at a single fact table and answer:
- How many checks are running across the org this week?
- Which data asset has the worst quality trend?
- Which team has contributed the most checks?
- What was the null rate on column X on this date in February?
If you can, the platform has succeeded. The data-quality conversation in your org has shifted from “the dashboard looks weird” to “let me pull last week’s null-rate trend.” That shift is the whole point.
Closing
Programmatic data quality at scale is not a checking library; it is a contract platform. The contract is what every team agrees on (rules, results schema, alerting destinations). The data stays where it lives. The SDK is small, the control plane is boring, the fact/dim store is the product. Build it that way and teams will adopt it without being told to.
The choice of validation engine, database, web framework, message format — each of those is a substitution that your team’s existing stack can answer for you. The shape of the platform is what is portable.