project · 2025-2026

Data Quality Platform, full-stack monitoring across Databricks teams

A multi-tenant data-quality platform: a Python SDK that runs inside each team's Databricks workspace, a FastAPI + Streamlit control plane, a fact/dim PostgreSQL store for results, threshold-based Slack alerting, and a no-code UI for non-engineers. Built on Great Expectations with a custom-expectations API for team-contributed rules.

A platform that lets every data team add validation to their pipelines without rolling their own framework. Teams either use the no-code UI, or pip install the SDK in a Databricks notebook and run validation against a Spark DataFrame. Either way, results land in the same fact/dim store and the same Slack alerting flow.

Sister project to the PPDA Data Validator. The validator was a small Pydantic library for one team. This is the platform that scaled the same idea to every team that consumes data, with a control plane, history, alerting, and a self-service UI.

What problem this solves

At scale, every team owns its own pipelines, its own Databricks workspace, its own SLAs. Without a platform, each team rolls its own validation in its own notebooks. Result: inconsistent rules, no shared catalog, no trend tracking, no alerting standards, and schema drift discovered three days late in a BI tool. The platform is the answer to “how do we scale data quality across many teams without forcing a central team to write everyone’s checks.”

The architecture, three planes

1. Data plane (validation runs inside each team’s existing Databricks). A versioned Python wheel is pip installed into team notebooks and jobs, where the SDK runs a four-stage validation pipeline.

Crucially, data never leaves the team’s workspace. Only validation results do. No data movement, no compliance overhead, no extra storage tier.
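To make the "only results leave" boundary concrete, here is a sketch of the kind of payload the SDK might send home. The field names are my assumption for illustration, not the platform's actual schema:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ValidationResult:
    # Hypothetical fields -- the real SDK's payload schema is not shown here.
    check_id: str          # ID the platform assigned when the check was registered
    batch_id: str          # identifies the DataFrame batch that was validated
    success: bool
    observed_value: float  # e.g. fraction of rows passing the check
    run_timestamp: str     # ISO 8601, UTC


result = ValidationResult(
    check_id="chk-042",
    batch_id="batch-2025-06-01",
    success=False,
    observed_value=0.93,
    run_timestamp="2025-06-01T02:05:11Z",
)

# Only this small JSON document crosses the workspace boundary -- never rows of data.
payload = json.dumps(asdict(result))
```

Note what is absent: no sample rows, no column values, no data. That is the whole compliance story in one payload.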

2. Control plane (the platform itself, on an Azure VM behind NGINX with Microsoft AAD via MSAL).

3. Distribution plane. The SDK auto-versions by commit prefix:

| Commit prefix | Bump |
| --- | --- |
| any other message | patch |
| `feat:` or `feature:` | minor |
| `BREAKING CHANGE:`, `feat!:`, `fix!:` | major |

Two channels: a dev wheel on every main merge, a release wheel on tagged releases. Both pushed to a private artifact registry. Teams pin versions and upgrade deliberately.
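The bump rules in the table reduce to a small classifier. A sketch of the idea, not the platform's actual release script:

```python
import re


def bump_for(commit_message: str) -> str:
    """Map a commit message to a semver bump, per the prefix table above."""
    first_line = commit_message.splitlines()[0] if commit_message else ""
    # Breaking changes win: either the footer marker or a bang prefix.
    if ("BREAKING CHANGE:" in commit_message
            or re.match(r"^(feat|fix)!:", first_line)):
        return "major"
    if re.match(r"^(feat|feature):", first_line):
        return "minor"
    return "patch"
```

This follows the Conventional Commits convention, so engineers who already write `feat:`/`fix:` messages get correct versioning for free.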

Two usage paths

The thing that actually makes it adopted: same platform, two interfaces.

No-code path (team leads, analysts, anyone non-engineering):

  1. Onboard team and members in the UI.
  2. Add Databricks workspace URL, service principal, secret scope.
  3. Pick rules from the 50+ built-in Great Expectations checks.
  4. Set thresholds and Slack webhook.
  5. Schedule via Databricks Jobs API, triggered from the backend.
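Step 5 maps onto the Databricks Jobs API 2.1 `jobs/create` endpoint. A sketch of the request body the backend might assemble; the field names follow the public Databricks REST API, but the specific job, notebook, and cluster values here are illustrative:

```python
def build_validation_job(notebook_path: str, cron: str) -> dict:
    """Assemble a Jobs API 2.1 `jobs/create` request body for a scheduled
    validation run. Task and cluster settings are illustrative, not the
    platform's actual configuration."""
    return {
        "name": "dq-platform-validation",
        "schedule": {
            "quartz_cron_expression": cron,  # e.g. nightly at 02:00
            "timezone_id": "UTC",
        },
        "tasks": [{
            "task_key": "run_checks",
            "notebook_task": {"notebook_path": notebook_path},
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
        }],
    }


# The backend would POST this to {workspace_url}/api/2.1/jobs/create,
# authenticated as the team's service principal.
job = build_validation_job("/Repos/dq/run_validation", "0 0 2 * * ?")
```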

Developer SDK path (data engineers writing pipeline code):

  1. pip install the SDK in a Databricks notebook.
  2. Author checks in YAML, validate the schema, register them via the platform API to get back check IDs.
  3. Run DeveloperValidator.validate(df) against a Spark DataFrame.
  4. Results POST back automatically. Same backend, same UI, same alerting.
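Steps 2–3 might look like this in a notebook. The check-config shape and the schema-validation helper are hypothetical illustrations; only `DeveloperValidator.validate` is named above:

```python
# A hypothetical YAML check definition, shown as the dict it parses to.
check_cfg = {
    "expectation": "expect_column_values_to_not_be_null",
    "column": "customer_id",
    "kwargs": {"mostly": 0.99},
}

REQUIRED_KEYS = {"expectation", "column"}


def validate_check_schema(cfg: dict) -> list[str]:
    """Return a list of schema errors; an empty list means well-formed."""
    errors = [f"missing key: {k}" for k in REQUIRED_KEYS - cfg.keys()]
    if not isinstance(cfg.get("kwargs", {}), dict):
        errors.append("kwargs must be a mapping")
    return errors


errors = validate_check_schema(check_cfg)
# With a clean schema, the config is registered via the platform API to get
# a check_id back, then DeveloperValidator.validate(df) runs it on a Spark
# DataFrame and POSTs results to the same backend as the no-code path.
```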

A single dashboard shows results from both paths side by side, so the platform owner sees one cross-team view.

The fact/dim store is load-bearing

Validation results in flat logs cannot answer the questions a data org actually has: is quality for this asset trending up or down, which checks fail most often, which teams are meeting their SLAs.

A fact table (fact_validation_results) keyed by (check_id, batch_id, timestamp) plus dim tables (dim_validation_checks, dim_data_assets, dim_teams) turns all of these into normal SQL. Every Streamlit chart, every alert threshold, every SLA report rolls up from here. Without this design choice, you have logs. With it, you have a data-quality data product.
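A sketch of the star schema and one rollup it enables. The stated key `(check_id, batch_id, timestamp)` and table names come from above; the other column names are my guesses, and it runs on SQLite here for brevity where the platform uses PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_teams (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE dim_validation_checks (
        check_id INTEGER PRIMARY KEY, check_name TEXT, team_id INTEGER);
    CREATE TABLE fact_validation_results (
        check_id INTEGER, batch_id TEXT, run_ts TEXT, success INTEGER);
""")
conn.execute("INSERT INTO dim_teams VALUES (1, 'payments')")
conn.execute("INSERT INTO dim_validation_checks VALUES (10, 'no_null_ids', 1)")
conn.executemany(
    "INSERT INTO fact_validation_results VALUES (?, ?, ?, ?)",
    [(10, "b1", "2025-06-01", 1), (10, "b2", "2025-06-02", 0)],
)

# Pass rate per team -- the kind of SLA rollup flat logs cannot give you.
row = conn.execute("""
    SELECT t.team_name, AVG(f.success) AS pass_rate
    FROM fact_validation_results f
    JOIN dim_validation_checks c ON c.check_id = f.check_id
    JOIN dim_teams t ON t.team_id = c.team_id
    GROUP BY t.team_name
""").fetchone()
```

Every dashboard chart and alert threshold is some variant of this join.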

Extensibility, custom rules

When the 50+ built-in checks are not enough, teams contribute new rules by subclassing Great Expectations base classes:

```python
from great_expectations.expectations.expectation import ColumnMapExpectation


class ExpectColumnValuesToMatchPostalCode(ColumnMapExpectation):
    """Postal code matches a country-specific regex."""

    map_metric = "column_values.match_regex"
    success_keys = ("country",)
    default_kwarg_values = {"country": "NL"}
```

The new expectation drops into a registry directory, gets referenced by name in YAML configs, and is available to every team after the next dev wheel publishes. CI runs the test suite (no Spark installation required, PySpark is mocked) before merging.
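Because the mapped logic is a plain regex, unit tests can exercise it without any Spark installation. A sketch; the NL pattern here is illustrative, not necessarily the platform's actual rule:

```python
import re

# Illustrative country-specific patterns the custom expectation might carry.
POSTAL_CODE_PATTERNS = {
    "NL": r"\d{4}\s?[A-Z]{2}",  # e.g. "1234 AB"
}


def matches_postal_code(value: str, country: str = "NL") -> bool:
    """The pure-Python core a CI test can hit directly, no Spark needed."""
    return re.fullmatch(POSTAL_CODE_PATTERNS[country], value) is not None
```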

Stack

My role

I architected this end-to-end: stakeholder requirements gathering, conceptual design, technical design, system boundaries. The implementation was delivered by junior engineers on the team under my technical guidance, code review, and design oversight.

Why this earns a spot in projects

The thing that makes it real, not a toy: the platform does not centralise data. It centralises contracts. Each team keeps owning their workspace, their service principal, their data. The platform owns the rules, the results history, the alerts, and the UI. Centralising data is a multi-year migration; centralising contracts is a pip install. That’s why teams actually adopt it.
