project · 2025-2026
Data Quality Platform, full-stack monitoring across Databricks teams
A multi-tenant data-quality platform: a Python SDK that runs inside each team's Databricks workspace, a FastAPI + Streamlit control plane, a fact/dim PostgreSQL store for results, threshold-based Slack alerting, and a no-code UI for non-engineers. Built on Great Expectations with a custom-expectations API for team-contributed rules.
A platform that lets every data team add validation to their pipelines without rolling their own framework. Teams either use the no-code UI, or pip install the SDK in a Databricks notebook and run validation against a Spark DataFrame. Either way, results land in the same fact/dim store and the same Slack alerting flow.
Sister project to the PPDA Data Validator. The validator was a small Pydantic library for one team. This is the platform that scaled the same idea to every team that consumes data, with a control plane, history, alerting, and a self-service UI.
What problem this solves
At scale, every team owns its own pipelines, its own Databricks workspace, its own SLAs. Without a platform, each team rolls its own validation in its own notebooks. Result: inconsistent rules, no shared catalog, no trend tracking, no alerting standards, and schema drift discovered three days late in a BI tool. The platform is the answer to “how do we scale data quality across many teams without forcing a central team to write everyone’s checks.”
The architecture, three planes
1. Data plane (validation runs inside each team’s existing Databricks). A versioned Python wheel is pip installed into team notebooks and jobs. The SDK pipeline has four stages:
- Data Extractor: pulls data from Azure Blob (Parquet, CSV, JSON) or Databricks Unity Catalog into a Spark DataFrame.
- GX Validator: builds a Great Expectations suite from config and runs `validate_all()` against the DataFrame.
- Results Storer: POSTs structured results back to the central backend.
- Alert Builder: filters threshold breaches, formats Block Kit blocks, posts to the team’s Slack webhook.
Crucially, data never leaves the team’s workspace. Only validation results do. No data movement, no compliance overhead, no extra storage tier.
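To make the flow concrete, here is a minimal sketch of the four stages chained in a notebook. The `dq_sdk` package name and every class, method, and parameter here are illustrative assumptions, not the SDK's actual API:

```python
# Illustrative sketch only: dq_sdk, its class names, and their signatures are
# assumptions made for this example, not the real SDK surface.
from pyspark.sql import SparkSession
from dq_sdk import DataExtractor, GXValidator, ResultsStorer, AlertBuilder

spark = SparkSession.builder.getOrCreate()

# Stage 1: pull the batch into a Spark DataFrame; data stays in the workspace.
df = DataExtractor(source="unity_catalog", table="sales.orders").extract(spark)

# Stage 2: build a Great Expectations suite from the registered config and validate.
results = GXValidator(check_ids=["orders_not_null", "orders_schema"]).validate_all(df)

# Stage 3: only structured results leave the workspace, POSTed to the control plane.
ResultsStorer(api_url="https://dq-platform.example.com/api").post(results)

# Stage 4: threshold breaches become Slack Block Kit messages on the team webhook.
AlertBuilder(webhook_url="https://hooks.slack.com/services/T000/B000/XXX").alert(
    results, threshold=0.95
)
```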
2. Control plane (the platform itself, on an Azure VM behind NGINX with Microsoft AAD via MSAL).
- FastAPI backend with JWT, modular routes (results, validations, teams, alerts, data sources), a Databricks Jobs API client for triggering remote runs, and a PostgreSQL connection pool.
- Streamlit frontend for the no-code path: configure data sources, assets, batch definitions, and checks through forms.
- PostgreSQL holds dimensions (teams, data assets, validation checks) and facts (validation results keyed by check, batch, timestamp).
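As a sketch of the backend's shape, a results-ingestion route might look like the following. The route path, the Pydantic model, the JWT dependency, and the pool helper are all assumptions for illustration, not the platform's actual code:

```python
# Sketch of a results-ingestion route; paths, models, and helpers are assumptions.
from datetime import datetime

from fastapi import APIRouter, Depends
from pydantic import BaseModel

from .auth import verify_jwt  # hypothetical JWT-verification dependency
from .db import get_pool      # hypothetical asyncpg pool accessor

router = APIRouter(prefix="/results", tags=["results"])


class ValidationResult(BaseModel):
    check_id: int
    batch_id: str
    run_timestamp: datetime
    success: bool
    observed_value: float | None = None


@router.post("/", dependencies=[Depends(verify_jwt)])
async def store_results(results: list[ValidationResult]):
    """Insert one fact row per check result, keyed by (check_id, batch_id, timestamp)."""
    pool = get_pool()
    async with pool.acquire() as conn:
        await conn.executemany(
            """INSERT INTO fact_validation_results
               (check_id, batch_id, run_timestamp, success, observed_value)
               VALUES ($1, $2, $3, $4, $5)""",
            [
                (r.check_id, r.batch_id, r.run_timestamp, r.success, r.observed_value)
                for r in results
            ],
        )
    return {"stored": len(results)}
```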
3. Distribution plane. The SDK auto-versions by commit prefix:
| Commit prefix | Bump |
|---|---|
| `BREAKING CHANGE:`, `feat!:`, `fix!:` | major |
| `feat:` or `feature:` | minor |
| any other message | patch |
Two channels: a dev wheel on every main merge, a release wheel on tagged releases. Both pushed to a private artifact registry. Teams pin versions and upgrade deliberately.
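The bump rule itself is small enough to sketch. This is one plausible reading of the convention in the table above, not the platform's actual CI step:

```python
# Sketch of the commit-prefix bump rule; the real CI implementation may differ.
import re


def bump_kind(commit_message: str) -> str:
    """Map a commit message to a semver bump, per the convention above."""
    first_line = commit_message.splitlines()[0]
    if "BREAKING CHANGE:" in commit_message or re.match(r"^(feat|fix)!:", first_line):
        return "major"
    if re.match(r"^(feat|feature):", first_line):
        return "minor"
    return "patch"


assert bump_kind("feat: add postal-code check") == "minor"
assert bump_kind("fix!: drop legacy results schema") == "major"
assert bump_kind("docs: update README") == "patch"
```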
Two usage paths
The thing that actually makes it adopted: same platform, two interfaces.
No-code path (team leads, analysts, anyone non-engineering):
- Onboard team and members in the UI.
- Add Databricks workspace URL, service principal, secret scope.
- Pick rules from the 50+ built-in Great Expectations checks.
- Set thresholds and Slack webhook.
- Schedule via Databricks Jobs API, triggered from the backend.
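The scheduling step in the last bullet maps onto the Databricks Jobs `run-now` endpoint. A minimal sketch, assuming the backend already holds a job ID and a service-principal token for the team:

```python
# Sketch of how the backend might trigger a scheduled validation run via the
# Databricks Jobs API. Workspace URL, token source, and job_id are assumptions.
import requests


def trigger_validation_run(workspace_url: str, token: str, job_id: int) -> int:
    """Trigger an existing Databricks job and return the run_id."""
    resp = requests.post(
        f"{workspace_url}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]
```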
Developer SDK path (data engineers writing pipeline code):
- `pip install` the SDK in a Databricks notebook.
- Author checks in YAML, validate the schema, register them via the platform API to get back check IDs.
- Run `DeveloperValidator.validate(df)` against a Spark DataFrame.
- Results POST back automatically. Same backend, same UI, same alerting.
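A sketch of that developer flow; the YAML shape, the registration endpoint, and the `DeveloperValidator` signature are assumptions based on the description above:

```python
# Illustrative sketch of authoring and registering a check; the YAML fields and
# API endpoint are assumptions, not the platform's documented contract.
import requests
import yaml

check_yaml = """
check_name: orders_amount_not_null
expectation: expect_column_values_to_not_be_null
column: amount
threshold: 0.99
"""

# Parse the YAML and register it centrally to get a check ID back.
check = yaml.safe_load(check_yaml)
resp = requests.post(
    "https://dq-platform.example.com/api/validations",  # hypothetical endpoint
    json=check,
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
check_id = resp.json()["check_id"]

# Later, inside the pipeline notebook:
# results = DeveloperValidator(check_ids=[check_id]).validate(df)
```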
A single dashboard shows results from both paths side by side, so the platform owner sees one cross-team view.
The fact/dim store is load-bearing
Validation results in flat logs cannot answer the questions a data org actually has:
- Null-rate trend on column X over the last 30 days.
- Which checks have failed more than 5% of runs this week.
- Per-team SLA compliance over the quarter.
- Schema-drift incidents per month, ranked by data asset.
A fact table (`fact_validation_results`) keyed by (check_id, batch_id, timestamp) plus dim tables (`dim_validation_checks`, `dim_data_assets`, `dim_teams`) turns all of these into ordinary SQL. Every Streamlit chart, every alert threshold, every SLA report rolls up from here. Without this design choice, you have logs. With it, you have a data-quality data product.
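For example, the first question above (null-rate trend over 30 days) becomes a routine aggregate over the fact table. Only the table names come from the text; column names like `observed_value` and `check_name` are assumed for illustration:

```python
# Sketch of one of the questions above as SQL over the assumed fact/dim schema.
import psycopg2

QUERY = """
SELECT date_trunc('day', f.run_timestamp) AS day,
       avg(1 - f.observed_value)          AS null_rate  -- assumes observed_value = non-null fraction
FROM fact_validation_results f
JOIN dim_validation_checks c ON c.check_id = f.check_id
WHERE c.check_name = 'orders_amount_not_null'
  AND f.run_timestamp >= now() - interval '30 days'
GROUP BY 1
ORDER BY 1;
"""

with psycopg2.connect("dbname=dq user=dq") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for day, null_rate in cur.fetchall():
        print(day, null_rate)
```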
Extensibility, custom rules
When the 50+ built-in checks are not enough, teams contribute new rules by subclassing Great Expectations base classes:
```python
from great_expectations.expectations.expectation import ColumnMapExpectation


class ExpectColumnValuesToMatchPostalCode(ColumnMapExpectation):
    """Postal code matches a country-specific regex."""

    # Reuse the built-in regex map metric; the `country` kwarg selects the pattern.
    map_metric = "column_values.match_regex"
    success_keys = ("country",)
    default_kwarg_values = {"country": "NL"}
```

The new expectation drops into a registry directory, is referenced by name in YAML configs, and becomes available to every team after the next dev wheel publishes. CI runs the test suite (no Spark installation required; PySpark is mocked) before merging.
Stack
- Validation engine: Great Expectations core API.
- Compute: PySpark on Databricks (team-owned workspaces).
- Backend: FastAPI, JWT, JSON Schema validation on every route.
- Frontend: Streamlit, Microsoft MSAL for SSO.
- Storage: PostgreSQL fact/dim, with connection pooling.
- Alerting: Slack Block Kit via webhook per team.
- Distribution: versioned wheels, auto-bumped by commit convention, dual channel (dev / release).
- Deployment: Docker Compose on an Azure VM behind NGINX, AAD-authenticated.
- Tooling: `uv` for dependency resolution, pytest, GitHub Actions for CI/CD.
My role
I architected this end-to-end. Stakeholder requirements gathering, conceptual design, technical design, system boundaries. The implementation was delivered by junior engineers on the team under my technical guidance, code review, and design oversight. Specifically I owned:
- Stakeholder requirements: discovery sessions with each consuming team to understand their actual data-quality pain (schema drift, late detection, inconsistent alerting), translated into a single platform spec rather than a per-team grab-bag.
- Conceptual design: the three-plane model (data plane runs in tenant Databricks, control plane owns rules + history + alerting, distribution plane ships versioned SDK). The thesis “centralise contracts, not data” is the design choice that made the platform adoptable without forcing data migrations.
- Technical design: the four-stage SDK contract (extractor → validator → results storer → alert builder), the fact/dim PostgreSQL schema with upsert semantics on `dim_validation_checks` (so check definitions can evolve without losing history), the custom-expectations contribution flow, the commit-prefix auto-versioning convention, and the developer-SDK path.
- Engineering leadership: design reviews, technical oversight on PRs, mentoring the junior engineers shipping the implementation, unblocking decisions where the spec needed sharpening in flight.
Why this earns a spot in projects
The thing that makes it real, not a toy: the platform does not centralise data. It centralises contracts. Each team keeps owning their workspace, their service principal, their data. The platform owns the rules, the results history, the alerts, and the UI. Centralising data is a multi-year migration; centralising contracts is a pip install. That’s why teams actually adopt it.