project · 2019-2021

DIME 2.0, Cloud-native data quality platform

Lead engineer on an enterprise data-quality platform with a pluggable connector framework for profiling, validation, and reconciliation across heterogeneous sources. Adopted across multiple internal customer engagements.

#data-engineering #data-quality #plugins #distributed-systems #pyspark

DIME 2.0 is a cloud-native, distributed data-quality platform for enterprise data lakes. As lead engineer I designed the pluggable connector framework: each backend (SQL Server, BigQuery, Azure Synapse, Postgres, Delta Lake, S3 / ADLS) is a thin adapter implementing a common Connector interface, so profiling, validation, and reconciliation rules run through the same orchestrator regardless of source.
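A minimal sketch of what such an adapter contract could look like. The method names (`columns`, `read_rows`), the `InMemoryConnector` stand-in, and the `null_density` step are hypothetical and chosen for illustration; they are not DIME's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


class Connector(ABC):
    """Hypothetical common interface; method names are illustrative only."""

    @abstractmethod
    def columns(self, table: str) -> List[str]:
        """Return the column names of a table."""

    @abstractmethod
    def read_rows(self, table: str) -> Iterator[Row]:
        """Yield the rows of a table as dictionaries."""


class InMemoryConnector(Connector):
    """Toy adapter standing in for a real backend such as Postgres."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        rows = self._tables[table]
        return list(rows[0].keys()) if rows else []

    def read_rows(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])


def null_density(connector: Connector, table: str) -> Dict[str, float]:
    """Source-agnostic profiling step: fraction of null values per column."""
    cols = connector.columns(table)
    nulls = {c: 0 for c in cols}
    total = 0
    for row in connector.read_rows(table):
        total += 1
        for c in cols:
            if row.get(c) is None:
                nulls[c] += 1
    return {c: (nulls[c] / total if total else 0.0) for c in cols}


source = InMemoryConnector({
    "customers": [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
    ]
})
print(null_density(source, "customers"))
```

Because `null_density` only talks to the abstract interface, swapping in another backend means writing a new subclass and leaving the profiling code untouched.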

What it does

Profiling: distribution stats, null density, cardinality, inferred PII fields per column.
Validation: declarative YAML rules (range, regex, foreign-key, business-rule); rule packs reusable across pipelines.
Reconciliation: row-level diff between source/target post-migration, with sampling strategies for tables too large for full scans.
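One way the sampled row-level diff could work is to pick rows by hashing their key, so that source and target independently select the same subset. The sketch below assumes that approach; the function names, the `sample_pct` parameter, and the in-memory row lists are hypothetical and stand in for whatever the platform actually does at scale.

```python
import hashlib
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def in_sample(key: Any, sample_pct: int) -> bool:
    """Deterministically select roughly sample_pct percent of keys.

    Hashing the key (rather than sampling at random) guarantees that
    both sides of the migration pick exactly the same rows.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < sample_pct


def reconcile(
    source: Iterable[Row],
    target: Iterable[Row],
    key: str,
    sample_pct: int = 100,
) -> Dict[str, List[Any]]:
    """Row-level diff over the sampled keys of source and target."""
    src = {r[key]: r for r in source if in_sample(r[key], sample_pct)}
    tgt = {r[key]: r for r in target if in_sample(r[key], sample_pct)}
    return {
        "missing_in_target": sorted(k for k in src if k not in tgt),
        "unexpected_in_target": sorted(k for k in tgt if k not in src),
        "mismatched": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }


source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
target_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},  # value drifted during migration
    {"id": 4, "amount": 40},  # row that never existed in the source
]
result = reconcile(source_rows, target_rows, key="id")
print(result)
```

With `sample_pct=100` this is a full diff; lowering it trades coverage for scan cost while keeping the comparison consistent across both sides.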

Design highlights

Pluggable connector pattern. Adding a new database backend takes roughly 150 lines of adapter code and no orchestrator changes. A team migrating from Oracle to BigQuery wired up DIME for both sides in an afternoon.
Distributed execution via Spark for tables that exceed a single-node compute budget. Profiling that took 6+ hours on a single VM dropped to minutes on a Spark cluster.
Rule packs as code: versioned in git, code-reviewed like any other artefact. Audit trail comes for free.
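To illustrate the rule-pack idea, here is a small sketch of a declarative pack and an evaluator for two of the rule types (range and regex). The pack is written as the Python dictionary a YAML file might parse into; the field names (`name`, `type`, `column`, `min`, `max`, `pattern`) and the `validate` function are assumptions for illustration, not the platform's real schema.

```python
import re
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Dict[str, Any]


def check_range(value: Any, rule: Rule) -> bool:
    """Pass when the value is present and within [min, max]."""
    return value is not None and rule["min"] <= value <= rule["max"]


def check_regex(value: Any, rule: Rule) -> bool:
    """Pass when the value is a string fully matching the pattern."""
    return isinstance(value, str) and re.fullmatch(rule["pattern"], value) is not None


# Dispatch table: supporting a new rule type means adding one entry here.
CHECKS: Dict[str, Callable[[Any, Rule], bool]] = {
    "range": check_range,
    "regex": check_regex,
}


def validate(rows: List[Row], rule_pack: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Apply every rule to every row and collect the failures."""
    failures = []
    for index, row in enumerate(rows):
        for rule in rule_pack["rules"]:
            check = CHECKS[rule["type"]]
            if not check(row.get(rule["column"]), rule):
                failures.append({"row": index, "rule": rule["name"]})
    return failures


# Hypothetical pack, shaped like the result of parsing a YAML rule file.
RULE_PACK = {
    "version": "1.0.0",
    "rules": [
        {"name": "age_in_range", "type": "range", "column": "age",
         "min": 0, "max": 120},
        {"name": "email_format", "type": "regex", "column": "email",
         "pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
    ],
}

rows = [
    {"age": 34, "email": "a@example.com"},
    {"age": 150, "email": "not-an-email"},
]
failures = validate(rows, RULE_PACK)
print(failures)
```

Since the pack is plain data, a change to a threshold or pattern shows up as an ordinary diff in version control, which is where the audit trail comes from.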

Stack

Python · PySpark · Apache Airflow · Azure / AWS · pluggable connector framework.
