project · 2019-2021
DIME 2.0, Cloud-native data quality platform
Lead engineer on an enterprise data-quality platform with a pluggable connector framework for profiling, validation, and reconciliation across heterogeneous sources. Adopted across multiple internal customer engagements.
DIME 2.0 is a cloud-native distributed data-quality platform for enterprise data lakes. As lead engineer I designed the pluggable connector framework, each backend (SQL Server, BigQuery, Azure Synapse, Postgres, Delta Lake, S3 / ADLS) is a thin adapter implementing a common Connector interface; profiling, validation, and reconciliation rules run via the same orchestrator regardless of source.
What it does
- Profiling: distribution stats, null density, cardinality, inferred PII fields per column.
- Validation: declarative YAML rules (range, regex, foreign-key, business-rule); rule packs reusable across pipelines.
- Reconciliation: row-level diff between source/target post-migration, with sampling strategies for tables too large for full scans.
Design highlights
- Pluggable connector pattern. Adding a new database backend = ~150 lines of adapter code, no orchestrator changes. A team migrating from Oracle to BigQuery wired up DIME for both sides in an afternoon.
- Distributed execution via Spark for tables that exceed a single-node compute budget. Profiling that took 6+ hours on a single VM dropped to minutes on a Spark cluster.
- Rule packs as code: versioned in git, code-reviewed like any other artefact. Audit trail comes for free.
Stack
Python · PySpark · Apache Airflow · Azure / AWS · pluggable connector framework.