tag

#data-engineering

10 items · 9 projects · 1 feed item

projects

project

Vehicle-telemetry silver-layer ETL

Refactored a vehicle-telemetry processing pipeline that transforms raw nested JSON / protobuf from millions of in-car navigation clients into clean Delta tables powering navigation-quality dashboards. Designed for query simplicity: PMs can answer 'route success rate in country X this week' with a single SELECT.
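The core of a silver layer like this is flattening nested events into flat, queryable rows. A minimal sketch of that step, assuming a hypothetical event layout and field names (not the pipeline's actual schema, which is not shown here):

```python
import json

# Hypothetical nested telemetry event, as an in-car client might emit it.
RAW_EVENT = json.dumps({
    "device": {"id": "nav-001", "country": "NL"},
    "route": {"requested": True, "completed": True, "duration_s": 1840},
    "ts": "2024-05-01T08:30:00Z",
})

def flatten(raw: str) -> dict:
    """Flatten one nested JSON event into a flat row for a silver table."""
    e = json.loads(raw)
    return {
        "device_id": e["device"]["id"],
        "country": e["device"]["country"],
        "route_completed": e["route"]["completed"],
        "duration_s": e["route"]["duration_s"],
        "event_ts": e["ts"],
    }

row = flatten(RAW_EVENT)
```

Once rows have this flat shape, 'route success rate in country X this week' really is a single SELECT with a filter and an aggregate.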

project

ETL template for API analytics pipelines

Refactored the inconsistent set of ETL pipelines feeding TomTom's API analytics into a single OOP template, distributed as an internal Python package. Engineers subclass the base, get live Azure Data Explorer connections for free, and only write the business logic in extract / transform / load. Result: consistent, version-controlled pipelines across volume, response-time, and error-rate use cases.
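The subclass-the-base pattern described above can be sketched with an abstract base class; the class and method names here are illustrative assumptions (the internal package's real base also manages Azure Data Explorer connections, omitted here):

```python
from abc import ABC, abstractmethod

class EtlPipeline(ABC):
    """Template base class: subclasses supply only the business logic."""

    def run(self):
        # The template method fixes the pipeline shape once, for every team.
        data = self.extract()
        data = self.transform(data)
        return self.load(data)

    @abstractmethod
    def extract(self): ...

    @abstractmethod
    def transform(self, data): ...

    @abstractmethod
    def load(self, data): ...

class VolumeReportPipeline(EtlPipeline):
    """Hypothetical subclass for an API-volume report."""

    def extract(self):
        return [{"api": "search", "calls": 120}, {"api": "routing", "calls": 80}]

    def transform(self, data):
        return {row["api"]: row["calls"] for row in data}

    def load(self, data):
        return data  # in reality: write to Azure Data Explorer
```

Because the base class owns `run()`, every pipeline built this way has the same shape, which is what makes them consistent and version-controllable as a set.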

project

Developer-portal analytics APIs

REST API layer that powers the analytics dashboards on developer.tomtom.com. Sits on top of an Azure Data Explorer (Kusto) backend that ingests every API call across TomTom's developer products. Volume reports, response-time percentiles, error-rate breakdowns, and per-product usage all flow through this layer.
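A response-time percentile report, as one of these endpoints might return, can be sketched in a few lines; the function name and shape are assumptions for illustration (the real layer computes aggregates in Kusto, not in Python):

```python
import statistics

def latency_percentiles(latencies_ms, points=(50, 95, 99)):
    """Compute response-time percentiles from a list of latency samples."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {f"p{p}": qs[p - 1] for p in points}
```

For example, `latency_percentiles([12, 15, 20, 300, ...])` yields a `{"p50": ..., "p95": ..., "p99": ...}` payload of the kind a dashboard widget consumes.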

project

Enterprise data lake on GCP, from scratch

Architected an enterprise data lake on Google Cloud (BigQuery, Cloud Storage, Cloud Composer / Airflow) at Oyo Vacation Homes. Designed a metadata-driven ETL framework in PySpark that cut new-pipeline development time by ~70%. Established schema-versioning and quality-validation patterns the team still uses.
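"Metadata-driven" means a pipeline is declared as data and a generic runner interprets it, so a new pipeline is mostly a new config rather than new code. A toy sketch of the idea, with the spec shape and step names as illustrative assumptions (the real framework ran on PySpark):

```python
# A pipeline declared as metadata: a source plus an ordered list of steps.
PIPELINE_SPEC = {
    "source": [
        {"city": "amsterdam", "bookings": 12},
        {"city": "lisbon", "bookings": 7},
    ],
    "steps": [
        {"op": "filter", "field": "bookings", "min": 10},
        {"op": "select", "fields": ["city"]},
    ],
}

def run_spec(spec):
    """Generic runner: interpret the spec's steps against its source rows."""
    rows = spec["source"]
    for step in spec["steps"]:
        if step["op"] == "filter":
            rows = [r for r in rows if r[step["field"]] >= step["min"]]
        elif step["op"] == "select":
            rows = [{f: r[f] for f in step["fields"]} for r in rows]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return rows
```

The ~70% cut in new-pipeline development time comes from exactly this split: the runner is written once, and each new pipeline is a spec.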

project

PySpark ETL optimisation for Citibank Singapore

Optimised long-running ETL pipelines for Citibank Singapore (TCS engagement). Cut DataStage job execution from 4 hours to 1 hour through Python multiprocessing and PySpark parallelisation. Designed real-time message handling with Kafka, plus Avro / Parquet conversion and HDFS compression strategies for the bank's machine-learning data pipelines.
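The 4-hours-to-1-hour win rests on a partition-and-parallelise pattern: split the input, process chunks concurrently, recombine. A self-contained sketch of that pattern (the real jobs used multiprocessing and PySpark; a thread pool and the stand-in `process_partition` keep this example portable):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    """Stand-in for one partition's worth of ETL work."""
    return [r * 2 for r in rows]

def run_parallel(rows, n_workers=4):
    """Split input into one chunk per worker and process chunks concurrently."""
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_partition, chunks)
    # Recombine the per-partition outputs into one result set.
    return [x for chunk in results for x in chunk]
```

When the per-partition work is independent, wall-clock time drops roughly in proportion to the worker count, which is the effect exploited here.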
