project · 2023-2024
Vehicle-telemetry silver-layer ETL
Refactored a vehicle-telemetry processing pipeline. Transforms raw nested JSON / protobuf from millions of in-car navigation clients into clean Delta tables that power navigation-quality dashboards. Designed for query simplicity: PMs answer 'route success rate in country X this week' with a single SELECT.
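As a sketch of that query simplicity (table and column names here are illustrative, not the real schema), the PM question reduces to one aggregate over flat rows:

```python
# Equivalent single SELECT against a hypothetical silver table:
#   SELECT AVG(CASE WHEN route_success THEN 1.0 ELSE 0.0 END)
#   FROM route_planning
#   WHERE country = 'DE' AND event_week = '2024-W10';
#
# The same computation in plain Python over already-flattened rows:
rows = [
    {"country": "DE", "route_success": True},
    {"country": "DE", "route_success": False},
    {"country": "FR", "route_success": True},
]
de = [r for r in rows if r["country"] == "DE"]
success_rate = sum(r["route_success"] for r in de) / len(de)
# success_rate == 0.5
```

The point is that the filter and the aggregate are one step each: no parsing, no joins, no engineering ticket.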
The pipeline ingests vehicle-navigation telemetry from millions of in-car clients; its output is the silver layer of TomTom’s medallion-architecture data lake: clean, queryable Delta tables that downstream PMs, ops engineers, and partner-facing dashboards all read from.
What was wrong with the old pipeline
The “before” was a Databricks pipeline that sort of worked but wasn’t sustainable:
- Raw data was unqueryable. Telemetry events arrived as deeply nested JSON / protobuf; a single trace contained dozens of nested arrays. A simple PM question like “route-planning success rate in Germany this week” required custom parsing code, written ad hoc each time. Hours of engineer time per question.
- The pipeline was expensive and brittle. It made multiple inefficient passes over the same data, had no clean separation between bronze (raw) and silver (cleaned), and schema drift broke jobs unpredictably.
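A minimal sketch of the flattening problem described above, using a hypothetical trace shape (the real telemetry fields differ): one nested trace explodes into one flat row per route-planning attempt, carrying only the columns a PM query needs.

```python
# Hypothetical nested trace; real protobuf-derived traces are much deeper.
trace = {
    "meta": {"country": "DE", "vehicle": {"type": "ev"}},
    "route_planning": {
        "attempts": [
            {"status": "success", "response_ms": 120},
            {"status": "failure", "response_ms": 900, "reason": "no_route"},
        ]
    },
}

def flatten(trace):
    """Explode one nested trace into flat rows, one per route-planning attempt."""
    meta = trace["meta"]
    for attempt in trace["route_planning"]["attempts"]:
        yield {
            "country": meta["country"],
            "vehicle_type": meta["vehicle"]["type"],
            "route_success": attempt["status"] == "success",
            "response_ms": attempt["response_ms"],
            "failure_reason": attempt.get("reason"),
        }

rows = list(flatten(trace))
# Two flat rows: one success, one failure with its reason attached.
```

In the actual pipeline this is PySpark (`explode` over arrays of structs) rather than a Python generator, but the shape of the transformation is the same.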
What I built
- Bronze → Silver transformation with explicit schema flattening for the high-cardinality fields PMs actually care about: route success/failure, response time, vehicle type, country, failure reason.
- Idempotent windowed processing with Delta Lake’s MERGE so reprocessing a day’s data doesn’t double-count.
- Schema-versioning baked into the pipeline so when upstream protobuf evolves, the silver layer stays backwards-compatible.
- Cost-aware retention tiering: hot Delta tables for the last 30 days, Parquet on ADLS Gen2 for warm data, cold archive after that. Same data, ~70% storage-cost reduction.
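The idempotency in the MERGE bullet can be sketched without Spark: keying the upsert on a stable event id means replaying a batch overwrites rows rather than appending them. The `event_id` key and row fields here are assumptions for illustration, not the real schema.

```python
silver = {}  # event_id -> row, standing in for a Delta table

def merge_batch(silver, batch):
    """Upsert rows by event_id, Delta-MERGE style: matched rows are
    overwritten, unmatched rows inserted, so replays never double-count."""
    for row in batch:
        silver[row["event_id"]] = row

batch = [
    {"event_id": "e1", "route_success": True},
    {"event_id": "e2", "route_success": False},
]
merge_batch(silver, batch)
merge_batch(silver, batch)  # reprocessing the same day's data
assert len(silver) == 2     # still two rows, no double-counting
```

In Delta Lake this is `MERGE INTO silver USING batch ON silver.event_id = batch.event_id`, with matched/not-matched clauses doing what the dictionary assignment does above.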
Architecture
in-car navigation client
        ↓  (MQTT)
telemetry backend
        ↓
Azure Event Hub
        ↓
ingest service
        ↓
Azure Data Lake (bronze, raw protobuf)
        ↓  ── MY WORK ──
Silver-layer ETL (Databricks / PySpark)
        ↓
Delta tables (silver: route_planning, connectivity, traffic_health, …)
        ↓
Grafana / partner dashboards / SLA reports

Why this matters for the business
Navigation telemetry isn’t a vanity metric; it’s contractual. OEM partners pay for navigation that meets agreed quality bars. If route success rates dip, TomTom needs to know within hours, not weeks. The silver layer is what makes that detection-and-explanation loop fast.
Why this earns a spot in projects
Data engineering work is invisible when it’s good; nobody compliments your ETL. But the bar for “good” silver-layer design is simple: can a non-technical PM answer their own question in one query, without engineering help? On that bar, this one shipped.
The silver layer is also what unlocks the DCP Guardian AI agent on top of it. The agent walks an event registry and metric catalog that only exist because this pipeline exists. Stable schemas in the data layer → reliable answers in the agent layer. They are the same project at two levels of abstraction.