project Β· 2023-2024
Vehicle-telemetry silver-layer ETL
Refactored a vehicle-telemetry processing pipeline. Transforms raw nested JSON / protobuf from millions of in-car navigation clients into clean Delta tables that power navigation-quality dashboards. Designed for query simplicity: PMs answer 'route success rate in country X this week' with a single SELECT.
Refactored the data pipeline that processes vehicle-navigation telemetry from millions of in-car clients. The output is the silver layer of TomTomβs medallion-architecture data lake, clean, queryable Delta tables that downstream PMs, ops engineers, and partner-facing dashboards all read from.
The bar in one question
Ask a PM-shaped question, βWhat was the route-planning success rate in Germany last week?β, and see what answering it looked like before the refactor versus after.
# pyspark, against raw bronze protobuf-decoded JSON
df = spark.read.parquet("/bronze/telemetry/2024/W12")
# unwrap the nested envelope
df = df.select(
F.col("envelope.client_id").alias("client_id"),
F.col("envelope.country").alias("country"),
F.explode("envelope.events").alias("ev"),
)
# events is an array of unions; only some are RoutePlanningResultEvent
df = df.where(
F.col("ev.type")
== "tt.navigation.RoutePlanningResultEvent"
)
# the success flag is buried 3 levels deep, varies by schema version
df = df.withColumn(
"success",
F.when(F.col("ev.payload.v2.outcome") == "OK", 1)
.when(F.col("ev.payload.v1.result") == "success", 1)
.otherwise(0),
)
df = df.where(F.col("country") == "DE")
# now finally compute the rate
df.agg(
(F.sum("success") / F.count("*")).alias("rate")
).show()
-- against the silver-layer Delta table
SELECT
AVG(success::int) AS rate
FROM silver.route_planning
WHERE country = 'DE'
AND week = '2024-W12';
That delta is the whole point of a well-designed silver layer: a non-engineer can answer the question they actually have, in the tool they actually have. Hours of engineer time evaporate, replaced by a single SELECT.
What was wrong with the old pipeline
The βbeforeβ was a Databricks pipeline that did sort-of work but wasnβt sustainable:
- Raw data was unqueryable. Telemetry events arrived as deeply nested JSON / protobuf, a single trace contained dozens of nested arrays. A simple PM question like βroute-planning success rate in Germany this weekβ required custom parsing code, ad-hoc each time. Hours of engineer time per question.
- Pipeline was expensive and brittle. Multiple inefficient passes over the same data, no clean separation between bronze (raw) and silver (cleaned), schema drift broke jobs unpredictably.
What I built
- Bronze β Silver transformation with explicit schema flattening for the high-cardinality fields PMs actually care about, route success/failure, response time, vehicle type, country, failure reason.
- Idempotent windowed processing with Delta Lakeβs MERGE so reprocessing a dayβs data doesnβt double-count.
- Schema-versioning baked into the pipeline so when upstream protobuf evolves, the silver layer stays backwards-compatible.
- Cost-aware retention tiering: hot Delta tables for the last 30 days, parquet on ADLS Gen2 for warm, cold archive after that. Same data, ~70% cost reduction at the storage tier.
Architecture
in-car navigation client β MQTTtelemetry backend βAzure Event Hub βingest service βAzure Data Lake (bronze, raw protobuf) β ββ MY WORK ββSilver-layer ETL (Databricks / PySpark) βDelta tables (silver: route_planning, connectivity, traffic_health, β¦) βGrafana / partner dashboards / SLA reportsWhy this matters for the business
Navigation telemetry isnβt a vanity metric, itβs contractual. OEM partners pay for navigation that meets agreed quality bars. If route success rates dip, TomTom needs to know within hours, not weeks. The silver layer is what makes that detection-and-explanation loop fast.
Why this earns a spot in projects
Data engineering work is invisible when itβs good, nobody compliments your ETL. But the bar for βgoodβ silver-layer design is can a non-technical PM answer their own question in one query without engineering help? On that bar, this one shipped.
The silver layer is also what unlocks the DCP Guardian AI agent on top of it. The agent walks an event registry and metric catalog that only exist because this pipeline exists. Stable schemas in the data layer β reliable answers in the agent layer. They are the same project at two levels of abstraction.