project Β· 2023-2024

Vehicle-telemetry silver-layer ETL

Refactored a vehicle-telemetry processing pipeline. Transforms raw nested JSON / protobuf from millions of in-car navigation clients into clean Delta tables that power navigation-quality dashboards. Designed for query simplicity: PMs answer 'route success rate in country X this week' with a single SELECT.

Refactored the data pipeline that processes vehicle-navigation telemetry from millions of in-car clients. The output is the silver layer of TomTom’s medallion-architecture data lake, clean, queryable Delta tables that downstream PMs, ops engineers, and partner-facing dashboards all read from.

The bar in one question

Ask a PM-shaped question, β€œWhat was the route-planning success rate in Germany last week?”, and see what answering it looked like before the refactor versus after.

before~50 lines, 1+ hour of engineer time per question
# pyspark, against raw bronze protobuf-decoded JSON
df = spark.read.parquet("/bronze/telemetry/2024/W12")

# unwrap the nested envelope
df = df.select(
F.col("envelope.client_id").alias("client_id"),
F.col("envelope.country").alias("country"),
F.explode("envelope.events").alias("ev"),
)
# events is an array of unions; only some are RoutePlanningResultEvent
df = df.where(
F.col("ev.type")
  == "tt.navigation.RoutePlanningResultEvent"
)

# the success flag is buried 3 levels deep, varies by schema version
df = df.withColumn(
"success",
F.when(F.col("ev.payload.v2.outcome") == "OK", 1)
 .when(F.col("ev.payload.v1.result") == "success", 1)
 .otherwise(0),
)
df = df.where(F.col("country") == "DE")

# now finally compute the rate
df.agg(
(F.sum("success") / F.count("*")).alias("rate")
).show()
after3 lines, < 1 minute, no engineering needed
-- against the silver-layer Delta table
SELECT
AVG(success::int) AS rate
FROM silver.route_planning
WHERE country = 'DE'
AND week = '2024-W12';

That delta is the whole point of a well-designed silver layer: a non-engineer can answer the question they actually have, in the tool they actually have. Hours of engineer time evaporate, replaced by a single SELECT.

What was wrong with the old pipeline

The β€œbefore” was a Databricks pipeline that did sort-of work but wasn’t sustainable:

What I built

Architecture

in-car navigation client
↓ MQTT
telemetry backend
↓
Azure Event Hub
↓
ingest service
↓
Azure Data Lake (bronze, raw protobuf)
↓ ── MY WORK ──
Silver-layer ETL (Databricks / PySpark)
↓
Delta tables (silver: route_planning, connectivity, traffic_health, …)
↓
Grafana / partner dashboards / SLA reports

Why this matters for the business

Navigation telemetry isn’t a vanity metric, it’s contractual. OEM partners pay for navigation that meets agreed quality bars. If route success rates dip, TomTom needs to know within hours, not weeks. The silver layer is what makes that detection-and-explanation loop fast.

Why this earns a spot in projects

Data engineering work is invisible when it’s good, nobody compliments your ETL. But the bar for β€œgood” silver-layer design is can a non-technical PM answer their own question in one query without engineering help? On that bar, this one shipped.

The silver layer is also what unlocks the DCP Guardian AI agent on top of it. The agent walks an event registry and metric catalog that only exist because this pipeline exists. Stable schemas in the data layer β†’ reliable answers in the agent layer. They are the same project at two levels of abstraction.

← all projects