post data-engineering · 2026-02-04 · 5 min read
Medallion architecture: the rules I now follow for bronze, silver, gold
The medallion architecture (bronze → silver → gold) is one of those patterns that’s easy to nod along with and hard to actually execute. The rules sound simple. In practice, every team I’ve seen reinvent it ends up with bronze tables that have transformations baked in, silver tables that are more confusing than the raw data, and gold tables that nobody is sure are actually gold.
This post is the rule set I now follow, after three years of building, refactoring, and arguing about medallion lakes at TomTom and before. Treat it as opinion, not gospel — but if you adopt it as-is, you’ll skip a lot of the misery.
Bronze: capture, don’t transform
Rule 1: bronze is exactly what you got from the source, plus metadata.
Bronze is the raw landing zone. The single test for “is this bronze?” is: if the source went away tomorrow, could I reproduce my downstream tables from bronze alone?
What that means in practice:
- No schema flattening at the bronze layer. Nested JSON stays nested. Protobuf stays as a column of bytes plus a schema-version column.
- No deduplication. Same event arrived twice? Both rows go into bronze. Dedup is silver’s job.
- No business filters. “We don’t care about country=ZZ” is a silver concern, not a bronze one.
- Always add ingestion metadata: `_ingested_at`, `_source_file`, `_source_offset`, `_schema_version` (for streams, topic/partition/offset instead of file). These columns will save you a debugging week eventually.
Why so strict: bronze is your time-travel insurance. If someone changes a business rule six months from now and asks “what would the silver table have looked like under the old rule?”, you can re-derive silver from bronze. If you bake the rule into bronze, you can’t.
```python
# bronze ingestion: minimal, defensive, append-only
events_bronze = (raw_kafka_stream
    .selectExpr(
        "value AS event_payload",        # raw bytes, untouched
        "timestamp AS _ingested_at",
        "topic AS _source_topic",
        "partition AS _source_partition",
        "offset AS _source_offset",
        "1 AS _schema_version",          # bump when source schema changes
    ))

events_bronze.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/chk/events_bronze") \
    .toTable("bronze.events")
```

Rule 2: bronze is append-only. Never UPDATE, never DELETE.
Append-only means bronze is a log. Logs are auditable, replayable, and don’t have weird half-deleted state. If you find yourself wanting to UPDATE bronze, you actually want a new silver table with cleaning rules.
Exception: GDPR / right-to-be-forgotten. Hard deletes for compliance reasons are fine. Everything else: append.
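The append-only contract with its single exception can be made explicit in code. A minimal pure-Python sketch of the idea (class and method names are illustrative, not a real library):

```python
# A minimal model of the append-only bronze contract. Names are illustrative.
class BronzeTable:
    """An append-only log; the only sanctioned mutation is a compliance delete."""

    def __init__(self):
        self._rows = []

    def append(self, row):
        # Duplicates are welcome here: dedup is silver's job.
        self._rows.append(row)

    def gdpr_delete(self, user_id):
        """Right-to-be-forgotten: the one exception to append-only."""
        before = len(self._rows)
        self._rows = [r for r in self._rows if r.get("user_id") != user_id]
        return before - len(self._rows)

    # Deliberately no update() and no general delete(): if you want to
    # "fix" rows, derive a silver table with cleaning rules instead.
```

The point of the sketch is the missing methods: the interface itself makes the wrong operation impossible rather than merely discouraged.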
Silver: schema, clean, queryable
Rule 3: silver has an explicit schema, enforced.
Silver is where structure happens. Every silver table has a defined schema with column names, types, and nullability. If a row arrives that doesn’t match, you reject it (DLQ) or fail loud — you don’t let it pass and pollute downstream.
```python
from pyspark.sql.functions import col
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, BooleanType,
)

ROUTE_PLANNING_SCHEMA = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("user_id", StringType(), nullable=False),
    StructField("country", StringType(), nullable=False),
    StructField("started_at", TimestampType(), nullable=False),
    StructField("success", BooleanType(), nullable=False),
    StructField("failure_reason", StringType(), nullable=True),
])

# Read bronze, parse, validate
silver = (bronze
    .selectExpr("from_protobuf(event_payload, 'RoutePlanningResultEvent') AS evt", "*")
    .select(
        col("evt.id").alias("event_id"),
        col("evt.user_id").alias("user_id"),
        ...
    )
    .filter(col("event_id").isNotNull())  # reject malformed
)
```

Rule 4: silver is the layer at which dedup happens.
Bronze captured everything. Silver collapses to one row per logical key.
```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())
silver_dedup = (silver
    .withColumn("rn", row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))
```

Rule 5: silver is queryable by a non-engineer with SQL.
The bar for “is this really silver?” is: can a PM or analyst, knowing only the column names, write a sensible SQL query to answer their own question? If they need to join three tables and parse a JSON blob first, you have not finished silvering yet.
This is the rule that gets violated most often. Engineers tend to leave silver in a “mostly cleaned” state because it works for them. Push it further. Flatten the nested fields they care about. Resolve the foreign keys they care about (denormalise where it helps). Names should be lowercase_snake_case and consistent.
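To make the bar concrete: against the route-planning silver table above, an analyst should be able to write something like the following unaided (table name and the date arithmetic are illustrative, Spark-SQL-flavoured):

```
-- Failure rate by country over the last week, straight off silver.
SELECT
  country,
  COUNT(*)                                 AS attempts,
  AVG(CASE WHEN success THEN 0 ELSE 1 END) AS failure_rate
FROM silver.route_planning_results
WHERE started_at >= current_date - INTERVAL 7 DAYS
GROUP BY country
ORDER BY failure_rate DESC;
```

If writing that query requires knowing which JSON blob holds `country`, the silvering isn't done.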
Rule 6: silver is idempotent on re-run.
You can replay last week’s bronze data and silver should converge to the same state. This means: deterministic transformations, no `current_timestamp()`-based logic, no random sampling. If you need timestamps, take them from the data, not the wall clock.
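The property is easy to state as code. A pure-Python model of a deterministic bronze-to-silver step (function and column names are illustrative): running it twice on the same input, or re-running it on its own output, converges to the same state.

```python
from datetime import datetime

def to_silver(bronze_rows):
    """Deterministic bronze -> silver: dedup by event_id, keeping the row
    with the latest _ingested_at. No wall-clock calls, no randomness."""
    latest = {}
    for row in bronze_rows:
        key = row["event_id"]
        if key not in latest or row["_ingested_at"] > latest[key]["_ingested_at"]:
            latest[key] = row
    # Stable output order; every timestamp comes from the data, not from now().
    return sorted(latest.values(), key=lambda r: r["event_id"])
```

Note that `_ingested_at` is data that bronze recorded at ingestion time, so replays see the same value the original run saw.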
Gold: business logic, named for who consumes it
Rule 7: gold tables are named for the team / dashboard / use-case they serve.
Bronze and silver are organised by source (one table per upstream event type, more or less). Gold is organised by consumer. A `gold.executive_kpi_dashboard_daily` is different from `gold.ops_alert_thresholds_hourly` even if they touch the same silver underlay.
This is the rule that prevents gold from becoming a junk drawer. Every gold table has a clearly-named owner and consumer; if you can’t answer “who reads this?” at the time of creation, you don’t have a gold table, you have a half-thought-out aggregation.
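One cheap way to enforce the naming half of this rule is a check at table-creation time. A sketch, assuming a hypothetical `gold.<consumer>_<use_case>_<grain>` convention (the grain suffix list is my assumption, not from the post):

```python
import re

# Hypothetical convention: gold.<consumer>_<use_case>_<grain>,
# e.g. gold.ops_alert_thresholds_hourly.
GOLD_NAME = re.compile(
    r"^gold\.[a-z][a-z0-9]*(_[a-z0-9]+)+_(hourly|daily|weekly|monthly)$"
)

def is_valid_gold_name(table):
    """True if the table name states its consumer, use-case, and grain."""
    return GOLD_NAME.match(table) is not None
```

A name that can't satisfy the pattern usually means the "who reads this?" question was never answered.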
Rule 8: gold is allowed to break the bronze→silver→gold boundary, but not the other way.
Gold can join across silver tables, do business-rule arithmetic, apply taxonomies that don’t exist in silver. That’s fine. But silver should never depend on gold (that would make refactoring gold risky), and bronze should never depend on either.
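The direction rule is also checkable in CI. A sketch, assuming you can extract a `{table: [upstream tables]}` map from your pipeline definitions (that extraction step is left out):

```python
# Layer-direction check: no table may read from a higher layer,
# and gold must not read bronze directly.
LAYER = {"bronze": 0, "silver": 1, "gold": 2}

def layer_of(table):
    return LAYER[table.split(".", 1)[0]]

def dependency_violations(deps):
    """Return human-readable descriptions of every boundary violation."""
    bad = []
    for table, upstreams in deps.items():
        for up in upstreams:
            if layer_of(up) > layer_of(table):
                bad.append(f"{table} depends on higher layer: {up}")
            elif layer_of(table) == 2 and layer_of(up) == 0:
                bad.append(f"{table} reads bronze directly: {up}")
    return bad
```

Run it on every pipeline change; an empty list means the arrows below still point the right way.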
```
bronze          silver           gold
──────          ──────           ────
sources   →   structures   →   business
                  ↑ doesn't depend on gold
```

Gold reads silver freely; gold reads bronze never.

What I no longer do
A few patterns I’ve stopped:
- One-step bronze-to-gold pipelines. Tempting for “small” tables. Always blows up later when someone wants to re-run a transform with a different rule. Three layers, always.
- Silver tables with a `transformed_at` column filled from `current_timestamp()`. Breaks idempotency. Use the source event time.
- Mixing append-only and CDC bronze. If half your bronze tables are append-only event logs and the other half are CDC snapshots, it confuses everyone. Split them visually: `bronze_events.*` vs `bronze_cdc.*`.
- Gold tables that nobody owns. When the consumer team disbands, the gold table becomes orphaned. Add an `owner_team` field to your data catalog and review quarterly.
What good looks like
Six months into a new medallion lake, you should be able to answer:
- For any silver row, which bronze rows produced it? (lineage)
- For any silver schema column, what is its meaning? (catalog)
- For any gold table, who consumes it and what’s their SLA? (ownership)
- If we change a rule in silver, can we backfill from bronze without losing data? (replayability)
If yes to all four, the architecture is doing its job. If no to any, you have a specific thing to fix.
Closing
Rules are easy to write, hard to enforce. The way I keep medallion discipline on the team is two things:
- Code review on schema changes. Any silver-table schema change goes through a review that explicitly asks “could a non-engineer query this?” If the answer is no, the change is rejected.
- A “data debt” backlog. Every time we let a rule slip (“ok, just this once we’ll bake the country filter into bronze”), it goes on the backlog. We work it down quarterly. Without that, every shortcut becomes permanent.
Bronze captures, silver structures, gold serves. If those three sentences are alive in your head when you write a pipeline, you mostly do not need rules 1 through 8. The rules exist to recover when the sentences slip.