post data-engineering series · Streaming-data architectures · 2025-07-08 · 6 min read
Architecture series, part 1: Lambda architecture, where it came from and why it hurt
Part 1 of a 2-post architecture series. Part 1 (this post): Lambda architecture, where it came from and why it hurt. Part 2: Kappa architecture, the reaction.
In 2011, Nathan Marz was at BackType (acquired by Twitter that July). The data-platform problem of the era was a split. Batch jobs (Hadoop MapReduce) gave you correct, complete answers, but with hours of latency. Stream processing (Storm, which Marz had built at BackType the year before) gave you sub-second latency, but it was operationally painful and didn't reprocess history cleanly. Each was right for half the question.
Marz proposed that the answer wasn't to pick one; it was to run both, and merge their outputs at query time. He called it Lambda architecture. It dominated big-data thinking for the next five years.
This post covers the architecture, the problem it solved, and the operational pain that eventually made most teams move on.
The problem Lambda was answering
If you're an analyst in 2014 looking at a dashboard, you want two contradictory things:
1. Up-to-the-second freshness. The last 5 minutes of activity should show.
2. Historically correct answers. The number for "yesterday" should match what you'd get if you re-ran the query a year from now. No drift, no fudged late events.
Batch (Hadoop, Spark) gives you (2) but not (1): the daily job runs at 3am and that's the only time the numbers update. Streaming (Storm, Flink) gives you (1) but not (2): in 2014's tooling, the handling of late-arriving events, processing-time drift, and exactly-once delivery was unreliable.
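To make the late-event problem concrete, here is a minimal sketch (the events and the `daily_counts` helper are made up for illustration). It counts the same three events per day twice: once by when they happened, once by when they arrived. One event crosses midnight in transit, so the two answers disagree.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical events as (event_time, arrival_time). The second one is late:
# it happened on June 1 but arrived just after midnight on June 2.
EVENTS = [
    (datetime(2014, 6, 1, 23, 50), datetime(2014, 6, 1, 23, 50)),
    (datetime(2014, 6, 1, 23, 58), datetime(2014, 6, 2, 0, 3)),
    (datetime(2014, 6, 2, 0, 1), datetime(2014, 6, 2, 0, 1)),
]


def daily_counts(events, use_event_time):
    """Count events per day, bucketed by event time or by arrival time."""
    counts = Counter()
    for event_time, arrival_time in events:
        ts = event_time if use_event_time else arrival_time
        counts[ts.date()] += 1
    return counts


by_event_time = daily_counts(EVENTS, use_event_time=True)
by_arrival_time = daily_counts(EVENTS, use_event_time=False)

print(by_event_time[date(2014, 6, 1)])    # 2: what "yesterday" should say
print(by_arrival_time[date(2014, 6, 1)])  # 1: what a processing-time stream saw
```

A nightly batch job that runs after all the data has landed can bucket by event time easily. A stream that buckets by arrival time gets "yesterday" wrong unless it has some mechanism for late data.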
Lambda's answer: do both. Run a batch layer that reprocesses the entire history nightly for correctness. Run a speed layer that processes the last few hours as a stream for freshness. Have a serving layer that merges the two at query time, returning batch results for old data and stream results for the recent window.
The architecture
┌────────────────────────────────────────┐
│ Master dataset                         │
│ (immutable, append-only event          │
│ log; the source of truth)              │
└────────────────────┬───────────────────┘
                     │
     ┌───────────────┴────────────────┐
     │                                │
┌────▼─────┐                   ┌──────▼────────┐
│ Batch    │                   │ Speed         │
│ layer    │                   │ layer         │
│ (Hadoop, │                   │ (Storm, Flink)│
│ Spark)   │                   │               │
│          │                   │ processes     │
│ full     │                   │ recent window │
│ recompute│                   │ in real time  │
│ nightly  │                   │               │
└────┬─────┘                   └──────┬────────┘
     │                                │
┌────▼────────────────────────────────▼──┐
│ Serving layer                          │
│ (HBase, Cassandra, key-value store)    │
│                                        │
│ Query merges batch view + speed view   │
└────────────────────────────────────────┘
                     │
                     ▼
                   Query

Three layers, each with a clear job. The master dataset is immutable; you can always re-derive any view from it. The batch layer recomputes everything nightly so any bugs in the streaming layer get auto-corrected within 24 hours. The speed layer keeps things fresh while you wait for batch.
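The three layers can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production system was built: the in-memory `MASTER_LOG`, the `batch_view` function, the `SpeedView` class, and the `query` function are all hypothetical names, and a real deployment would use Hadoop or Spark, Storm, and a key-value store in their place.

```python
from collections import Counter
from datetime import datetime

# Hypothetical master dataset: immutable, append-only (event_time, endpoint).
MASTER_LOG = [
    (datetime(2014, 6, 1, 22, 15), "/home"),
    (datetime(2014, 6, 1, 23, 40), "/home"),
    (datetime(2014, 6, 2, 0, 5), "/home"),
    (datetime(2014, 6, 2, 0, 20), "/cart"),
]


def hour_of(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def batch_view(log, cutoff):
    """Batch layer: full recompute over every event before the cutoff."""
    return Counter((hour_of(ts), ep) for ts, ep in log if ts < cutoff)


class SpeedView:
    """Speed layer: incremental counts for events at or after the cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def on_event(self, ts, endpoint):
        if ts >= self.cutoff:
            self.counts[(hour_of(ts), endpoint)] += 1


def query(batch, speed, hour, endpoint):
    """Serving layer: batch answers old hours, speed answers recent ones."""
    view = batch if hour < speed.cutoff else speed.counts
    return view[(hour, endpoint)]


# The nightly batch run covers everything before midnight on June 2.
cutoff = datetime(2014, 6, 2)
batch = batch_view(MASTER_LOG, cutoff)
speed = SpeedView(cutoff)
for ts, endpoint in MASTER_LOG:
    speed.on_event(ts, endpoint)

old_hour = query(batch, speed, datetime(2014, 6, 1, 23), "/home")  # from batch
new_hour = query(batch, speed, datetime(2014, 6, 2, 0), "/home")   # from speed
```

Note that the counting logic already appears twice, once in `batch_view` and once in `SpeedView.on_event`, even in a toy this small.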
Why it was right (in 2014)
Three things were true in 2014 that aren't true now:
- Stream processors were unreliable for exactly-once delivery. Storm offered "at-least-once" with idempotent sinks. Reprocessing a stream for correctness was painful.
- Batch frameworks were the only reliable way to do complex aggregations over large data. Spark was just emerging; Hadoop MapReduce was the workhorse.
- Event logs (durable, replayable) weren't standard infra. Kafka existed but wasn't yet the de-facto pipeline backbone.
In that world, Lambda made sense. Batch handled the cases stream couldn't; stream handled the cases batch couldn't.
The operational cost (the part the architecture diagrams hide)
Here's what nobody put in the architecture diagram:
Two codebases for the same logic. The "compute average response time per endpoint per hour" rule lives in two places: the Spark job (batch) and the Storm topology (speed). Want to change the rule? Edit two files in two repos using two different APIs in two different languages.
Diverged correctness. Despite best intentions, the batch and streaming implementations drift. The batch job uses one timestamp parser; the stream uses another. The batch handles a null field as zero; the stream handles it as missing. Now query-time results differ depending on whether you're reading the batch view or the speed view, and debugging that is a nightmare.
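The null-handling drift is easy to reproduce. In the sketch below, both functions claim to compute "average latency", and both are plausible in isolation; the records and function names are invented for illustration.

```python
# Hypothetical response-time records; None marks a missing latency field.
EVENTS = [
    {"endpoint": "/home", "latency_ms": 100},
    {"endpoint": "/home", "latency_ms": None},
    {"endpoint": "/home", "latency_ms": 200},
]


def batch_avg_latency(events):
    """Batch implementation: treats a null latency as zero."""
    values = [e["latency_ms"] or 0 for e in events]
    return sum(values) / len(values)


def stream_avg_latency(events):
    """Streaming implementation: skips events with a null latency."""
    total, count = 0, 0
    for e in events:
        if e["latency_ms"] is None:
            continue
        total += e["latency_ms"]
        count += 1
    return total / count if count else 0.0


batch_result = batch_avg_latency(EVENTS)    # 100.0
stream_result = stream_avg_latency(EVENTS)  # 150.0
```

Same events, same stated rule, and a 50% gap between the two views. Which number the dashboard shows depends on which side of the batch boundary the query lands.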
Doubled deploys, doubled monitoring, doubled on-call. Two systems to alert on. Two systems to scale. Two systems to roll back when something breaks.
Backfills hit both. Late-arriving data needs to land in both views. The merge logic at query time has to handle the case where batch hasn't caught up yet but stream already saw the event. The "correct" answer depends on which view a query routes to.
The serving layer is non-trivial. Merging batch + stream at query time isn't just UNION; you have to deduplicate (events seen in both layers), reconcile semantics (batch and speed may have computed slightly different things), and handle the boundary (events from the last hour: are they in batch yet?).
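Here is a minimal sketch of the deduplication part, assuming every event carries a unique `event_id` and that both views still expose raw events. That second assumption is a simplification: real serving layers usually hold pre-aggregated counts, which is exactly what makes deduplication hard. The `Event` type and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    hour: int  # hypothetical hour bucket, e.g. hours since some epoch
    endpoint: str


def naive_count(batch_events, speed_events, hour, endpoint):
    """UNION-style merge: adds both views, double-counting any overlap."""
    def matches(e):
        return e.hour == hour and e.endpoint == endpoint
    return (sum(1 for e in batch_events if matches(e))
            + sum(1 for e in speed_events if matches(e)))


def merged_count(batch_events, speed_events, hour, endpoint):
    """Deduplicating merge: counts each event_id once across both views."""
    seen = set()
    for e in list(batch_events) + list(speed_events):
        if e.hour == hour and e.endpoint == endpoint:
            seen.add(e.event_id)
    return len(seen)


# Event "b" sits on the boundary: batch already has it, speed still has it.
batch_events = [Event("a", 10, "/home"), Event("b", 10, "/home")]
speed_events = [Event("b", 10, "/home"), Event("c", 10, "/home")]

naive = naive_count(batch_events, speed_events, 10, "/home")    # 4
merged = merged_count(batch_events, speed_events, 10, "/home")  # 3
```

This covers only the first of the three problems. Reconciling semantics and deciding where the batch boundary sits are separate work.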
I've yet to meet an engineer who maintained a Lambda system for two years and didn't sigh while explaining it.
When Lambda still makes sense
A few cases where I wouldn't talk a team out of Lambda:
- Regulated environments. Banking, healthcare, anything with audit requirements. The batch layer is a defensible, fully-replayable source of truth that auditors can reason about. The speed layer is a "best-effort" overlay. Some compliance regimes explicitly prefer this separation.
- Pre-existing massive batch infrastructure. If you have 10 years of Spark jobs and 200 TB of historical data, ripping it out for streaming is a multi-year project. Lambda lets you bolt streaming on without abandoning the batch investment.
- Workloads where batch IS cheaper. Some aggregations (window-shaped, full-history analyses) genuinely run cheaper as a nightly batch than as a streaming job that holds 24 hours of state.
In these cases, accept the cost. The architecture earns its keep.
When NOT to pick Lambda in 2025
In most other cases, the cost outweighs the benefit. Specifically, if:
- You have Kafka (or any durable event log) as your pipeline backbone.
- You have a modern stream processor (Flink 1.4+, Spark Structured Streaming 3.x+, Kafka Streams) that supports exactly-once and stateful operations.
- Your team's bandwidth for maintaining infrastructure is finite.
...then Kappa architecture (covered in part 2) is the right call. One pipeline, replayable from the event log, no diverged-codebase pain.
What changed between 2014 and now
Three things made Lambda's tradeoffs less attractive over time:
- Stream processors got better. Flink shipped end-to-end exactly-once via TwoPhaseCommitSinkFunction in 1.4.0 (December 2017); internal exactly-once via checkpointing landed earlier. Spark Structured Streaming reached GA in 2.2.0 (2017) and made stream programming look like batch. Stateful operations, watermarks, and idempotent sinks became commodity.
- Kafka (and equivalent durable event logs) became universal. When the source of truth IS the event log, replaying it through a stream processor is batch processing. The distinction starts to feel artificial.
- Cloud and container orchestration made operations cheaper. Running one streaming job on Kubernetes is no longer hard; running two systems isn't twice the work it used to be, but it's still 2x the surface area to debug.
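The "replay is batch" point can be shown with a toy. A plain Python list stands in for a Kafka topic here, and `count_by_endpoint` is a hypothetical stand-in for a streaming job. The same function handles events one at a time as they arrive, or the whole log replayed from offset 0.

```python
from collections import Counter


def count_by_endpoint(events, state=None):
    """Single pipeline: fold events into per-endpoint counts."""
    state = Counter() if state is None else state
    for endpoint in events:
        state[endpoint] += 1
    return state


# Hypothetical durable log, standing in for a Kafka topic.
LOG = ["/home", "/cart", "/home", "/checkout"]

# "Streaming": events are processed one at a time as they are appended.
live = Counter()
for event in LOG:
    count_by_endpoint([event], live)

# "Batch": replay the whole log from offset 0 through the same function.
replayed = count_by_endpoint(LOG)

print(live == replayed)  # True: one codebase, two modes of running it
```

With a single implementation there is no second codebase to drift, which is the core of the Kappa argument.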
Closing
Lambda solved a real 2014 problem with the tools available in 2014. The cost was running two pipelines for the same logic: manageable when streaming was unreliable, painful once streaming caught up.
If you're starting a fresh data platform today and aren't in a regulated/legacy-batch context, you almost certainly want Kappa. Part 2 covers it: same goals, single pipeline, replay-from-event-log for reprocessing.
Continue with part 2: Kappa architecture, the reaction.