data-engineering series · Streaming-data architectures · 2025-07-08 · 6 min read
Architecture series, part 1: Lambda architecture, where it came from and why it hurt
Part 1 of a 2-post architecture series.
- Part 1 (this post): Lambda architecture, where it came from and why it hurt
- Part 2: Kappa architecture, the reaction
In 2011, Nathan Marz was at BackType (acquired by Twitter that July). The data-platform problem of the era was: batch jobs (Hadoop MapReduce) gave you correct, complete answers but with hours of latency. Stream processing (Storm, which Marz had built at BackType the year before) gave you sub-second latency but was operationally painful and didn’t reprocess history cleanly. Each was right for half the question.
Marz proposed that the answer wasn’t to pick one — it was to run both, and merge their outputs at query time. He called it Lambda architecture. It dominated big-data thinking for the next five years.
This post is the architecture, the problem it solved, and the operational pain that eventually made most teams move on.
The problem Lambda was answering
If you’re an analyst in 2014 looking at a dashboard, you want two contradictory things:
1. Up-to-the-second freshness. The last 5 minutes of activity should show up.
2. Historically correct answers. The number for “yesterday” should match what you’d get if you re-ran the query a year from now. No drift, no fudged late events.
Batch (Hadoop, Spark) gives you (2) but not (1): the daily job runs at 3am and that’s the only time the numbers update. Streaming (Storm, Flink) gives you (1) but not (2): in 2014’s tooling, late-arriving events, processing-time drift, and exactly-once delivery were all handled unreliably.
Lambda’s answer: do both. Run a batch layer that reprocesses the entire history nightly for correctness. Run a speed layer that processes the last few hours as a stream for freshness. Have a serving layer that merges the two at query time, returning batch results for old data and stream results for the recent window.
The architecture
        ┌────────────────────────────────────┐
        │           Master dataset           │
        │   (immutable, append-only event    │
        │     log; the source of truth)      │
        └──────────────┬─────────────────────┘
                       │
            ┌──────────┴──────────┐
            │                     │
      ┌─────▼──────┐      ┌───────▼────────┐
      │   Batch    │      │     Speed      │
      │   layer    │      │     layer      │
      │  (Hadoop,  │      │ (Storm, Flink) │
      │   Spark)   │      │                │
      │            │      │   processes    │
      │    full    │      │  recent window │
      │  recompute │      │  in real time  │
      │   nightly  │      │                │
      └─────┬──────┘      └───────┬────────┘
            │                     │
      ┌─────▼─────────────────────▼──────────┐
      │             Serving layer            │
      │ (HBase, Cassandra, key-value store)  │
      │                                      │
      │ Query merges batch view + speed view │
      └──────────────────┬───────────────────┘
                         │
                         ▼
                       Query

Three layers, each with a clear job. The master dataset is immutable; you can always re-derive any view from it. The batch layer recomputes everything nightly so any bugs in the streaming layer get auto-corrected within 24 hours. The speed layer keeps things fresh while you wait for batch.
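To make the “re-derive any view” property concrete, here’s a minimal Python sketch (the file path and event schema are invented for illustration): the only write the master dataset supports is append, and every view is a pure function of the full log.

```python
import json
from collections import defaultdict

LOG_PATH = "events.jsonl"  # illustrative: the append-only master dataset

def append_event(event: dict) -> None:
    """The only write the master dataset supports: append, never update or delete."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

def derive_requests_per_endpoint() -> dict:
    """Any view can be rebuilt from scratch by replaying the whole log."""
    counts: defaultdict = defaultdict(int)
    with open(LOG_PATH) as f:
        for line in f:
            counts[json.loads(line)["endpoint"]] += 1
    return dict(counts)
```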
Why it was right (in 2014)
Three things were true in 2014 that aren’t true now:
- Stream processors were unreliable for exactly-once delivery. Storm offered “at-least-once” with idempotent sinks as the workaround (sketched after this list). Reprocessing a stream for correctness was painful.
- Batch frameworks were the only reliable way to do complex aggregations over large data. Spark was just emerging; Hadoop MapReduce was the workhorse.
- Event logs (durable, replayable) weren’t standard infra. Kafka existed but wasn’t yet the de-facto pipeline backbone.
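To unpack that “at-least-once with idempotent sinks” workaround: the processor is allowed to deliver the same event twice, and the sink is written so the second delivery changes nothing. A minimal Python sketch, with an invented table and event shape and SQLite standing in for any store:

```python
import sqlite3

# Illustrative sink store; any key-value or relational store works the same way.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE response_times (event_id TEXT PRIMARY KEY, endpoint TEXT, ms REAL)"
)

def idempotent_sink(event: dict) -> None:
    # At-least-once delivery may hand us the same event twice; keying the write
    # on event_id turns the duplicate into a no-op instead of a double count.
    conn.execute(
        "INSERT OR IGNORE INTO response_times (event_id, endpoint, ms) VALUES (?, ?, ?)",
        (event["event_id"], event["endpoint"], event["ms"]),
    )
    conn.commit()

# Deliver the same event twice, as an at-least-once processor might.
event = {"event_id": "e-123", "endpoint": "/search", "ms": 120.0}
idempotent_sink(event)
idempotent_sink(event)
print(conn.execute("SELECT COUNT(*) FROM response_times").fetchone()[0])  # 1
```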
In that world, Lambda made sense. Batch handled the cases stream couldn’t; stream handled the cases batch couldn’t.
The operational cost (the part the architecture diagrams hide)
Here’s what nobody put in the architecture diagram:
Two codebases for the same logic. The “compute average response time per endpoint per hour” rule lives in two places: the Spark job (batch) and the Storm topology (speed). Want to change the rule? Edit two files in two repos using two different APIs in two different languages.
Diverged correctness. Despite best intentions, the batch and streaming implementations drift. The batch job uses one timestamp parser; the stream uses another. The batch handles a null field as zero; the stream handles it as missing. Now query-time results differ depending on whether you’re reading the batch view or the speed view, and debugging that is a nightmare.
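To make that concrete, here’s a deliberately simplified Python sketch of the same “average response time per endpoint per hour” rule written twice, once per layer (in a real system these would be a Spark job and a Storm topology; the function names, fields, and sample events are invented). It includes exactly the kind of quiet disagreement described above: the batch version counts a null response time as zero, the speed version skips it.

```python
# Batch layer's copy of "average response time per endpoint per hour".
def batch_hourly_avg(events):
    sums, counts = {}, {}
    for e in events:
        key = (e["endpoint"], e["ts"][:13])   # hour bucket, e.g. "2025-07-08T14"
        ms = e.get("ms") or 0.0                # null treated as zero
        sums[key] = sums.get(key, 0.0) + ms
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Speed layer's copy of the *same* rule, written separately by a different team.
def speed_hourly_avg(events):
    sums, counts = {}, {}
    for e in events:
        if e.get("ms") is None:                # null skipped entirely
            continue
        key = (e["endpoint"], e["ts"][:13])
        sums[key] = sums.get(key, 0.0) + e["ms"]
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Same input, different answers: this is the divergence a Lambda team debugs.
events = [
    {"endpoint": "/search", "ts": "2025-07-08T14:02:11", "ms": 120.0},
    {"endpoint": "/search", "ts": "2025-07-08T14:40:00", "ms": None},
]
print(batch_hourly_avg(events))   # {('/search', '2025-07-08T14'): 60.0}
print(speed_hourly_avg(events))   # {('/search', '2025-07-08T14'): 120.0}
```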
Doubled deploys, doubled monitoring, doubled on-call. Two systems to alert on. Two systems to scale. Two systems to roll back when something breaks.
Backfills hit both. Late-arriving data needs to land in both views. The merge logic at query time has to handle the case where batch hasn’t caught up yet but stream already saw the event. The “correct” answer depends on which view a query routes to.
The serving layer is non-trivial. Merging batch + stream at query time isn’t just UNION; you have to deduplicate (events seen in both layers), reconcile semantics (batch and speed may have computed slightly different things), and handle the boundary (events from the last hour: are they in batch yet?).
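Here’s a minimal sketch of what that query-time merge has to handle, with every name invented for illustration: old buckets come from the batch view, recent buckets from the speed view, and on the overlap you deduplicate by event id because both layers may have seen the same events.

```python
def serve_avg(batch_view, speed_view, batch_cutoff):
    """Query-time merge of batch and speed views of per-hour response times.

    Each view maps hour_bucket -> {event_id: response_ms}. Keying on event_id
    is what lets us deduplicate events that both layers have already seen.
    batch_cutoff is the last hour the nightly batch run has fully recomputed.
    """
    merged = {}
    for bucket in sorted(set(batch_view) | set(speed_view)):
        if bucket <= batch_cutoff:
            # Old data: batch wins, but fold in speed-layer events the nightly
            # run hasn't seen yet (late arrivals after the batch job started).
            rows = dict(speed_view.get(bucket, {}))
            rows.update(batch_view.get(bucket, {}))   # batch overrides duplicates
        else:
            # Recent window: only the speed layer has anything here.
            rows = speed_view.get(bucket, {})
        if rows:
            merged[bucket] = sum(rows.values()) / len(rows)
    return merged

# Usage: the boundary hour appears in both views with an overlapping event id.
batch = {"14:00": {"e1": 100.0, "e2": 200.0}}
speed = {"14:00": {"e2": 200.0, "e3": 90.0}, "15:00": {"e4": 50.0}}
print(serve_avg(batch, speed, batch_cutoff="14:00"))
# {'14:00': 130.0, '15:00': 50.0}
```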
I’ve yet to meet an engineer who maintained a Lambda system for two years and didn’t sigh while explaining it.
When Lambda still makes sense
A few cases where I wouldn’t talk a team out of Lambda:
- Regulated environments. Banking, healthcare, anything with audit requirements. The batch layer is a defensible, fully-replayable source of truth that auditors can reason about. The speed layer is a “best-effort” overlay. Some compliance regimes explicitly prefer this separation.
- Pre-existing massive batch infrastructure. If you have 10 years of Spark jobs and 200 TB of historical data, ripping it out for streaming is a multi-year project. Lambda lets you bolt streaming on without abandoning the batch investment.
- Workloads where batch IS faster. Some aggregations (window-shaped, full-history analyses) genuinely run cheaper as a nightly batch than as a streaming job that holds 24 hours of state.
In these cases, accept the cost. The architecture earns its keep.
When NOT to pick Lambda in 2025
In most other cases, the cost outweighs the benefit. Specifically, if:
- You have Kafka (or any durable event log) as your pipeline backbone.
- You have a modern stream processor (Flink 1.x+, Spark Structured Streaming 3.x+, Kafka Streams) that supports exactly-once and stateful operations.
- Your team’s bandwidth for maintaining infrastructure is finite.
…then Kappa architecture (covered in part 2) is the right call. One pipeline, replayable from the event log, no diverged-codebase pain.
What changed between 2014 and now
Three things made Lambda’s tradeoffs less attractive over time:
- Stream processors got better. Flink shipped end-to-end exactly-once via TwoPhaseCommitSinkFunction in 1.4.0 (December 2017); internal exactly-once via checkpointing landed earlier. Spark Structured Streaming reached GA in 2.2.0 (2017) and made stream programming look like batch (a sketch follows this list). Stateful operations, watermarks, and idempotent sinks became commodity.
- Kafka (and equivalent durable event logs) became universal. When the source of truth IS the event log, replaying it through a stream processor is batch processing. The distinction starts to feel artificial.
- Cloud and container orchestration made operations cheaper. Running one streaming job on Kubernetes is no longer hard; running two systems isn’t twice the work it used to be, but it’s still 2x the surface area to debug.
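For a sense of what “made stream programming look like batch” means in practice, here’s a hedged Spark Structured Streaming sketch of the same hourly-average job from earlier (the broker address, topic name, schema, and checkpoint path are all invented): the aggregation is ordinary DataFrame code, and the checkpoint location is what lets the engine recover its state consistently across restarts.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("hourly-avg").getOrCreate()

schema = StructType([
    StructField("endpoint", StringType()),
    StructField("ts", TimestampType()),
    StructField("ms", DoubleType()),
])

# Reading a stream looks like reading a table; everything after .load() is the
# same DataFrame code you would write for a batch job over historical files.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "response-times")              # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

hourly_avg = (
    events
    .withWatermark("ts", "1 hour")                      # bounded handling of late events
    .groupBy(F.window("ts", "1 hour"), "endpoint")
    .agg(F.avg("ms").alias("avg_ms"))
)

# The checkpoint is what gives Structured Streaming consistent state recovery
# across restarts; swap the console sink for a real one in practice.
query = (
    hourly_avg.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/hourly-avg")
    .start()
)
query.awaitTermination()
```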
Closing
Lambda solved a real 2014 problem with the tools available in 2014. The cost was running two pipelines for the same logic — manageable when streaming was unreliable, painful when streaming caught up.
If you’re starting a fresh data platform today and aren’t in a regulated/legacy-batch context, you almost certainly want Kappa. Part 2 covers it: same goals, single pipeline, replay-from-event-log for reprocessing.
Continue: part 2 — Kappa architecture, the reaction.