data-engineering series · Streaming-data architectures · 2025-11-15 · 6 min read
Architecture series, part 2: Kappa architecture, the reaction
Part 2 of a 2-post architecture series.
- Part 1: Lambda architecture, where it came from and why it hurt
- Part 2 (this post): Kappa architecture, the reaction
In 2014, Jay Kreps (LinkedIn, later Confluent) wrote a blog post titled “Questioning the Lambda Architecture.” The thesis was simple and provocative: if your stream processor can reprocess history by replaying the event log, why do you need a separate batch layer at all?
The idea was Kappa architecture, named purely to riff on Lambda. It became the dominant streaming-platform pattern by the late 2010s and is still the right default for most modern data platforms.
This post covers how Kappa works, why it’s usually the right call now, and the specific operational concerns that don’t go away just because you removed the batch layer.
The argument
Lambda’s pain was running the same logic in two places (batch + speed). What if the speed layer could also do the batch layer’s job?
That’s possible if two things are true:
- The source of truth is a durable, replayable event log (Kafka, Kinesis, Pulsar, equivalent).
- The stream processor can replay the log from any offset to reprocess historical data with the same code path it uses for live data.
In 2014 those weren’t yet standard. By 2018 they were. Kappa became viable.
The architecture
┌─────────────────────────────────┐
│        Durable event log        │
│   (Kafka, Kinesis, Pulsar, …)   │
│     immutable, partitioned,     │
│   long-retention, replayable    │
└─────────────┬───────────────────┘
              │
              │ (single subscription)
              ▼
┌─────────────────────────────────┐
│        Stream processor         │
│    (Flink, Spark Structured     │
│    Streaming, Kafka Streams)    │
│                                 │
│  same code path for live data   │
│  AND for reprocessing history   │
└─────────────┬───────────────────┘
              │
              ▼
┌─────────────────────────────────┐
│          Serving layer          │
│   (the latest computed view)    │
└─────────────┬───────────────────┘
              │
              ▼
            Query

Two layers (event log + stream processor) plus a serving store. No separate batch path. To “rerun history” with new logic: deploy the new stream-processor code, rewind to offset 0, let it replay through the entire log into a new output, then atomically swap the serving layer to point at the new output. No batch job, no merge logic.
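As a concrete illustration, here is roughly what the “rewind to offset 0” step looks like with the confluent-kafka Python client. This is a hedged sketch, not a prescribed implementation: the topic, consumer group, and broker address are placeholders, and in practice this logic lives inside whatever launches the reprocessing job.

```python
# Rewind a reprocessing consumer to the beginning of every partition so it
# replays the full log. Topic, group, and broker address are placeholders.
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

TOPIC = "events"          # hypothetical topic name
GROUP = "aggregator-v2"   # fresh consumer group for the reprocessing run

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": GROUP,
    "enable.auto.commit": False,   # commit only after successful writes
})

# Discover the topic's partitions, then start each one at offset 0.
metadata = consumer.list_topics(TOPIC, timeout=10)
partitions = [
    TopicPartition(TOPIC, p, OFFSET_BEGINNING)
    for p in metadata.topics[TOPIC].partitions
]
consumer.assign(partitions)

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    # process(msg) writes into the *new* output; the serving layer is
    # swapped to that output only once the replay has caught up.
```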
Why it usually wins
Three concrete benefits over Lambda:
One codebase. The “compute average response time per endpoint per hour” rule lives in one Flink job, written once, tested once, deployed once. Want to change the rule? Edit one file. Want to backfill the change to historical data? Replay from offset 0 with the new code. No second implementation to keep in sync.
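To make “one codebase” concrete, here is a minimal plain-Python sketch of that rule. The point is that the same function consumes either the live stream or a full replay of the log, so there is exactly one implementation to test; the event shape (endpoint, timestamp_ms, response_ms) and the helper names in the comment are assumptions, and it glosses over the incremental emission a real Flink job would handle.

```python
# One implementation of "average response time per endpoint per hour".
# The same function runs over live events and over a replayed log.
from collections import defaultdict

HOUR_MS = 3_600_000

def aggregate(events):
    """events: iterable of dicts with endpoint, timestamp_ms, response_ms."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for e in events:
        hour = e["timestamp_ms"] // HOUR_MS          # tumbling 1-hour window
        key = (e["endpoint"], hour)
        sums[key] += e["response_ms"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Live:     aggregate(consume_live(...))
# Backfill: aggregate(replay_from_offset_0(...))
# (consume_live / replay_from_offset_0 are hypothetical helpers; the
#  aggregation logic itself is the single shared code path.)
```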
Simpler operational footprint. One streaming system to monitor. One on-call rotation. One scaling story. Half the alert fatigue.
No merge logic at query time. The serving layer holds one view. Queries go straight to it. There’s no “which layer is this query routed to?” to debug.
The specific operational concerns Kappa doesn’t make go away
The naive Kappa pitch is “just use streaming everywhere.” The reality is more nuanced. Five things that bite you:
1. Event-log retention costs
Kappa assumes the event log holds enough history to reprocess from. If you set Kafka retention to 7 days and decide you need to backfill 6 months, you’re stuck — the data isn’t there.
The fix: keep all events forever or tier old events to cheap object storage (Kafka Tiered Storage, Confluent Tiered Storage, or DIY: dump segments to S3 + replay-from-S3 tooling). Both options exist; both have a cost. Budget for it explicitly.
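A back-of-envelope sketch of that budget, with made-up numbers; the event rate, event size, and storage prices are all assumptions you would replace with your own.

```python
# Rough retention-cost arithmetic for a hot (broker) tier and a warm
# (object storage) tier. Every constant here is an illustrative assumption.
events_per_day  = 500_000_000        # assumption
avg_event_bytes = 400                # assumption (compressed, on disk)
replication     = 3                  # broker replication factor on the hot tier

hot_days, warm_days = 7, 365

hot_gb  = events_per_day * avg_event_bytes * replication * hot_days / 1e9
warm_gb = events_per_day * avg_event_bytes * warm_days / 1e9   # object store handles redundancy

hot_cost  = hot_gb  * 0.10    # assumed $/GB-month for broker SSD
warm_cost = warm_gb * 0.023   # assumed $/GB-month for S3-class storage

print(f"hot tier:  {hot_gb:,.0f} GB ≈ ${hot_cost:,.0f}/month")
print(f"warm tier: {warm_gb:,.0f} GB ≈ ${warm_cost:,.0f}/month")
```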
2. Reprocessing time
Replaying a 5-billion-event log takes hours or days, depending on how much parallelism your stream processor can bring to bear. During reprocessing you have a parallel pipeline running alongside live traffic, costing roughly 2× compute. You also need two serving outputs (old view + new view) until you swap.
The fix: do reprocessing during low-traffic windows. Pre-warm caches before the swap. Have a rollback plan if the new view turns out wrong (point serving back at the old view, fix code, reprocess again).
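For the swap step itself, one hedged sketch: if the serving layer is a SQL database and readers query through a view, both “swap” and “rollback” are a single CREATE OR REPLACE VIEW statement. The table and view names are hypothetical, and this assumes a psycopg2-style connection.

```python
# Atomic swap / rollback by repointing the view that readers query.
# "serving", "metrics_v1" and "metrics_v2" are placeholder names.
SWAP_TO_NEW = "CREATE OR REPLACE VIEW serving AS SELECT * FROM metrics_v2;"
ROLL_BACK   = "CREATE OR REPLACE VIEW serving AS SELECT * FROM metrics_v1;"

def repoint(conn, statement):
    with conn.cursor() as cur:
        cur.execute(statement)   # one DDL statement: readers see either the
    conn.commit()                # old view or the new one, never a mix

# repoint(conn, SWAP_TO_NEW)  # after the replay has caught up and been validated
# repoint(conn, ROLL_BACK)    # if the new view turns out to be wrong
```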
3. Stateful streaming is harder than stateless
Stateless transforms (filter, project, parse) replay trivially. Stateful ones (sessionisation, joins, aggregations with windows) hold state per key. Replaying from offset 0 means rebuilding all that state from scratch.
Three implications:
- State size during replay can be huge (the whole history’s worth of sessions, joins, aggregations).
- Watermark behaviour differs during replay vs live (event timestamps from a year ago vs from now). The replay engine needs to advance watermarks based on event-time, not wall-clock time.
- Stateful operators may need explicit eventTimeOrder=true to avoid surprises.
The fix: design stateful operators with replay in mind. Use Flink’s processing-time-vs-event-time semantics carefully. Test reprocessing on a representative subset before running it for real.
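A tiny sketch of the watermark point, independent of any particular engine: the watermark is derived from the largest event timestamp seen so far, minus an allowed-lateness bound (5 minutes here, purely illustrative), so a replay closes year-old windows in event-time order rather than all at once when wall-clock time says “now”.

```python
# Event-time watermarks: advance from the timestamps inside the events,
# not from the wall clock. The lateness bound is an illustrative assumption.
ALLOWED_LATENESS_MS = 5 * 60 * 1000

def with_watermarks(events):
    """events: iterable of dicts carrying an event-time 'timestamp_ms' field."""
    max_event_time = 0
    for e in events:
        max_event_time = max(max_event_time, e["timestamp_ms"])
        watermark = max_event_time - ALLOWED_LATENESS_MS
        yield e, watermark   # windows ending before `watermark` can be finalised
```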
4. Schema evolution is harder
When the stream processor sees an event from 2 years ago, it needs to deserialise it. If the event schema has changed 4 times since then, the processor needs to know all 4 versions and how to migrate between them.
The fix: keep schemas backward-compatible (only add optional fields with defaults; never remove or rename existing ones). Use a schema registry (Confluent Schema Registry, Karapace) with a schema format like Avro or Protobuf, and treat backward compatibility as a contract that ships with the schema. Without this discipline, replays break on old events and you can’t reprocess.
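For example, here are two versions of a hypothetical Avro record where the second stays backward-compatible by adding an optional field with a default, so old events pulled up during a replay still deserialise under the new schema.

```python
# Two Avro schema versions as Python dicts. The record and field names are
# illustrative, not a real production schema.
request_v1 = {
    "type": "record",
    "name": "Request",
    "fields": [
        {"name": "endpoint",    "type": "string"},
        {"name": "response_ms", "type": "long"},
    ],
}

request_v2 = {
    "type": "record",
    "name": "Request",
    "fields": [
        {"name": "endpoint",    "type": "string"},
        {"name": "response_ms", "type": "long"},
        # New field: nullable with a default, so readers of v2 can still
        # deserialise two-year-old v1 events during a replay.
        {"name": "region", "type": ["null", "string"], "default": None},
    ],
}
```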
5. Sink idempotency is non-negotiable
Replays write to the sink with the same logic that live data uses. If the sink isn’t idempotent, you get duplicates after replay. Every Kappa pipeline needs:
- Idempotent writes: MERGE-style upsert keyed on a unique event id, not append-only INSERT.
- Atomic swap on output: when reprocessing into a new output, switch readers atomically (e.g., flip a Delta table version, repoint a database alias) rather than draining the old one.
If your sink doesn’t support upserts (analytics warehouses sometimes don’t), you lose Kappa’s main argument. Pick a sink that does.
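As one hedged sketch of what that looks like with a transactional table format: an upsert into a Delta Lake table through PySpark’s DeltaTable API, keyed on the event id. The table path and column names are placeholders; replaying the same events again updates existing rows instead of appending duplicates.

```python
# Idempotent sink writes via MERGE, assuming Delta Lake + PySpark.
# Path and column names are placeholders.
from delta.tables import DeltaTable

def upsert(spark, batch_df):
    target = DeltaTable.forPath(spark, "s3://bucket/serving/endpoint_hourly")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")  # keyed on a unique event id
        .whenMatchedUpdateAll()      # replayed events overwrite their earlier rows
        .whenNotMatchedInsertAll()   # genuinely new events are inserted
        .execute()
    )
```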
When Kappa wins (concretely)
In 2025, Kappa is the right call when:
- Your source data is event-shaped (clicks, transactions, sensor readings, log lines).
- You’re starting fresh (no decade of batch jobs to replace).
- Your team is comfortable with stream-processing semantics.
- Your sinks support idempotent writes.
Roughly: “most modern data platforms that aren’t already built around batch.”
When Lambda still wins (or a hybrid)
A few specific cases where I’d pick Lambda or a Kappa-with-batch-augmentation hybrid:
- Regulated audit requirements: see Part 1 — some compliance regimes prefer the batch layer as a defensible source of truth.
- Massive existing batch investment: rewriting 10 years of Spark in Flink is a multi-year project, often not worth it.
- Workloads where batch is genuinely cheaper: a once-a-quarter regulatory report computed over the entire dataset is cheaper as a Spark job than as a Flink job holding state for 90 days.
- Mixed-style workloads: real-time dashboards (Kappa) + monthly cohort analyses (batch) on the same source. Run both, label them clearly, accept the cost.
What modern Kappa actually looks like
In 2025, a typical Kappa setup might look like:
- Event log: Kafka with tiered storage (hot = 7 days SSD, warm = 1 year object storage).
- Stream processor: Flink for high-throughput / complex stateful work, or Kafka Streams for simpler transforms.
- Sink: Delta Lake, Apache Iceberg, or Apache Hudi (transactional table formats with MERGE).
- Schema registry: Confluent or Karapace, with mandatory backward-compatibility checks in CI.
- Replay tooling: A “rewind to offset X” runbook, a test environment that mirrors production, and a CI job that replays the last week’s events on every code change to catch reprocessing bugs.
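For the runbook piece, a hedged sketch of “find the offsets for a timestamp one week back” with the confluent-kafka client; the topic, group, and broker address are placeholders.

```python
# Resolve "7 days ago" to per-partition offsets, then start the replay there.
import time
from confluent_kafka import Consumer, TopicPartition

TOPIC = "events"   # placeholder topic name
SEVEN_DAYS_AGO_MS = int((time.time() - 7 * 24 * 3600) * 1000)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "ci-replay-check",
})

metadata = consumer.list_topics(TOPIC, timeout=10)
wanted = [
    TopicPartition(TOPIC, p, SEVEN_DAYS_AGO_MS)   # the offset field carries the timestamp here
    for p in metadata.topics[TOPIC].partitions
]
start_offsets = consumer.offsets_for_times(wanted, timeout=10)
consumer.assign(start_offsets)   # replay from roughly one week back
```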
The infrastructure has matured to the point where this is a maintained-by-three-engineers stack, not a small army. That maturation is what made Kappa practical at scale.
Closing the series
Lambda solved a 2014 problem. Kappa solved Lambda’s pain by leaning on durable event logs and capable stream processors that didn’t exist in 2014. The architecture choice in 2025 is mostly “Kappa unless you’re in a regulated / legacy-batch context.”
The real lesson is more general than either: architecture decisions are time-bounded. The answer in 2014 was Lambda; in 2018 it was Kappa; in 2030 it’ll be something neither of us has named yet. What matters is understanding why each generation chose what it chose, so you can recognise when conditions have changed enough to warrant rethinking.
Back to: part 1 — Lambda architecture, where it came from and why it hurt.