post data-engineering · 2026-05-03 · 7 min read

Streaming pipelines: the corner cases that the docs gloss over

#data-engineering#streaming#flink#spark#kafka#learnings

The first streaming job most teams ship is a batch job translated naively. The second is a real streaming job that quietly drops events. The third is the one where you’ve actually read the watermark documentation, set up side outputs, and instrumented enough to know when a sink is silently lying to you.

This post is the gap between what the docs say and what bites — corner cases I’ve hit across Flink 1.20 and Spark Structured Streaming 3.5/4.x that don’t show up in the “getting started” page.

Time-domain confusion (the prerequisite mistake)

Three timestamps live on every event:

event time — when the thing actually happened (clock on the producing device)
ingestion time — when the broker received it
processing time — when your operator sees it

                producer        broker            operator
                buffer/retry    partition lag     scheduling jitter
                    │                │                 │
event-time ─●───────│────────●───────│─────────●───────│──────────●─────→
            │       ▼        │       ▼         │       ▼          │
ingestion ──┼───────●────────┼───────●─────────┼───────●──────────┼─────→
            │                │                 │                  │
processing ─┴────────────────●─────────────────●──────────────────●─────→

  ● = same event observed in three different clocks

A query written against processing time works on a clean local laptop and silently corrupts results in production, because real producers buffer, retry, and drift. Always anchor windowing and joins to event time. Both Flink and Spark default to event time when you call withWatermark / assign_timestamps_and_watermarks, but PyFlink’s processing_time window helpers are tempting and wrong for almost any real use case.

Watermarks: same idea, different knobs

A watermark in both engines answers the same question: “given the events I’ve seen so far, how far back in event time am I still willing to accept new ones?” Anything older than that line is treated as late.

That much is shared. The practical differences are about how many knobs you get and what happens to data past the threshold:

	Spark Structured Streaming	Flink
Knobs	one threshold via `withWatermark("ts", "10 min")`	three layered knobs (see below)
Late events within the threshold	included in the window	included in the window
Late events past the threshold	dropped silently	drop, OR keep window open, OR route to DLQ — your choice
Watermark lives at	the query / stateful operator	every operator (it travels alongside the data)

Flink’s three knobs:

Bounded out-of-orderness — how far behind the max-seen event time the watermark sits. This is the equivalent of Spark’s single threshold.
Allowed lateness — keep the window open after the watermark passes, so further-late events still update an already-emitted result.
Side output — route events past allowed lateness to a separate stream you can write to a DLQ instead of dropping them.

Spark: one threshold, then drop

   max event_time seen ─────────────────────●
                                            │
                       ◀──── threshold ─────┤
                                            │
   ──── on-time ──────────────▶│ watermark │── late ──▶ DROPPED
                                            │
                                            ▼

Flink: three layered knobs, drop is opt-in not default

   ──── on-time ──────▶│ wm │── late, ≤ allowed_lateness ──▶ UPDATE window
                                                            │
                                            past allowed_lateness ──▶ side output ──▶ DLQ

The takeaway: if you ship the same job in both engines with default settings, Spark will silently lose more late events than Flink — not because the concept differs, but because Spark gives you one knob where Flink gives you three. Set Spark’s threshold conservatively, or build a manual late-data filter (next section).

Late-data side outputs (PyFlink) and the Spark workaround

                       ┌── on-time ───▶ window aggregation ──▶ main sink
   keyed stream ──▶ window
                       └── past watermark ──▶ side output (OutputTag) ──▶ DLQ

PyFlink:

from pyflink.common import Types, Time
from pyflink.datastream import OutputTag
from pyflink.datastream.window import TumblingEventTimeWindows

late_tag = OutputTag(
    "late-events",
    Types.ROW([Types.STRING(), Types.LONG()]),
)

windowed = (
    stream
    .key_by(lambda e: e.key)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowed_lateness(Time.minutes(1))
    .side_output_late_data(late_tag)
    .process(MyAggregator())
)

late_stream = windowed.get_side_output(late_tag)
late_stream.add_sink(dlq_sink)   # never lose a late event

Three years of running this pattern: roughly 0.1–2% of events arrive past the watermark in any non-trivial pipeline. Without a side output, that’s silent data loss.

Spark Structured Streaming has no equivalent. Records past the watermark are dropped at the stateful operator with no hook to capture them. The workaround is to compute lateness yourself before the window and route in PySpark:

from pyspark.sql import functions as F

watermark_seconds = 600  # must match what you pass to .withWatermark()

events = (
    spark.readStream.format("kafka").option(...).load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as event_time")
)

late = events.filter(
    (F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp("event_time"))
    > watermark_seconds
)
on_time = events.subtract(late)

# write `late` to a DLQ table; run windowed agg on `on_time`

It’s not as clean as Flink’s side output (no per-window watermark — you’re approximating), but it stops the silent-drop problem for the common case.

Stream-stream joins are constrained, not free

The tempting mental model is “two infinite streams, join when keys match, emit a row.” Reality:

Both sides need watermarks — without bounds on lateness, the join state grows forever.
Joins need an event-time window predicate (leftTime BETWEEN rightTime - 10 minutes AND rightTime + 5 minutes). Without it, every left-side row sits in state waiting for any right-side row, ever.
Outer joins emit nulls only when the watermark passes the window — your “left outer join” produces left-only rows on a delay matching the watermark, not immediately. This trips up alerting use cases (“alert when click happens with no purchase”) regularly.

In Flink: IntervalJoin enforces #2 explicitly. In Spark: stream-stream outer joins in append mode require watermarks on both sides plus a time-range predicate — the planner rejects the query otherwise. Inner joins are permitted without them, but state grows without bound, which is its own footgun. Either way, set the watermarks; the planner errors are the friendly version of the problem.

Session windows: the easiest way to balloon state

Session windows close after a gap of inactivity per key. Two failure modes:

Long-tail keys never close. A bot pings your endpoint every 4 minutes for a week. With a 5-minute gap, that “session” is open for 7 days, holding state the whole time. Cap session length explicitly, or use a periodic timer to force-close.

Sessions merge unexpectedly. Two events 4 minutes apart with a 5-minute gap window create one session. Add a 4-minute-late event that lands between them, gap is now 4-minute-3-minute-3-minute, and the engine merges them retroactively into one session, emitting a retraction + re-emit downstream. Make sure your sink handles updates, not just appends.

Exactly-once is end-to-end, not engine-only

This one bites the most. People read “Spark Structured Streaming supports exactly-once” and ship to production assuming dedup is solved. The actual guarantee:

Spark Structured Streaming: exactly-once into the file sink (atomic commit via _spark_metadata). The Kafka sink is at-least-once. Per the official docs, retried writes can produce duplicates. Workaround: enable Kafka’s idempotent producer (kafka.enable.idempotence=true) and dedup on read using a unique event id.
Flink: exactly-once via two-phase commit checkpoints, but only when paired with a sink that participates. The modern KafkaSink requires DeliveryGuarantee.EXACTLY_ONCE plus a unique transactionalIdPrefix; Iceberg sinks commit on checkpoint; JDBC has an XA-based exactly-once variant. A vanilla addSink with no transactional support is at-least-once.

The actionable rule: assume at-least-once for any stream → external sink, and design idempotent writes (MERGE on event_id) regardless of what the engine claims. Engine-level exactly-once protects state recovery on restart; it does not, by itself, dedupe across the wire to your warehouse.

Backpressure is invisible until it isn’t

When the sink is slower than the source, something has to give. The mechanisms differ wildly:

Flink: credit-based flow control. Slow sinks pause upstream operators. You see it as inPoolUsage / outPoolUsage climbing in the dashboard.
Spark Structured Streaming: micro-batch sizing. If a batch takes longer than the trigger interval, the next batch waits. With no maxOffsetsPerTrigger, a backlog dumps gigabytes into one batch and the job appears frozen.

Flink: backpressure visible as climbing pool usage upstream

   source ─[credits]─▶ map ─[credits=0]─▶ slow sink
                       ▲                       │
                       └── stalls ─────────────┘
                       outPoolUsage → 1.0  (the alert signal)

Spark: backpressure shows up as growing offset lag and trigger overruns

   trigger=30s  │ batch_n  ████████ 45s  │
                                          ▲
                                          batch_n+1 starts late,
                                          backlog grows, processedRowsPerSecond
                                          drops below inputRowsPerSecond

Both engines will eventually fall behind silently. Always alert on (latest source offset) - (committed offset). That single metric catches more incidents than any other streaming alert I’ve set up.

Schema evolution mid-stream

Producers will deploy a new schema version while your job is running. Three things happen:

Old schema events are still in transit (Kafka retention).
New schema events start arriving.
Your deserialiser was compiled against the old schema.

Survival requires (a) a schema registry with backward-compatibility checks enforced in CI, (b) DESERIALISATION_FAILURE → DLQ rather than crash-on-bad-message, and (c) a periodic CI replay of “last 7 days of events through current code” to catch changes that pass compatibility checks but break business logic.

When batch is genuinely the right answer

A few signals that the streaming framing is wrong:

Latency tolerance > 30 minutes — a scheduled hourly job is simpler, cheaper, and easier to backfill.
Multi-table joins involving slowly-changing dimensions — batch joins are saner than orchestrating watermarks across five streams.
Cost-sensitive workloads on cloud — keeping a Flink/Spark cluster warm 24/7 dwarfs the cost of an hourly serverless batch run.

Streaming buys you continuous latency. If your downstream consumer doesn’t actually need that, you’re paying for it anyway.

Closing

The surface API of streaming engines is small. The corner cases — watermark semantics, late-data routing, join constraints, session merges, end-to-end exactly-once, backpressure, schema drift — are where production lives.

The minimum bar I now apply to any streaming job before it ships: explicit watermark with measured p99 lateness, side-output or filter-and-DLQ for late records, idempotent sink keyed on event_id, alert on offset lag, and a quarterly replay test. That’s not the whole list; it’s the floor.