post data-engineering · 2026-05-03 · 7 min read

Streaming pipelines: the corner cases that the docs gloss over

#data-engineering #streaming #flink #spark #kafka #learnings

The first streaming job most teams ship is a batch job translated naively. The second is a real streaming job that quietly drops events. The third is the one where you’ve actually read the watermark documentation, set up side outputs, and instrumented enough to know when a sink is silently lying to you.

This post is the gap between what the docs say and what bites: corner cases I’ve hit across Flink 1.20 and Spark Structured Streaming 3.5/4.x that don’t show up in the “getting started” page.

Time-domain confusion (the prerequisite mistake)

Three timestamps live on every event:

                producer          broker           operator
              buffer/retry     partition lag   scheduling jitter
                    │                │                 │
event-time ─●───────│────────●───────│─────────●───────│──────────●─────→
            │       ▼        │       ▼         │       ▼          │
ingestion ──┼───────●────────┼───────●─────────┼───────●──────────┼─────→
            │                │                 │                  │
processing ─┴────────────────●─────────────────●──────────────────●─────→
            ● = same event observed in three different clocks

A query written against processing time works on a clean local laptop and silently corrupts results in production, because real producers buffer, retry, and drift. Always anchor windowing and joins to event time. Both Flink and Spark default to event time when you call withWatermark / assign_timestamps_and_watermarks, but PyFlink’s processing_time window helpers are tempting and wrong for almost any real use case.
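To make the difference concrete, here is a minimal sketch in plain Python (no engine involved) that assigns one event to a 5-minute tumbling window twice, once by event time and once by processing time. The timestamps and the `window_start` helper are made up for illustration; they are not part of either engine's API.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    return ts - ((ts - EPOCH) % size)


# Hypothetical event: created at 12:04:50, but producer retries mean the
# operator only sees it 30 seconds later.
event_time = datetime(2026, 5, 3, 12, 4, 50, tzinfo=timezone.utc)
processing_time = event_time + timedelta(seconds=30)

by_event_time = window_start(event_time)            # the 12:00 window
by_processing_time = window_start(processing_time)  # the 12:05 window

print(by_event_time.time(), by_processing_time.time())
```

The same record lands in two different windows depending on which clock you trust. On a laptop the delay is near zero, so the two answers agree and the bug stays hidden.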

Watermarks: same idea, different knobs

A watermark in both engines answers the same question: “given the events I’ve seen so far, how far back in event time am I still willing to accept new ones?” Anything older than that line is treated as late.

That much is shared. The practical differences are about how many knobs you get and what happens to data past the threshold:

| | Spark Structured Streaming | Flink |
|---|---|---|
| Knobs | one threshold via withWatermark("ts", "10 min") | three layered knobs (see below) |
| Late events within the threshold | included in the window | included in the window |
| Late events past the threshold | dropped silently | drop, OR keep window open, OR route to DLQ (your choice) |
| Watermark lives at | the query / stateful operator | every operator (it travels alongside the data) |

Flink’s three knobs:

  1. Bounded out-of-orderness: how far behind the max-seen event time the watermark sits. This is the equivalent of Spark’s single threshold.
  2. Allowed lateness: keep the window open after the watermark passes, so further-late events still update an already-emitted result.
  3. Side output: route events past allowed lateness to a separate stream you can write to a DLQ instead of dropping them.

Spark: one threshold, then drop

         max event_time seen ─────────────────────●
                                                  │
                            ◀──── threshold ──────
                            │
──── on-time ──────────────▶│ watermark │── late ──▶ DROPPED
                            │
                            ▼

Flink: three layered knobs, so dropping is avoidable (it is still what happens if you set none of them)

──── on-time ──────▶│ wm │── late, ≤ allowed_lateness ──▶ UPDATE window
                         │
                         past allowed_lateness ──▶ side output ──▶ DLQ

The takeaway: both engines drop records that fall behind the watermark unless you configure otherwise, but only Flink lets you configure otherwise. The concept is the same; Spark gives you one knob where Flink gives you three, so a late record that Flink could fold into the window or route to a DLQ is simply gone in Spark. Set Spark’s threshold conservatively, or build a manual late-data filter (shown below).
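As a rough illustration of how the three knobs layer, here is a toy model in plain Python. It is a sketch, not how either engine is implemented: times are integer seconds, there is a single tumbling window operator, and the class and field names (`LateRouter`, `route`, and so on) are invented for this example. Setting `allowed_lateness=0` and treating the side output as a drop approximates Spark's single-threshold behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class LateRouter:
    """Toy model of one tumbling-window operator (all times in seconds)."""
    out_of_orderness: int = 60   # knob 1: watermark lags max event time by this
    allowed_lateness: int = 60   # knob 2: window stays updatable this long
    window_size: int = 300
    max_event_time: int = 0
    side_output: list = field(default_factory=list)  # knob 3: feeds the DLQ

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.out_of_orderness

    def route(self, event_time: int) -> str:
        window_end = (event_time // self.window_size + 1) * self.window_size
        # Classify against the watermark as it stood before this event.
        if self.watermark < window_end:
            verdict = "on-time"
        elif self.watermark < window_end + self.allowed_lateness:
            verdict = "late-update"
        else:
            verdict = "side-output"
            self.side_output.append(event_time)
        self.max_event_time = max(self.max_event_time, event_time)
        return verdict


router = LateRouter()
verdicts = [router.route(t) for t in (100, 400, 290, 500, 250)]
print(verdicts, router.side_output)
```

The third event (290) arrives after its window has closed but within allowed lateness, so it updates the result; the last one (250) is past allowed lateness and goes to the side output instead of disappearing.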

                        ┌── on-time ───▶ window aggregation ──▶ main sink
keyed stream ──▶ window
                        └── past watermark ──▶ side output (OutputTag) ──▶ DLQ

PyFlink:

from pyflink.common import Time, Types
from pyflink.datastream import OutputTag
from pyflink.datastream.window import TumblingEventTimeWindows

# Tag for records that arrive after watermark + allowed lateness.
# The type info must match the element type of `stream`.
late_tag = OutputTag(
    "late-events",
    Types.ROW([Types.STRING(), Types.LONG()]),
)

windowed = (
    stream
    .key_by(lambda e: e.key)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    # PyFlink takes allowed lateness in milliseconds, not a Time object
    .allowed_lateness(Time.minutes(1).to_milliseconds())
    .side_output_late_data(late_tag)
    .process(MyAggregator())
)

late_stream = windowed.get_side_output(late_tag)
late_stream.add_sink(dlq_sink)  # never lose a late event

Three years of running this pattern: roughly 0.1–2% of events arrive past the watermark in any non-trivial pipeline. Without a side output, that’s silent data loss.

Spark Structured Streaming has no equivalent. Records past the watermark are dropped at the stateful operator with no hook to capture them. The workaround is to compute lateness yourself before the window and route in PySpark:

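from pyspark.sql import functions as F

watermark_seconds = 600  # must match what you pass to .withWatermark()

events = (
    spark.readStream.format("kafka").option(...).load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as event_time")
)

# Approximate lateness as wall-clock time minus event time.
lateness_seconds = (
    F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp("event_time")
)
is_late = lateness_seconds > watermark_seconds

late = events.filter(is_late)
# Use the complementary filter rather than events.subtract(late):
# EXCEPT against a streaming DataFrame is not supported, and it would also dedupe rows.
on_time = events.filter(~is_late)
# write `late` to a DLQ table; run windowed agg on `on_time`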

It’s not as clean as Flink’s side output (there is no per-window watermark, so you’re approximating), but it stops the silent-drop problem for the common case.

Stream-stream joins are constrained, not free

The tempting mental model is “two infinite streams, join when keys match, emit a row.” Reality:

  1. Both sides need watermarks: without bounds on lateness, the join state grows forever.
  2. Joins need an event-time window predicate (leftTime BETWEEN rightTime - 10 minutes AND rightTime + 5 minutes). Without it, every left-side row sits in state waiting for any right-side row, ever.
  3. Outer joins emit nulls only when the watermark passes the window: your “left outer join” produces left-only rows on a delay matching the watermark, not immediately. This regularly trips up alerting use cases (“alert when click happens with no purchase”).

In Flink, IntervalJoin enforces #2 explicitly. In Spark, stream-stream outer joins in append mode require a watermark plus a time-range predicate, and the planner rejects the query otherwise. For a left outer join the watermark on the right side is the mandatory one; watermarking both sides is what lets Spark clean up all of the join state. Inner joins are permitted without them, but state grows without bound, which is its own footgun. Either way, set the watermarks; the planner errors are the friendly version of the problem.
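The link between the time-range predicate and bounded state is easier to see in code. The following is a simplified sketch in plain Python, not either engine's implementation: the `IntervalJoin` class, its method names, and the integer-second timestamps are all assumptions made for the example. It uses the same bounds as the predicate above (left within 10 minutes before to 5 minutes after right).

```python
from collections import defaultdict

LOWER = 10 * 60  # leftTime >= rightTime - 10 minutes
UPPER = 5 * 60   # leftTime <= rightTime + 5 minutes


class IntervalJoin:
    """Toy keyed interval join that evicts state as the watermark advances."""

    def __init__(self):
        self.left = defaultdict(list)   # key -> left event times
        self.right = defaultdict(list)  # key -> right event times

    @staticmethod
    def _matches(left_ts, right_ts):
        return right_ts - LOWER <= left_ts <= right_ts + UPPER

    def on_left(self, key, ts):
        self.left[key].append(ts)
        return [(key, ts, r) for r in self.right[key] if self._matches(ts, r)]

    def on_right(self, key, ts):
        self.right[key].append(ts)
        return [(key, l, ts) for l in self.left[key] if self._matches(l, ts)]

    def on_watermark(self, wm):
        """Drop rows that no future non-late event could still match."""
        for key in list(self.left):
            # a left row at l can match right rows up to l + LOWER
            self.left[key] = [l for l in self.left[key] if l + LOWER >= wm]
        for key in list(self.right):
            # a right row at r can match left rows up to r + UPPER
            self.right[key] = [r for r in self.right[key] if r + UPPER >= wm]

    def state_size(self):
        rows = list(self.left.values()) + list(self.right.values())
        return sum(len(v) for v in rows)


join = IntervalJoin()
join.on_left("u1", 1000)
matched = join.on_right("u1", 1200)    # within the interval
unmatched = join.on_right("u1", 5000)  # far outside it
size_before = join.state_size()
join.on_watermark(2000)
size_after = join.state_size()
print(matched, unmatched, size_before, size_after)
```

Without the interval, `on_watermark` would have no rule for deciding that a row is finished, which is exactly why unbounded joins hold state forever.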

Session windows: the easiest way to balloon state

Session windows close after a gap of inactivity per key. Two failure modes:

Long-tail keys never close. A bot pings your endpoint every 4 minutes for a week. With a 5-minute gap, that “session” is open for 7 days, holding state the whole time. Cap session length explicitly, or use a periodic timer to force-close.

Sessions merge unexpectedly. Two events 6 minutes apart with a 5-minute gap create two sessions. Add a late event that lands midway between them and the gaps shrink to 3 minutes each, so the engine retroactively merges the two sessions into one, emitting a retraction plus a re-emit downstream. Make sure your sink handles updates, not just appends.
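Both failure modes can be reproduced with a few lines of plain Python. This is a sketch under simplifying assumptions (one key, integer-second timestamps, batch recomputation instead of incremental merging), and `sessionize` and its `max_length` cap are hypothetical names, not an engine API.

```python
GAP = 5 * 60  # seconds of inactivity that closes a session


def sessionize(event_times, gap=GAP, max_length=None):
    """Group event times into sessions, returned as (start, end, count)."""
    sessions = []
    for ts in sorted(event_times):
        if sessions:
            start, end, count = sessions[-1]
            within_gap = ts - end < gap
            within_cap = max_length is None or ts - start <= max_length
            if within_gap and within_cap:
                sessions[-1] = (start, ts, count + 1)
                continue
        sessions.append((ts, ts, 1))
    return sessions


# Merge: two events 6 minutes apart are separate sessions until a late
# event at minute 3 bridges them.
before = sessionize([0, 360])
after = sessionize([0, 360, 180])

# Long tail: a ping every 4 minutes for an hour never closes the session
# unless the length is capped (here at 30 minutes).
pings = list(range(0, 3601, 240))
uncapped = sessionize(pings)
capped = sessionize(pings, max_length=1800)

print(before, after, uncapped, capped)
```

Note that `after` replaces two previously emitted results with one, which is why an append-only sink ends up with stale rows.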

Exactly-once is end-to-end, not engine-only

This one bites the most. People read “Spark Structured Streaming supports exactly-once” and ship to production assuming dedup is solved. The actual guarantee:

The actionable rule: assume at-least-once for any stream → external sink, and design idempotent writes (MERGE on event_id) regardless of what the engine claims. Engine-level exactly-once protects state recovery on restart; it does not, by itself, dedupe across the wire to your warehouse.
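A minimal sketch of what an idempotent write looks like, using SQLite from the standard library as a stand-in for the warehouse. The `purchases` table, its columns, and `write_batch` are hypothetical, and `INSERT OR REPLACE` plays the role that MERGE would play in a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (event_id TEXT PRIMARY KEY, amount_cents INTEGER)"
)


def write_batch(conn, rows):
    """Upsert keyed on event_id, so replaying a batch changes nothing."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR REPLACE INTO purchases (event_id, amount_cents) "
            "VALUES (?, ?)",
            rows,
        )


batch = [("evt-1", 1200), ("evt-2", 450)]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated redelivery after a restart

row_count, total_cents = conn.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM purchases"
).fetchone()
print(row_count, total_cents)
```

With a plain INSERT the replayed batch would either fail or double the total, depending on whether the key is enforced. Keying the write on event_id makes redelivery harmless.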

Backpressure is invisible until it isn’t

When the sink is slower than the source, something has to give. The mechanisms differ wildly:

Flink: backpressure visible as climbing pool usage upstream

source ─[credits]─▶ map ─[credits=0]─▶ slow sink
   ▲                       │
   └── stalls ─────────────┘
outPoolUsage → 1.0 (the alert signal)

Spark: backpressure shows up as growing offset lag and trigger overruns

trigger=30s │ batch_n ████████ 45s │
                                   ▲
                        batch_n+1 starts late,
                        backlog grows, processedRowsPerSecond
                        drops below inputRowsPerSecond

Both engines will eventually fall behind silently. Always alert on (latest source offset) - (committed offset). That single metric catches more incidents than any other streaming alert I’ve set up.
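The lag metric itself is simple arithmetic. Here is a sketch in plain Python, assuming you already collect per-partition offsets from your broker and your job's checkpoint; how you fetch them is out of scope, and `total_lag` and `lag_is_growing` are illustrative names rather than any library's API.

```python
def total_lag(latest_offsets, committed_offsets):
    """Sum of (latest - committed) per partition; a missing commit counts as 0."""
    return sum(
        max(latest - committed_offsets.get(partition, 0), 0)
        for partition, latest in latest_offsets.items()
    )


def lag_is_growing(samples, min_samples=3):
    """True when lag increased on each of the last `min_samples` readings."""
    recent = samples[-(min_samples + 1):]
    if len(recent) <= min_samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


latest = {0: 1500, 1: 900}      # hypothetical partition -> latest offset
committed = {0: 1400, 1: 900}   # hypothetical partition -> committed offset

current_lag = total_lag(latest, committed)
sustained = lag_is_growing([100, 90, 120, 180, 260])
noisy = lag_is_growing([100, 300, 250, 260])
print(current_lag, sustained, noisy)
```

Alerting on a sustained rise rather than on a single threshold crossing avoids paging on short bursts, though the right number of samples depends on your trigger interval.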

Schema evolution mid-stream

Producers will deploy a new schema version while your job is running. Three things happen:

  1. Old schema events are still in transit (Kafka retention).
  2. New schema events start arriving.
  3. Your deserialiser was compiled against the old schema.

Survival requires (a) a schema registry with backward-compatibility checks enforced in CI, (b) DESERIALISATION_FAILURE → DLQ rather than crash-on-bad-message, and (c) a periodic CI replay of “last 7 days of events through current code” to catch changes that pass compatibility checks but break business logic.
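Point (b) can be sketched in plain Python as a deserialiser that never raises. It assumes JSON payloads and a made-up contract of two required fields; a real job would use its registry's deserialiser, and the `deserialise` function and DLQ envelope shape here are hypothetical.

```python
import json

REQUIRED_FIELDS = {"event_id", "event_time"}  # hypothetical contract


def deserialise(raw: bytes):
    """Return ("ok", record) or ("dlq", envelope); never raise on bad input."""
    try:
        record = json.loads(raw.decode("utf-8"))
        if not isinstance(record, dict):
            raise ValueError("payload is not a JSON object")
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return "ok", record
    except (UnicodeDecodeError, ValueError) as exc:
        # Keep the raw bytes so the message can be replayed after a fix.
        return "dlq", {"raw": raw, "error": str(exc)}


messages = [
    b'{"event_id": "a1", "event_time": 1714730000}',
    b'{"event_id": "a2"}',   # producer shipped a schema without event_time
    b"\xff\xfe not json",    # undecodable bytes
]
results = [deserialise(m) for m in messages]
routes = [route for route, _ in results]
print(routes)
```

Keeping the original bytes and the error text in the DLQ envelope is what makes the replay in point (c) possible once the deserialiser is fixed.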

When batch is genuinely the right answer

A few signals that the streaming framing is wrong:

Streaming buys you continuous latency. If your downstream consumer doesn’t actually need that, you’re paying for it anyway.

Closing

The surface API of streaming engines is small. The corner cases (watermark semantics, late-data routing, join constraints, session merges, end-to-end exactly-once, backpressure, schema drift) are where production lives.

The minimum bar I now apply to any streaming job before it ships: explicit watermark with measured p99 lateness, side-output or filter-and-DLQ for late records, idempotent sink keyed on event_id, alert on offset lag, and a quarterly replay test. That’s not the whole list; it’s the floor.