post data-engineering Β· 2026-05-03 Β· 7 min read
Streaming pipelines: the corner cases that the docs gloss over
The first streaming job most teams ship is a batch job translated naively. The second is a real streaming job that quietly drops events. The third is the one where youβve actually read the watermark documentation, set up side outputs, and instrumented enough to know when a sink is silently lying to you.
This post is the gap between what the docs say and what bites β corner cases Iβve hit across Flink 1.20 and Spark Structured Streaming 3.5/4.x that donβt show up in the βgetting startedβ page.
Time-domain confusion (the prerequisite mistake)
Three timestamps live on every event:
- event time β when the thing actually happened (clock on the producing device)
- ingestion time β when the broker received it
- processing time β when your operator sees it
producer broker operator buffer/retry partition lag scheduling jitter β β βevent-time ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β βΌ β βΌ β βΌ βingestion βββΌβββββββββββββββββΌββββββββββββββββββΌβββββββββββββββββββΌββββββ β β β βprocessing ββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β = same event observed in three different clocksA query written against processing time works on a clean local laptop and silently corrupts results in production, because real producers buffer, retry, and drift. Always anchor windowing and joins to event time. Both Flink and Spark default to event time when you call withWatermark / assign_timestamps_and_watermarks, but PyFlinkβs processing_time window helpers are tempting and wrong for almost any real use case.
Watermarks: same idea, different knobs
A watermark in both engines answers the same question: βgiven the events Iβve seen so far, how far back in event time am I still willing to accept new ones?β Anything older than that line is treated as late.
That much is shared. The practical differences are about how many knobs you get and what happens to data past the threshold:
| Spark Structured Streaming | Flink | |
|---|---|---|
| Knobs | one threshold via withWatermark("ts", "10 min") | three layered knobs (see below) |
| Late events within the threshold | included in the window | included in the window |
| Late events past the threshold | dropped silently | drop, OR keep window open, OR route to DLQ β your choice |
| Watermark lives at | the query / stateful operator | every operator (it travels alongside the data) |
Flinkβs three knobs:
- Bounded out-of-orderness β how far behind the max-seen event time the watermark sits. This is the equivalent of Sparkβs single threshold.
- Allowed lateness β keep the window open after the watermark passes, so further-late events still update an already-emitted result.
- Side output β route events past allowed lateness to a separate stream you can write to a DLQ instead of dropping them.
Spark: one threshold, then drop
max event_time seen ββββββββββββββββββββββ β βββββ threshold ββββββ€ β ββββ on-time βββββββββββββββΆβ watermark βββ late βββΆ DROPPED β βΌ
Flink: three layered knobs, drop is opt-in not default
ββββ on-time βββββββΆβ wm βββ late, β€ allowed_lateness βββΆ UPDATE window β past allowed_lateness βββΆ side output βββΆ DLQThe takeaway: if you ship the same job in both engines with default settings, Spark will silently lose more late events than Flink β not because the concept differs, but because Spark gives you one knob where Flink gives you three. Set Sparkβs threshold conservatively, or build a manual late-data filter (next section).
Late-data side outputs (PyFlink) and the Spark workaround
βββ on-time ββββΆ window aggregation βββΆ main sink keyed stream βββΆ window βββ past watermark βββΆ side output (OutputTag) βββΆ DLQPyFlink:
from pyflink.common import Types, Timefrom pyflink.datastream import OutputTagfrom pyflink.datastream.window import TumblingEventTimeWindows
late_tag = OutputTag( "late-events", Types.ROW([Types.STRING(), Types.LONG()]),)
windowed = ( stream .key_by(lambda e: e.key) .window(TumblingEventTimeWindows.of(Time.minutes(5))) .allowed_lateness(Time.minutes(1)) .side_output_late_data(late_tag) .process(MyAggregator()))
late_stream = windowed.get_side_output(late_tag)late_stream.add_sink(dlq_sink) # never lose a late eventThree years of running this pattern: roughly 0.1β2% of events arrive past the watermark in any non-trivial pipeline. Without a side output, thatβs silent data loss.
Spark Structured Streaming has no equivalent. Records past the watermark are dropped at the stateful operator with no hook to capture them. The workaround is to compute lateness yourself before the window and route in PySpark:
from pyspark.sql import functions as F
watermark_seconds = 600 # must match what you pass to .withWatermark()
events = ( spark.readStream.format("kafka").option(...).load() .selectExpr("CAST(value AS STRING) as payload", "timestamp as event_time"))
late = events.filter( (F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp("event_time")) > watermark_seconds)on_time = events.subtract(late)
# write `late` to a DLQ table; run windowed agg on `on_time`Itβs not as clean as Flinkβs side output (no per-window watermark β youβre approximating), but it stops the silent-drop problem for the common case.
Stream-stream joins are constrained, not free
The tempting mental model is βtwo infinite streams, join when keys match, emit a row.β Reality:
- Both sides need watermarks β without bounds on lateness, the join state grows forever.
- Joins need an event-time window predicate (
leftTime BETWEEN rightTime - 10 minutes AND rightTime + 5 minutes). Without it, every left-side row sits in state waiting for any right-side row, ever. - Outer joins emit nulls only when the watermark passes the window β your βleft outer joinβ produces left-only rows on a delay matching the watermark, not immediately. This trips up alerting use cases (βalert when click happens with no purchaseβ) regularly.
In Flink: IntervalJoin enforces #2 explicitly. In Spark: stream-stream outer joins in append mode require watermarks on both sides plus a time-range predicate β the planner rejects the query otherwise. Inner joins are permitted without them, but state grows without bound, which is its own footgun. Either way, set the watermarks; the planner errors are the friendly version of the problem.
Session windows: the easiest way to balloon state
Session windows close after a gap of inactivity per key. Two failure modes:
Long-tail keys never close. A bot pings your endpoint every 4 minutes for a week. With a 5-minute gap, that βsessionβ is open for 7 days, holding state the whole time. Cap session length explicitly, or use a periodic timer to force-close.
Sessions merge unexpectedly. Two events 4 minutes apart with a 5-minute gap window create one session. Add a 4-minute-late event that lands between them, gap is now 4-minute-3-minute-3-minute, and the engine merges them retroactively into one session, emitting a retraction + re-emit downstream. Make sure your sink handles updates, not just appends.
Exactly-once is end-to-end, not engine-only
This one bites the most. People read βSpark Structured Streaming supports exactly-onceβ and ship to production assuming dedup is solved. The actual guarantee:
-
Spark Structured Streaming: exactly-once into the file sink (atomic commit via
_spark_metadata). The Kafka sink is at-least-once. Per the official docs, retried writes can produce duplicates. Workaround: enable Kafkaβs idempotent producer (kafka.enable.idempotence=true) and dedup on read using a unique event id. -
Flink: exactly-once via two-phase commit checkpoints, but only when paired with a sink that participates. The modern
KafkaSinkrequiresDeliveryGuarantee.EXACTLY_ONCEplus a uniquetransactionalIdPrefix; Iceberg sinks commit on checkpoint; JDBC has an XA-based exactly-once variant. A vanillaaddSinkwith no transactional support is at-least-once.
The actionable rule: assume at-least-once for any stream β external sink, and design idempotent writes (MERGE on event_id) regardless of what the engine claims. Engine-level exactly-once protects state recovery on restart; it does not, by itself, dedupe across the wire to your warehouse.
Backpressure is invisible until it isnβt
When the sink is slower than the source, something has to give. The mechanisms differ wildly:
- Flink: credit-based flow control. Slow sinks pause upstream operators. You see it as
inPoolUsage/outPoolUsageclimbing in the dashboard. - Spark Structured Streaming: micro-batch sizing. If a batch takes longer than the trigger interval, the next batch waits. With no
maxOffsetsPerTrigger, a backlog dumps gigabytes into one batch and the job appears frozen.
Flink: backpressure visible as climbing pool usage upstream
source β[credits]ββΆ map β[credits=0]ββΆ slow sink β² β βββ stalls ββββββββββββββ outPoolUsage β 1.0 (the alert signal)
Spark: backpressure shows up as growing offset lag and trigger overruns
trigger=30s β batch_n ββββββββ 45s β β² batch_n+1 starts late, backlog grows, processedRowsPerSecond drops below inputRowsPerSecondBoth engines will eventually fall behind silently. Always alert on (latest source offset) - (committed offset). That single metric catches more incidents than any other streaming alert Iβve set up.
Schema evolution mid-stream
Producers will deploy a new schema version while your job is running. Three things happen:
- Old schema events are still in transit (Kafka retention).
- New schema events start arriving.
- Your deserialiser was compiled against the old schema.
Survival requires (a) a schema registry with backward-compatibility checks enforced in CI, (b) DESERIALISATION_FAILURE β DLQ rather than crash-on-bad-message, and (c) a periodic CI replay of βlast 7 days of events through current codeβ to catch changes that pass compatibility checks but break business logic.
When batch is genuinely the right answer
A few signals that the streaming framing is wrong:
- Latency tolerance > 30 minutes β a scheduled hourly job is simpler, cheaper, and easier to backfill.
- Multi-table joins involving slowly-changing dimensions β batch joins are saner than orchestrating watermarks across five streams.
- Cost-sensitive workloads on cloud β keeping a Flink/Spark cluster warm 24/7 dwarfs the cost of an hourly serverless batch run.
Streaming buys you continuous latency. If your downstream consumer doesnβt actually need that, youβre paying for it anyway.
Closing
The surface API of streaming engines is small. The corner cases β watermark semantics, late-data routing, join constraints, session merges, end-to-end exactly-once, backpressure, schema drift β are where production lives.
The minimum bar I now apply to any streaming job before it ships: explicit watermark with measured p99 lateness, side-output or filter-and-DLQ for late records, idempotent sink keyed on event_id, alert on offset lag, and a quarterly replay test. Thatβs not the whole list; itβs the floor.