post data-engineering Β· 2025-10-08 Β· 6 min read
Spark partition sizing: what I learned the hard way
I once spent an afternoon convinced our 200-line Spark pipeline had a bug, because adding 10% more data tripled the runtime. The bug was the data itself fit awkwardly into Sparkβs default partition count. The pipeline was fine. The math wasnβt. Once I learned the rules around partition sizing the runtime collapsed back. This post is the field guide to spare you that afternoon.
The default trap
Sparkβs default spark.sql.shuffle.partitions is 200. It has been 200 since around Spark 1.x. Nobody changed it.
your data size β shuffle β resulting partitionsβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ50 MB 200, each β 250 KB β way too many tiny partitions5 GB 200, each β 25 MB β starting to be okay50 GB 200, each β 250 MB β too few, stragglers500 GB 200, each β 2.5 GB β OOMs likelyThe rule of thumb: 128-256 MB per partition is the sweet spot for most jobs. Smaller wastes scheduling overhead; larger risks GC and OOM.
For 50 GB of data, you want roughly 200-400 partitions, which is fine with the default. For 500 GB you want 2,000-4,000. For 50 MB you want 1, maybe 2.
Diagnose first
Before tuning, look at what the partitions actually look like:
from pyspark.sql import functions as F
df.groupBy(F.spark_partition_id().alias("pid")) \ .count() \ .orderBy(F.desc("count")) \ .show(10)Three patterns I look for:
1. Many tiny partitions:
+---+--------+|pid| count|+---+--------+| 0| 1247|| 1| 1192|| 2| 1183|... (197 more like this)Spark has scheduling overhead per partition (~1-10 ms). With 200 partitions of 1k rows each, youβre spending more time scheduling than processing. Coalesce.
2. Skewed partitions:
+---+---------+|pid| count|+---+---------+| 3|6_834_192| β one partition has 90% of the data| 0| 12_443|| 1| 12_117|... (197 more small)This is data skew, covered in detail in PySpark skew detection. Different problem, different fix (salt, broadcast, AQE skew-join).
3. Right-sized partitions:
+---+---------+|pid| count|+---+---------+| 0| 423_192|| 1| 421_117|| 12| 419_847|... (median = max Β± 5%)Rare, ideal. Leave alone.
repartition vs coalesce: when to use which
Both change partition count. Theyβre not interchangeable.
ββββββββββββββββββββββββββββββββββββββββββββββββ β β repartition(N) β ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ β - full shuffle β βA β βB β βC β βD β βE β βF β βG β βH β β - even distribution β ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ β - works up OR down β β β β β β β β β β β βββββββββββ βββββββββββ βββββββββββ β β β AB β β CDEF β β GH β β β βββββββββββ βββββββββββ βββββββββββ β β β shuffles all rows across the wire β ββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββ β β coalesce(N) β ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ β - no shuffle β βA β βB β βC β βD β βE β βF β βG β βH β β - just merges adjacent β ββββ ββββ ββββ ββββ ββββ ββββ ββββ ββββ β - down only β βββA+Bββ βββC+Dββ βββE+Fββ βββG+Hββ β - may produce skew β β no rows move; partitions stitched β ββββββββββββββββββββββββββββββββββββββββββββββββUse repartition(N) when:
- Youβre going up (200 β 1000)
- Existing partitions are skewed and you want even distribution
- You care more about balance than shuffle cost
Use coalesce(N) when:
- Youβre going down (1000 β 100), AND existing partitions are roughly even
- You want to avoid the shuffle cost
- Common use: writing many small partitions to disk β coalesce to fewer files
# many small files becoming one big file at write timeresult.coalesce(1).write.parquet("/output/single_file.parquet") # only safe if data is small
# rebalancing skewed partitions before a heavy operationeven = skewed_df.repartition(400, "user_id")even.groupBy("user_id").agg(...).write...The most common mistake: using coalesce(1) on a 100 GB dataset to get one output file. That collapses all 100 GB onto one executor and OOMs. Use repartition(1) if you really need one file (still slow), or live with N files.
When AQE saves you (and when it doesnβt)
Adaptive Query Execution (Spark 3.0+) can dynamically merge small partitions and split large ones at runtime. Two flags to enable:
spark.conf.set("spark.sql.adaptive.enabled", "true")spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")What this does: after each shuffle stage, AQE looks at the partition sizes and merges adjacent small ones into target-size chunks (default 64 MB).
When AQE saves you:
- Aggregations that produce skewed-but-fixable output sizes
- Joins where one side is much smaller than expected
- Ad-hoc queries on data of unknown size
When AQE does not save you:
- The very first read from disk before any shuffle (AQE only kicks in post-shuffle)
- Skewed data within a partition (you need
skewJoin.enabledfor that) - Workloads where the heuristic mispicks (rare in 3.4+, common in 3.0)
A common confusion: βAQE is on, why are my partitions still wrong?β Because AQE only acts post-shuffle. If your initial read produces 20,000 tiny partitions, AQE doesnβt see them. You need repartition() after reading.
The diagnostic loop I run
1. Run the job. Note the runtime.2. Open Spark UI β Stages β look at the slowest stage.3. Click into it. Look at: - Number of tasks (= number of partitions for that stage) - Median vs max task duration - Median vs max input/output bytes4. Compute: input bytes / number of partitions = avg partition size - Below 10 MB: too many partitions, coalesce - 64β256 MB: about right, leave alone - Above 512 MB: too few, repartition up5. If max input bytes is >> median input bytes: - That's skew, see the skew-detection post6. Apply the fix, re-run, compare.The loop usually closes within 2-3 iterations on a problem job.
Patterns I now reach for
Pattern 1: explicit repartition after read
df = spark.read.parquet("/data/source") \ .repartition(400, "user_id") # explicit, even, by-key# downstream operations now have predictable partition layoutBeats letting Spark guess from input file layout.
Pattern 2: target a partition size, not a count
total_size_bytes = spark.read.parquet(path).queryExecution.optimizedPlan.stats.sizeInBytestarget_partition_size_mb = 128target_count = max(1, int(total_size_bytes / (target_partition_size_mb * 1024 * 1024)))
df.repartition(target_count, "key")This auto-scales as data grows. Hard-coding repartition(200) is a time bomb for the day data triples.
Pattern 3: coalesce only at write time
# Compute with whatever partition count makes the work fastresult = big_df.groupBy("k").agg(...)
# Coalesce only when writing to disk, to avoid 1000 small filesresult.coalesce(20).write.parquet("/output")coalesce here doesnβt shuffle (since weβre going down), and you get 20 reasonable-size output files instead of 200 tiny ones.
What I no longer do
- Setting
shuffle.partitions = 1000globally for βspeedβ. Wrong. Set it per-job based on data size. coalesce(1).writefor big data. Single-executor OOM waiting to happen.- Trusting AQE to fix everything. AQE is great post-shuffle but itβs not magic. The initial read still needs sane partitioning.
The TL;DR
- Default 200 is wrong for most data sizes. Compute the right count from data size / 128 MB.
repartitionshuffles (slow but balanced).coalescedoesnβt (fast but can keep skew).- AQE merges small post-shuffle partitions; turn it on, but donβt rely on it for the initial read.
- Look at partition record counts before tuning. Most βslow Sparkβ is one of three problems: too many partitions, too few partitions, or skew. The diagnostic loop tells you which.
Get this right and βthe cluster needs more nodesβ becomes βthe cluster was already fine, the partitions werenβtβ most of the time.