post data-engineering · 2025-10-08 · 6 min read
Spark partition sizing: what I learned the hard way
I once spent an afternoon convinced our 200-line Spark pipeline had a bug, because adding 10% more data tripled the runtime. The real problem was that the data no longer fit neatly into Spark’s default partition count. The pipeline was fine. The math wasn’t. Once I learned the rules around partition sizing, the runtime collapsed back to normal. This post is the field guide that would have spared me that afternoon.
The default trap
Spark’s default `spark.sql.shuffle.partitions` is 200. It has been 200 since around Spark 1.x, and most jobs never change it.
```
your data size   with 200 shuffle partitions   verdict
─────────────────────────────────────────────────────────────────────
50 MB            each ≈ 250 KB                 ← way too many tiny partitions
5 GB             each ≈ 25 MB                  ← starting to be okay
50 GB            each ≈ 250 MB                 ← at the top of the healthy range
500 GB           each ≈ 2.5 GB                 ← OOMs likely
```

The rule of thumb: 128–256 MB per partition is the sweet spot for most jobs. Smaller wastes scheduling overhead; larger risks GC pauses and OOM.
For 50 GB of data, you want roughly 200-400 partitions, which is fine with the default. For 500 GB you want 2,000-4,000. For 50 MB you want 1, maybe 2.
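The arithmetic is simple enough to wrap in a helper and set per job. A minimal sketch, assuming an active SparkSession named `spark` (the helper name and the 128 MB default are my choices, not anything Spark ships):

```python
def target_shuffle_partitions(total_bytes: int, target_mb: int = 128) -> int:
    """Partition count that keeps each shuffle partition near target_mb."""
    return max(1, total_bytes // (target_mb * 1024 * 1024))

# e.g. before running a job that shuffles ~50 GB:
spark.conf.set("spark.sql.shuffle.partitions",
               target_shuffle_partitions(50 * 1024**3))  # 400
```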
Diagnose first
Before tuning, look at what the partitions actually look like:
```python
from pyspark.sql import functions as F

df.groupBy(F.spark_partition_id().alias("pid")) \
  .count() \
  .orderBy(F.desc("count")) \
  .show(10)
```

Three patterns I look for:
1. Many tiny partitions:
```
+---+--------+
|pid|   count|
+---+--------+
|  0|    1247|
|  1|    1192|
|  2|    1183|
... (197 more like this)
```

Spark has scheduling overhead per partition (~1–10 ms). With 200 partitions of 1k rows each, you’re spending more time scheduling than processing. Coalesce.
2. Skewed partitions:
```
+---+---------+
|pid|    count|
+---+---------+
|  3|6_834_192|   ← one partition has 90% of the data
|  0|   12_443|
|  1|   12_117|
... (197 more small)
```

This is data skew, covered in detail in the PySpark skew-detection post. Different problem, different fix (salting, broadcast joins, AQE skew-join handling).
3. Right-sized partitions:
```
+---+---------+
|pid|    count|
+---+---------+
|  0|  423_192|
|  1|  421_117|
| 12|  419_847|
... (median ≈ max ± 5%)
```

Rare, ideal. Leave it alone.
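To avoid eyeballing the output, the same check can be turned into a crude classifier. A sketch building on the snippet above (the 10× and 10k-row thresholds are my own rough picks, not anything Spark defines):

```python
rows = (df.groupBy(F.spark_partition_id().alias("pid"))
          .count()
          .collect())
counts = sorted(r["count"] for r in rows)
median, biggest = counts[len(counts) // 2], counts[-1]

if biggest > 10 * median:
    print(f"skewed: max partition has {biggest} rows vs median {median}")
elif median < 10_000:
    print("many tiny partitions: consider coalescing")
else:
    print("looks reasonable")
```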
repartition vs coalesce: when to use which
Both change partition count. They’re not interchangeable.
```
repartition(N)                              - full shuffle
┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐    - even distribution
│A │ │B │ │C │ │D │ │E │ │F │ │G │ │H │    - works up OR down
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘
  ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓
┌─────────┐    ┌─────────┐    ┌─────────┐
│   AB    │    │  CDEF   │    │   GH    │
└─────────┘    └─────────┘    └─────────┘
↑ shuffles all rows across the wire
```

```
coalesce(N)                                 - no shuffle
┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐    - just merges adjacent
│A │ │B │ │C │ │D │ │E │ │F │ │G │ │H │    - down only
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘    - may produce skew
└──A+B─┘ └──C+D─┘ └──E+F─┘ └──G+H─┘
↑ no rows move; partitions stitched
```

Use repartition(N) when:
- You’re going up (200 → 1000)
- Existing partitions are skewed and you want even distribution
- You care more about balance than shuffle cost
Use coalesce(N) when:
- You’re going down (1000 → 100), AND existing partitions are roughly even
- You want to avoid the shuffle cost
- Common use: writing many small partitions to disk → coalesce to fewer files
```python
# many small files becoming one big file at write time
result.coalesce(1).write.parquet("/output/single_file.parquet")  # only safe if data is small
```
```python
# rebalancing skewed partitions before a heavy operation
even = skewed_df.repartition(400, "user_id")
even.groupBy("user_id").agg(...).write...
```

The most common mistake: using coalesce(1) on a 100 GB dataset to get one output file. That collapses all 100 GB onto one executor and OOMs. Use repartition(1) if you really need one file (still slow), or live with N files.
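For completeness, a sketch of the safer single-file variant (the output path is illustrative):

```python
# repartition(1) inserts a shuffle, so upstream stages keep their
# parallelism; only the final write runs as a single task.
# coalesce(1) would collapse the whole upstream plan onto one task.
result.repartition(1).write.parquet("/output/one_file.parquet")
```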
When AQE saves you (and when it doesn’t)
Adaptive Query Execution (Spark 3.0+) can dynamically merge small partitions and split large ones at runtime. Two flags to enable:
spark.conf.set("spark.sql.adaptive.enabled", "true")spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")What this does: after each shuffle stage, AQE looks at the partition sizes and merges adjacent small ones into target-size chunks (default 64 MB).
When AQE saves you:
- Aggregations that produce skewed-but-fixable output sizes
- Joins where one side is much smaller than expected
- Ad-hoc queries on data of unknown size
When AQE does not save you:
- The very first read from disk before any shuffle (AQE only kicks in post-shuffle)
- Skewed data within a partition (you need skewJoin.enabled for that; see the snippet after this list)
- Workloads where the heuristic mispicks (rare in 3.4+, common in 3.0)
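The skew-join flag referenced above is its own AQE config; enabling it lets AQE split oversized join partitions at runtime:

```python
# let AQE detect and split skewed partitions in sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```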
A common confusion: “AQE is on, why are my partitions still wrong?” Because AQE only acts post-shuffle. If your initial read produces 20,000 tiny partitions, AQE doesn’t see them. You need repartition() after reading.
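A quick way to check what the initial read produced (a real DataFrame API; the path is illustrative):

```python
df = spark.read.parquet("/data/source")
print(df.rdd.getNumPartitions())  # if this prints 20_000, AQE won't rescue you
```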
The diagnostic loop I run
1. Run the job. Note the runtime.
2. Open Spark UI → Stages → look at the slowest stage.
3. Click into it. Look at:
   - Number of tasks (= number of partitions for that stage)
   - Median vs max task duration
   - Median vs max input/output bytes
4. Compute: input bytes / number of partitions = avg partition size (sketched as code after this list)
   - Below 10 MB: too many partitions, coalesce
   - 64–256 MB: about right, leave alone
   - Above 512 MB: too few, repartition up
5. If max input bytes is >> median input bytes:
   - That's skew; see the skew-detection post
6. Apply the fix, re-run, compare.

The loop usually closes within 2–3 iterations on a problem job.
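The step-4 arithmetic, as a paste-able helper. A minimal sketch; the thresholds just mirror the list above and are plain Python, not a Spark API:

```python
def classify_stage(input_bytes: int, num_partitions: int) -> str:
    """Apply the step-4 thresholds from the diagnostic loop."""
    avg_mb = input_bytes / num_partitions / (1024 * 1024)
    if avg_mb < 10:
        return f"{avg_mb:.0f} MB avg: too many partitions, coalesce"
    if 64 <= avg_mb <= 256:
        return f"{avg_mb:.0f} MB avg: about right, leave alone"
    if avg_mb > 512:
        return f"{avg_mb:.0f} MB avg: too few, repartition up"
    return f"{avg_mb:.0f} MB avg: borderline, probably fine"

print(classify_stage(50 * 1024**3, 200))  # 256 MB avg: about right, leave alone
```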
Patterns I now reach for
Pattern 1: explicit repartition after read
```python
df = spark.read.parquet("/data/source") \
        .repartition(400, "user_id")  # explicit, even, by-key

# downstream operations now have predictable partition layout
```

Beats letting Spark guess from input file layout.
Pattern 2: target a partition size, not a count
```python
# Plan-statistics size estimate. PySpark doesn't expose queryExecution
# directly, so this goes through the JVM handle (an internal API;
# treat the number as a best-effort estimate).
df = spark.read.parquet(path)
total_size_bytes = (df._jdf.queryExecution().optimizedPlan()
                      .stats().sizeInBytes().longValue())

target_partition_size_mb = 128
target_count = max(1, int(total_size_bytes / (target_partition_size_mb * 1024 * 1024)))

df = df.repartition(target_count, "key")
```

This auto-scales as data grows. Hard-coding repartition(200) is a time bomb for the day the data triples.
Pattern 3: coalesce only at write time
```python
# compute with whatever partition count makes the work fast
result = big_df.groupBy("k").agg(...)

# coalesce only when writing to disk, to avoid 1000 small files
result.coalesce(20).write.parquet("/output")
```

coalesce here doesn’t shuffle (since we’re going down), and you get 20 reasonably sized output files instead of 200 tiny ones.
What I no longer do
- Setting shuffle.partitions = 1000 globally for “speed”. Wrong. Set it per-job based on data size.
- coalesce(1).write for big data. A single-executor OOM waiting to happen.
- Trusting AQE to fix everything. AQE is great post-shuffle, but it’s not magic. The initial read still needs sane partitioning.
The TL;DR
- Default 200 is wrong for most data sizes. Compute the right count from data size / 128 MB.
- repartition shuffles (slow but balanced); coalesce doesn’t (fast but can keep skew).
- AQE merges small post-shuffle partitions; turn it on, but don’t rely on it for the initial read.
- Look at partition record counts before tuning. Most “slow Spark” is one of three problems: too many partitions, too few partitions, or skew. The diagnostic loop tells you which.
Get this right and “the cluster needs more nodes” becomes “the cluster was already fine, the partitions weren’t” most of the time.