post data-engineering · 2025-10-08 · 6 min read
Spark partition sizing: what I learned the hard way
I once spent an afternoon convinced our 200-line Spark pipeline had a bug, because adding 10% more data tripled the runtime. The real problem was that the data no longer fit neatly into Spark’s default partition count. The pipeline was fine. The math wasn’t. Once I learned the rules around partition sizing, the runtime collapsed back to normal. This post is the field guide that would have spared me that afternoon.
The default trap
Spark’s default `spark.sql.shuffle.partitions` is 200. It has been 200 since around Spark 1.x, and most jobs never change it.
```
your data size   with 200 shuffle partitions   verdict
─────────────────────────────────────────────────────────────────────
50 MB            each ≈ 250 KB                 ← way too many tiny partitions
5 GB             each ≈ 25 MB                  ← starting to be okay
50 GB            each ≈ 250 MB                 ← at the top of the healthy range
500 GB           each ≈ 2.5 GB                 ← OOMs likely
```

The rule of thumb: 128–256 MB per partition is the sweet spot for most jobs. Smaller wastes scheduling overhead; larger risks GC pauses and OOM.
For 50 GB of data, you want roughly 200-400 partitions, which is fine with the default. For 500 GB you want 2,000-4,000. For 50 MB you want 1, maybe 2.
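The arithmetic is simple enough to wrap in a helper and set per job. A minimal sketch, assuming an active SparkSession named `spark` (the helper name and the 128 MB default are my choices, not anything Spark ships):

```python
def target_shuffle_partitions(total_bytes: int, target_mb: int = 128) -> int:
    """Partition count that keeps each shuffle partition near target_mb."""
    return max(1, total_bytes // (target_mb * 1024 * 1024))

# e.g. before running a job that shuffles ~50 GB:
spark.conf.set("spark.sql.shuffle.partitions",
               target_shuffle_partitions(50 * 1024**3))  # 400
```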
Diagnose first
Before tuning, look at what the partitions actually look like:
```python
from pyspark.sql import functions as F

df.groupBy(F.spark_partition_id().alias("pid")) \
  .count() \
  .orderBy(F.desc("count")) \
  .show(10)
```

Three patterns I look for:
1. Many tiny partitions:
```
+---+--------+
|pid|   count|
+---+--------+
|  0|    1247|
|  1|    1192|
|  2|    1183|
... (197 more like this)
```

Spark has scheduling overhead per partition (~1–10 ms). With 200 partitions of 1k rows each, you’re spending more time scheduling than processing. Coalesce.
2. Skewed partitions:
```
+---+---------+
|pid|    count|
+---+---------+
|  3|6_834_192|   ← one partition has 90% of the data
|  0|   12_443|
|  1|   12_117|
... (197 more small)
```

This is data skew, covered in detail in the PySpark skew-detection post. Different problem, different fix (salting, broadcast joins, AQE skew-join handling).
3. Right-sized partitions:
```
+---+---------+
|pid|    count|
+---+---------+
|  0|  423_192|
|  1|  421_117|
| 12|  419_847|
... (median ≈ max ± 5%)
```

Rare, ideal. Leave it alone.
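To avoid eyeballing the output, the same check can be turned into a crude classifier. A sketch building on the snippet above (the 10× and 10k-row thresholds are my own rough picks, not anything Spark defines):

```python
rows = (df.groupBy(F.spark_partition_id().alias("pid"))
          .count()
          .collect())
counts = sorted(r["count"] for r in rows)
median, biggest = counts[len(counts) // 2], counts[-1]

if biggest > 10 * median:
    print(f"skewed: max partition has {biggest} rows vs median {median}")
elif median < 10_000:
    print("many tiny partitions: consider coalescing")
else:
    print("looks reasonable")
```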
repartition vs coalesce: when to use which
Both change partition count. They’re not interchangeable.
```
repartition(N)                              - full shuffle
┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐    - even distribution
│A │ │B │ │C │ │D │ │E │ │F │ │G │ │H │    - works up OR down
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘
  ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓
┌─────────┐    ┌─────────┐    ┌─────────┐
│   AB    │    │  CDEF   │    │   GH    │
└─────────┘    └─────────┘    └─────────┘
↑ shuffles all rows across the wire
```

```
coalesce(N)                                 - no shuffle
┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐    - just merges adjacent
│A │ │B │ │C │ │D │ │E │ │F │ │G │ │H │    - down only
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘    - may produce skew
└──A+B─┘ └──C+D─┘ └──E+F─┘ └──G+H─┘
↑ no rows move; partitions stitched
```

Use repartition(N) when:
- You’re going up (200 → 1000)
- Existing partitions are skewed and you want even distribution
- You care more about balance than shuffle cost
Use coalesce(N) when:
- You’re going down (1000 → 100), AND existing partitions are roughly even
- You want to avoid the shuffle cost
- Common use: writing many small partitions to disk → coalesce to fewer files
```python
# many small files becoming one big file at write time
result.coalesce(1).write.parquet("/output/single_file.parquet")  # only safe if data is small
```
```python
# rebalancing skewed partitions before a heavy operation
even = skewed_df.repartition(400, "user_id")
even.groupBy("user_id").agg(...).write...
```

The most common mistake: using coalesce(1) on a 100 GB dataset to get one output file. That collapses all 100 GB onto one executor and OOMs. Use repartition(1) if you really need one file (still slow), or live with N files.
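For completeness, a sketch of the safer single-file variant (the output path is illustrative):

```python
# repartition(1) inserts a shuffle, so upstream stages keep their
# parallelism; only the final write runs as a single task.
# coalesce(1) would collapse the whole upstream plan onto one task.
result.repartition(1).write.parquet("/output/one_file.parquet")
```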
When AQE saves you (and when it doesn’t)
Adaptive Query Execution (Spark 3.0+) can dynamically merge small partitions and split large ones at runtime. Two flags to enable:
spark.conf.set("spark.sql.adaptive.enabled", "true")spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")What this does: after each shuffle stage, AQE looks at the partition sizes and merges adjacent small ones into target-size chunks (default 64 MB).
When AQE saves you:
- Aggregations that produce skewed-but-fixable output sizes
- Joins where one side is much smaller than expected
- Ad-hoc queries on data of unknown size
When AQE does not save you:
- The very first read from disk before any shuffle (AQE only kicks in post-shuffle)
- Skewed data within a partition (you need skewJoin.enabled for that; see the snippet after this list)
- Workloads where the heuristic mispicks (rare in 3.4+, common in 3.0)
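The skew-join flag referenced above is its own AQE config; enabling it lets AQE split oversized join partitions at runtime:

```python
# let AQE detect and split skewed partitions in sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```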
A common confusion: “AQE is on, why are my partitions still wrong?” Because AQE only acts post-shuffle. If your initial read produces 20,000 tiny partitions, AQE doesn’t see them. You need repartition() after reading.
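A quick way to check what the initial read produced (a real DataFrame API; the path is illustrative):

```python
df = spark.read.parquet("/data/source")
print(df.rdd.getNumPartitions())  # if this prints 20_000, AQE won't rescue you
```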
The diagnostic loop I run
1. Run the job. Note the runtime.
2. Open Spark UI → Stages → look at the slowest stage.
3. Click into it. Look at:
   - Number of tasks (= number of partitions for that stage)
   - Median vs max task duration
   - Median vs max input/output bytes
4. Compute: input bytes / number of partitions = avg partition size (sketched as code after this list)
   - Below 10 MB: too many partitions, coalesce
   - 64–256 MB: about right, leave alone
   - Above 512 MB: too few, repartition up
5. If max input bytes is >> median input bytes:
   - That's skew; see the skew-detection post
6. Apply the fix, re-run, compare.

The loop usually closes within 2–3 iterations on a problem job.
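The step-4 arithmetic, as a paste-able helper. A minimal sketch; the thresholds just mirror the list above and are plain Python, not a Spark API:

```python
def classify_stage(input_bytes: int, num_partitions: int) -> str:
    """Apply the step-4 thresholds from the diagnostic loop."""
    avg_mb = input_bytes / num_partitions / (1024 * 1024)
    if avg_mb < 10:
        return f"{avg_mb:.0f} MB avg: too many partitions, coalesce"
    if 64 <= avg_mb <= 256:
        return f"{avg_mb:.0f} MB avg: about right, leave alone"
    if avg_mb > 512:
        return f"{avg_mb:.0f} MB avg: too few, repartition up"
    return f"{avg_mb:.0f} MB avg: borderline, probably fine"

print(classify_stage(50 * 1024**3, 200))  # 256 MB avg: about right, leave alone
```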
Patterns I now reach for
Pattern 1: explicit repartition after read
```python
df = spark.read.parquet("/data/source") \
        .repartition(400, "user_id")  # explicit, even, by-key

# downstream operations now have predictable partition layout
```

Beats letting Spark guess from input file layout.
Pattern 2: target a partition size, not a count
```python
# Plan-statistics size estimate. PySpark doesn't expose queryExecution
# directly, so this goes through the JVM handle (an internal API;
# treat the number as a best-effort estimate).
df = spark.read.parquet(path)
total_size_bytes = (df._jdf.queryExecution().optimizedPlan()
                      .stats().sizeInBytes().longValue())

target_partition_size_mb = 128
target_count = max(1, int(total_size_bytes / (target_partition_size_mb * 1024 * 1024)))

df = df.repartition(target_count, "key")
```

This auto-scales as data grows. Hard-coding repartition(200) is a time bomb for the day the data triples.
Pattern 3: coalesce only at write time
```python
# compute with whatever partition count makes the work fast
result = big_df.groupBy("k").agg(...)

# coalesce only when writing to disk, to avoid 1000 small files
result.coalesce(20).write.parquet("/output")
```

coalesce here doesn’t shuffle (since we’re going down), and you get 20 reasonably sized output files instead of 200 tiny ones.
What I no longer do
- Setting shuffle.partitions = 1000 globally for “speed”. Wrong. Set it per-job based on data size.
- coalesce(1).write for big data. A single-executor OOM waiting to happen.
- Trusting AQE to fix everything. AQE is great post-shuffle, but it’s not magic. The initial read still needs sane partitioning.
The TL;DR
- Default 200 is wrong for most data sizes. Compute the right count from data size / 128 MB.
- repartition shuffles (slow but balanced); coalesce doesn’t (fast but can keep skew).
- AQE merges small post-shuffle partitions; turn it on, but don’t rely on it for the initial read.
- Look at partition record counts before tuning. Most “slow Spark” is one of three problems: too many partitions, too few partitions, or skew. The diagnostic loop tells you which.
Get this right and “the cluster needs more nodes” becomes “the cluster was already fine, the partitions weren’t” most of the time.