post · python · 2025-12-03 · 7 min read

pandas vs Polars vs DuckDB: when each one actually wins

#python #pandas #polars #duckdb #data-analysis

The “should we still use pandas?” question came up three times in the last quarter alone. The truthful answer is “it depends, and pandas is fine for most things.” But the cases where it isn’t fine are increasingly important, and Polars and DuckDB have matured to the point where picking pandas by default is no longer obvious.

This post is the decision framework I use, with code samples and the actual properties of each tool that matter in production.

The thirty-second pitch for each

pandas: the API everyone knows. Eager evaluation, in-memory NumPy-backed. Works great up to a few GB on a single machine. Has the largest ecosystem of compatible tools (matplotlib, scikit-learn, statsmodels, etc.).

Polars: built in Rust, multi-threaded, lazy when you use scan_* / LazyFrame, and stricter about types. Often substantially faster than pandas on aggregation-heavy work, with lower peak memory; Polars's own benchmarks and the H2O.ai-style groupby suites are the right place for hard numbers. The streaming engine handles datasets larger than RAM. Different (better, IMO) API.

DuckDB: an in-process analytical database. Speaks SQL. Reads CSV, Parquet, JSON, pandas DataFrames, Polars DataFrames, Arrow tables — anything tabular. Does query optimisation. Often outperforms both on aggregations and joins.

The decision framework

Decision
│
├─ Data fits comfortably in memory (<2 GB)
│     → use pandas if the team knows it and ecosystem matters;
│       use Polars instead if you care about speed or strict types
│
├─ Data is bigger than RAM, or you need predictable perf (>2 GB)
│     → use Polars (lazy mode + streaming)
│
└─ Mixed sources, you need SQL, or you're joining multiple files
      → use DuckDB

Three axes that matter: how big the data is relative to RAM, whether the work is DataFrame-shaped or SQL-shaped (single-table transforms vs. joins across sources), and how much team familiarity and the pandas ecosystem weigh.

pandas: still the right pick for…

import pandas as pd
df = pd.read_csv("data.csv")
df["price_per_unit"] = df["price"] / df["units"]
result = df.groupby("region")["price_per_unit"].mean()

Where pandas wins:

- Ecosystem: matplotlib, scikit-learn, statsmodels and most downstream tools take a pandas DataFrame natively.
- Familiarity: everyone on the team can read, write, and review it.
- Known-small data: for anything that fits comfortably in memory, eager in-memory work is simple and fast enough.

Where pandas falls down:

- Single-core for most operations; time scales roughly linearly with input size.
- Peak memory: everything is materialised, so multi-GB inputs get painful fast.
- Silent type coercion via the object dtype, which surfaces later as "why is this column suddenly all NaN?"

Polars: built different, often the right answer

The headline difference: Polars is built around lazy execution. With scan_csv you build a query plan, the engine optimises it, then .collect() executes.

import polars as pl

result = (
    pl.scan_csv("data.csv")  # lazy: just plans the read
    .with_columns(price_per_unit=pl.col("price") / pl.col("units"))
    .group_by("region")
    .agg(pl.col("price_per_unit").mean())
    .collect()  # actually run
)

Three properties that matter (a short sketch of each follows this list):

1. Lazy + optimisation. The engine sees the whole query before executing. It can push filters down to the file read, skip columns you don’t use, parallelise group-bys across cores. On a multi-core machine, this typically translates into a meaningful multiple over pandas on the same workload.

2. Strict types. Columns have a single dtype. No object fallback. Means more upfront friction (you’ll see “cannot cast utf8 to int64” errors that pandas would silently coerce), and far fewer “why is this column suddenly all NaN?” mysteries downstream.

3. Streaming engine. scan_csv plus .collect(engine="streaming") (spelled .collect(streaming=True) in older Polars versions) reads chunks, processes, discards. Handles datasets larger than RAM in roughly constant memory.
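First, the optimiser. You can watch it work by asking the lazy frame for its plan; a minimal sketch, assuming a data.csv with the columns used above:

import polars as pl

plan = (
    pl.scan_csv("data.csv")        # nothing is read yet
    .filter(pl.col("units") > 0)   # predicate: pushed down into the scan
    .select("region", "price")     # projection: other columns never parsed
)
print(plan.explain())              # the optimised plan shows both pushdowns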
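Second, strictness at the boundary. A small sketch with made-up values, showing the default hard failure and the explicit opt-out:

import polars as pl

df = pl.DataFrame({"user_id": ["1", "2", "oops"]})

# df.with_columns(pl.col("user_id").cast(pl.Int64))  # raises: "oops" won't cast
df = df.with_columns(
    pl.col("user_id").cast(pl.Int64, strict=False)   # bad rows become null instead
)
print(df)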
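Third, streaming is one argument away. A sketch, assuming a large hypothetical events.csv with status and user_id columns:

import polars as pl

result = (
    pl.scan_csv("events.csv")          # hypothetical large file
    .filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.len().alias("n_events"))
    .collect(engine="streaming")       # chunked execution, ~constant memory
)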

Polars idioms that pandas users find surprising:

# Filter
df.filter(pl.col("age") > 30)  # not df[df["age"] > 30]

# Multiple aggregations in one pass
df.group_by("region").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").quantile(0.95).alias("p95_price"),
    pl.len().alias("n_orders"),
)

# Window functions
df.with_columns(
    rank=pl.col("score").rank().over("region"),
)

Where Polars wins:

- Aggregation-heavy work on a multi-core machine: group-bys, joins, window functions.
- Datasets larger than RAM, via the streaming engine.
- Pipelines where strict types catch bad data at the boundary instead of deep downstream.

Where Polars loses:

- Ecosystem: most plotting and ML tools still want pandas, so you convert at the edges.
- Familiarity: the API is different, and lazy evaluation takes adjustment.
- Quick, loose interactive exploration on small data, where pandas' permissiveness is a feature.

DuckDB: the database that thinks it’s a library

DuckDB is different. It’s not a DataFrame; it’s an analytical SQL engine that runs in-process. You don’t import a DataFrame, you query data.

import duckdb

# Query a CSV directly, no load step
result = duckdb.sql("""
    SELECT region, AVG(price / units) AS avg_price_per_unit
    FROM 'data.csv'
    GROUP BY region
""").df()  # convert result to pandas

DuckDB reads CSVs, Parquet files, JSON, even glob patterns (SELECT * FROM 'data/*.parquet'), and the query optimiser handles joins and aggregations smartly. On many workloads it beats Polars on aggregations and is competitive on filters.

Three things DuckDB does that nothing else does as well:

1. Multi-file federated queries.

duckdb.sql("""
    SELECT u.user_id, e.event_count
    FROM 'users/*.parquet' u
    JOIN (
        SELECT user_id, COUNT(*) AS event_count
        FROM 'events/*.parquet'
        GROUP BY user_id
    ) e ON u.user_id = e.user_id
    WHERE u.country = 'NL'
""").df()

No load step. DuckDB scans the files, pushes the join through, returns the result. Often faster than reading both into Polars and joining.
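If you'd rather verify the pushdown than take it on faith, prefix the query with EXPLAIN; a sketch against the same hypothetical files:

import duckdb

print(duckdb.sql("""
    EXPLAIN
    SELECT u.user_id
    FROM 'users/*.parquet' u
    WHERE u.country = 'NL'
"""))
# The printed plan shows the filter applied at the Parquet scan, not after it.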

2. SQL on pandas/Polars DataFrames.

import pandas as pd
import duckdb
df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT region, AVG(price) FROM df GROUP BY region").df()

When you have a DataFrame in memory but the query is more naturally expressed as SQL, DuckDB queries it in place; it resolves df by variable name from the surrounding Python scope. Useful for analysts who think in SQL.
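The same trick works for Polars frames, and results can come back as Polars too. A sketch, assuming a recent duckdb with Polars installed and the same hypothetical data.csv:

import duckdb
import polars as pl

pldf = pl.read_csv("data.csv")
out = duckdb.sql(
    "SELECT region, AVG(price) AS avg_price FROM pldf GROUP BY region"
).pl()  # .pl() returns a Polars DataFrame instead of pandas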

3. Embedded analytical database. If your app needs an analytics DB but you don’t want to run a separate Postgres / ClickHouse, DuckDB lives in your Python process. No network hop.
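A minimal sketch of the embedded mode; the file name and table are hypothetical:

import duckdb

con = duckdb.connect("analytics.duckdb")  # persistent single-file database
con.sql("CREATE TABLE IF NOT EXISTS events AS SELECT * FROM 'events/*.parquet'")
top = con.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""").df()
con.close()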

Where DuckDB wins:

- Joins and aggregations across many files (CSV/Parquet globs) with no load step.
- SQL-shaped analytical queries, where its optimiser usually edges out Polars.
- Embedded analytics: a real columnar engine inside your process, optionally persisted to a single file.

Where DuckDB loses:

- Imperative, row-wise data munging: it's a query engine, not a DataFrame API.
- Pipelines whose downstream tools want a DataFrame anyway; you convert at the end regardless.
- Logic that's awkward to express in SQL, like custom Python transforms mid-pipeline.

What the shape of a comparison usually looks like

A typical aggregation workload (a few GB of Parquet, group by a couple of columns, take top-K per group) shows three predictable properties on a multi-core machine:

# pandas
df = pd.read_parquet("data/")
result = (df.groupby(["country", "category"])["score"]
            .nlargest(10)
            .reset_index())
# Materialises everything in memory, single-core for most of the work.
# Memory and time scale roughly linearly with input size.

# Polars (lazy)
result = (pl.scan_parquet("data/")
            .group_by(["country", "category"])
            .agg(pl.col("score").top_k(10))  # list column per group; .explode("score") flattens it
            .collect())
# Lazy plan + multi-core execution. Pushes column pruning into the read.
# Typically a large speedup over pandas on this shape; lower peak memory.

# DuckDB
result = duckdb.sql("""
    SELECT country, category, score
    FROM (
        SELECT country, category, score,
               ROW_NUMBER() OVER (PARTITION BY country, category
                                  ORDER BY score DESC) AS rn
        FROM 'data/*.parquet'
    )
    WHERE rn <= 10
""").df()
# Vectorised analytical engine + query planner. Often comparable to
# or faster than Polars on aggregation-heavy SQL shapes; lower peak memory.

The actual factor varies by workload, hardware, and Polars/DuckDB version. The ordering on aggregation-heavy work is what’s stable: pandas slowest, Polars and DuckDB much faster and lighter, with DuckDB usually edging Polars on SQL-shaped queries. The right place to get exact numbers for your size class is to run the workload yourself; published TPC-H and groupby benchmarks ballpark it.
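If you do run it yourself, best-of-N wall-clock timing is enough to rank engines on one box. A sketch; run_pandas, run_polars, and run_duckdb are hypothetical zero-arg wrappers around the three snippets above:

import time

def bench(name, fn, repeats=3):
    # Best-of-N to dampen cache and warm-up noise; crude, but fine for ranking.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    print(f"{name}: {best:.2f}s")

for name, fn in [("pandas", run_pandas), ("polars", run_polars), ("duckdb", run_duckdb)]:
    bench(name, fn)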

What I actually do in production

I rarely use just one. A pipeline often reads with DuckDB (multi-file queries), processes the result with Polars (lazy chains), and emits a final pandas DataFrame for downstream tools. Each library is good at one thing.
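A sketch of that handoff, with hypothetical paths and columns; the Arrow hop keeps the DuckDB-to-Polars transfer cheap:

import duckdb
import polars as pl

# 1. DuckDB: federated read across files
tbl = duckdb.sql("SELECT * FROM 'events/*.parquet' WHERE country = 'NL'").arrow()

# 2. Polars: lazy transformation chain
out = (
    pl.from_arrow(tbl)
    .lazy()
    .with_columns((pl.col("price") / pl.col("units")).alias("price_per_unit"))
    .group_by("category")
    .agg(pl.col("price_per_unit").mean())
    .collect()
)

# 3. pandas: final frame for downstream tools (plots, sklearn, ...)
pdf = out.to_pandas()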

What I no longer do

- Pick pandas by default for every new pipeline just because it's the incumbent.
- Pull multi-GB CSVs into pandas memory just to filter and aggregate them.
- Rewrite working pandas code wholesale; migrations happen one pipeline at a time.

Closing

Three tools, three personalities. pandas is the lowest-cognitive-overhead option for known-small data. Polars is the right default for fresh work where speed and types matter. DuckDB is the SQL-shaped weapon for cross-file joins and analytical queries. The right answer is "the one that fits the shape of your problem", and for any non-trivial data work, knowing all three often gets you 5-10x speedups essentially for free.

If your team is on pandas everywhere and considering a migration: start with one new pipeline in Polars. Don’t rewrite the world. Most teams end up using all three.