post · python · 2025-12-03 · 7 min read
pandas vs Polars vs DuckDB: when each one actually wins
The “should we still use pandas?” question came up three times in the last quarter alone. The truthful answer is “it depends, and pandas is fine for most things.” But the cases where it isn’t fine are increasingly important, and Polars and DuckDB have matured to the point where picking pandas by default is no longer obvious.
This post is the decision framework I use, with code samples and the actual properties of each tool that matter in production.
The thirty-second pitch for each
pandas: the API everyone knows. Eager evaluation, in-memory, NumPy-backed. Works great up to a few GB on a single machine. Has the largest ecosystem of compatible tools (matplotlib, scikit-learn, statsmodels, etc.).
Polars: built in Rust, multi-threaded, lazy by default, and stricter about types. Often substantially faster than pandas on aggregation-heavy work, with lower peak memory; Polars’s own benchmarks and the H2O.ai-style groupby suites are the right place for hard numbers. Streaming engine handles datasets larger than RAM. Different (better, IMO) API.
DuckDB: an in-process analytical database. Speaks SQL. Reads CSV, Parquet, JSON, pandas DataFrames, Polars DataFrames, Arrow tables — anything tabular. Does query optimisation. Often outperforms both on aggregations and joins.
The decision framework
```
                        Decision
                           │
      ┌────────────────────┼────────────────────┐
      │                    │                    │
Data fits            Data is bigger       Mixed sources or
comfortably          than RAM, or you     you need SQL,
in memory            need predictable     or you're joining
(<2 GB)              perf (>2 GB)         multiple files
      │                    │                    │
      ▼                    ▼                    ▼
Use pandas if        Use Polars           Use DuckDB
the team knows it    (lazy mode +
and ecosystem        streaming)
matters
      ↑
Use Polars instead if you
care about speed or strict types
```

Three axes that matter:
- Data size — pandas is great below a few GB, painful above.
- Ecosystem — pandas has the biggest; Polars is catching up; DuckDB is different (it’s a database).
- API style — pandas is imperative, Polars is expressive (lazy, fluent), DuckDB is SQL.
pandas: still the right pick for…
```python
import pandas as pd

df = pd.read_csv("data.csv")
df["price_per_unit"] = df["price"] / df["units"]
result = df.groupby("region")["price_per_unit"].mean()
```

Where pandas wins:
- Small data (under ~2 GB) where you want minimum cognitive overhead.
- Tutorials, notebooks, and shared analyses where teammates may not know Polars.
- Plotting: matplotlib / seaborn integration is still the gold standard.
- scikit-learn pipelines that expect DataFrame-shaped input.
- Anything Stack-Overflowable. The answer to most pandas questions has been written 4x already; Polars is younger.
Where pandas falls down:
- Memory. pandas materialises every intermediate, with no streaming. A `groupby().agg()` over 5 GB of data needs ~15 GB of RAM.
- Speed on large data. Single-threaded for most operations.
- Type coercion surprises. Mixed-type columns become `object`, and NaN handling is inconsistent across types (see the snippet after this list).
- Method chaining. Possible but ugly compared to Polars.
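To make the coercion point concrete, here is a minimal snippet (illustrative values, not from any real dataset):

```python
import pandas as pd

s = pd.Series([1, 2, "three"])    # mixed types silently collapse to object
print(s.dtype)                     # object

nums = pd.to_numeric(s, errors="coerce")
print(nums.tolist())               # [1.0, 2.0, nan] -- the string became NaN
print(nums.dtype)                  # float64, not int64, because NaN forces floats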
Polars: built different, often the right answer
The headline difference: Polars is lazy by default. You build a query plan, the engine optimises it, then .collect() executes.
```python
import polars as pl

result = (
    pl.scan_csv("data.csv")   # lazy: just plans the read
    .with_columns(price_per_unit=pl.col("price") / pl.col("units"))
    .group_by("region")
    .agg(pl.col("price_per_unit").mean())
    .collect()                # actually run
)
```

Three properties that matter:
1. Lazy + optimisation. The engine sees the whole query before executing. It can push filters down to the file read, skip columns you don’t use, parallelise group-bys across cores. On a multi-core machine, this typically translates into a meaningful multiple over pandas on the same workload.
2. Strict types. Columns have a single dtype. No object fallback. Means more upfront friction (you’ll see “cannot cast utf8 to int64” errors that pandas would silently coerce), and far fewer “why is this column suddenly all NaN?” mysteries downstream.
3. Streaming engine (`scan_csv` + `.collect(streaming=True)`). Reads chunks, processes, discards. Handles datasets larger than RAM in constant memory. A sketch of pushdown and streaming follows below.
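A minimal sketch of properties 1 and 3 together, assuming a hypothetical `events.parquet` with `country`, `user_id`, and `amount` columns (the exact streaming flag has shifted across Polars versions, so check yours):

```python
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")     # lazy: nothing is read yet
    .filter(pl.col("country") == "NL")    # pushed down into the scan
    .select("user_id", "amount")          # only these columns get read
)

print(lf.explain())   # inspect the optimised plan before running anything

result = lf.collect()                  # in-memory execution
# result = lf.collect(streaming=True)  # streaming engine for larger-than-RAM;
#                                      # flag name varies across Polars versions
```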
Polars idioms that pandas users find surprising:
```python
# Filter
df.filter(pl.col("age") > 30)   # not df[df["age"] > 30]

# Multiple aggregations in one pass
df.group_by("region").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").quantile(0.95).alias("p95_price"),
    pl.len().alias("n_orders"),
)

# Window functions
df.with_columns(
    rank=pl.col("score").rank().over("region"),
)
```

Where Polars wins:
- Large data (1 GB to ~50 GB) on a single machine.
- Method-chained pipelines that read top-to-bottom.
- Type-strict workflows where silent coercion is a bug, not a feature.
- Multi-core machines (anything modern).
Where Polars loses:
- The ecosystem. scikit-learn wants pandas. matplotlib wants pandas. You’ll convert at boundaries (`df.to_pandas()`), which is fast but adds a layer (see the boundary sketch after this list).
- Stack Overflow coverage. Many Polars idioms still aren’t Googleable.
- API churn. Polars 0.x → 1.x had breaking changes; pin your version.
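What converting at the boundary looks like in practice, as a sketch with hypothetical toy data (`to_pandas()` goes through Arrow and is cheap for most dtypes):

```python
import polars as pl

df = pl.DataFrame({"region": ["NL", "DE", "NL"], "sales": [10, 20, 15]})

summary = df.group_by("region").agg(pl.col("sales").sum())

# Cross the boundary only at the end, for pandas-shaped consumers.
pdf = summary.to_pandas()
pdf.plot(x="region", y="sales", kind="bar")   # matplotlib via pandas, as usual
```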
DuckDB: the database that thinks it’s a library
DuckDB is different. It’s not a DataFrame library; it’s an analytical SQL engine that runs in-process. You don’t build up a DataFrame; you query the data where it lives.
```python
import duckdb

# Query a CSV directly, no load step
result = duckdb.sql("""
    SELECT region, AVG(price / units) AS avg_price_per_unit
    FROM 'data.csv'
    GROUP BY region
""").df()   # convert the result to pandas
```

DuckDB reads CSVs, Parquet files, JSON, even glob patterns (`SELECT * FROM 'data/*.parquet'`), and the query optimiser handles joins and aggregations smartly. On many workloads it beats Polars on aggregations and is competitive on filters.
Three things DuckDB does that nothing else does as well:
1. Multi-file federated queries.
duckdb.sql(""" SELECT u.user_id, e.event_count FROM 'users/*.parquet' u JOIN (SELECT user_id, COUNT(*) as event_count FROM 'events/*.parquet' GROUP BY user_id) e ON u.user_id = e.user_id WHERE u.country = 'NL'""").df()No load step. DuckDB scans the files, pushes the join through, returns the result. Often faster than reading both into Polars and joining.
2. SQL on pandas/Polars DataFrames.
```python
import pandas as pd
import duckdb

df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT region, AVG(price) FROM df GROUP BY region").df()
```

When you have a DataFrame in memory but the query is more naturally expressed as SQL, DuckDB queries it in place. Useful for analysts who think in SQL.
3. Embedded analytical database. If your app needs an analytics DB but you don’t want to run a separate Postgres / ClickHouse, DuckDB lives in your Python process. No network hop. A minimal sketch follows below.
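A minimal sketch of the embedded use, assuming a hypothetical `analytics.duckdb` file and an `orders/` folder of Parquet (omit the path argument to `connect()` for a throwaway in-memory DB):

```python
import duckdb

con = duckdb.connect("analytics.duckdb")   # persistent file, no server process

# Hypothetical one-time load from a folder of Parquet files.
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM 'orders/*.parquet'")

top_regions = con.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    LIMIT 5
""").df()

con.close()
```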
Where DuckDB wins:
- Multi-file queries (a folder of Parquet files = a table).
- SQL-shaped problems (joins, window functions, aggregates).
- Analytical APIs where you want a DB without operating one.
- Data science workflows where SQL is faster to write than DataFrame-fluent code.
Where DuckDB loses:
- Imperative transforms. `df["new_col"] = df["a"] / df["b"] + custom_python(df["c"])` is not its style.
- Plotting doesn’t go through it; you `.df()` and plot from pandas.
- Streaming. DuckDB reads everything; it doesn’t have a Polars-style streaming engine for larger-than-RAM work (this is changing in newer releases).
What the shape of a comparison usually looks like
A typical aggregation workload (a few GB of Parquet, group by a couple of columns, take top-K per group) shows three predictable properties on a multi-core machine:
```python
# pandas
df = pd.read_parquet("data/")
result = (df.groupby(["country", "category"])["score"]
            .nlargest(10)
            .reset_index())
# Materialises everything in memory, single-core for most of the work.
# Memory and time scale roughly linearly with input size.

# Polars (lazy)
result = (pl.scan_parquet("data/")
            .group_by(["country", "category"])
            .agg(pl.col("score").top_k(10))
            .collect())
# Lazy plan + multi-core execution. Pushes column pruning into the read.
# Typically a large speedup over pandas on this shape; lower peak memory.

# DuckDB
result = duckdb.sql("""
    SELECT country, category, score
    FROM (
        SELECT country, category, score,
               ROW_NUMBER() OVER (PARTITION BY country, category
                                  ORDER BY score DESC) AS rn
        FROM 'data/*.parquet'
    )
    WHERE rn <= 10
""").df()
# Vectorised analytical engine + query planner. Often comparable to or faster
# than Polars on aggregation-heavy SQL shapes; lower peak memory.
```

The actual factor varies by workload, hardware, and Polars/DuckDB version. The ordering on aggregation-heavy work is what’s stable: pandas slowest, Polars and DuckDB much faster and lighter, with DuckDB usually edging Polars on SQL-shaped queries. The right way to get exact numbers for your size class is to run the workload yourself; published TPC-H and groupby benchmarks ballpark it.
What I actually do in production
- Quick analysis, ad-hoc, share with non-engineers: pandas. Default for anything below 2 GB.
- Pipelines I’m writing fresh, larger data, want speed: Polars.
- Cross-file joins, SQL is the right shape, or I want analytics-DB semantics: DuckDB.
- Production ETL on Spark cluster: PySpark, this isn’t the comparison.
I rarely use just one. A pipeline often reads with DuckDB (multi-file queries), processes the result with Polars (lazy chains), and emits a final pandas DataFrame for downstream tools. Each library is good at one thing.
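As a sketch of that hybrid shape (hypothetical paths and columns; the Arrow hand-off between DuckDB and Polars is typically cheap):

```python
import duckdb
import polars as pl

# 1) DuckDB: multi-file join, handed over as an Arrow table.
tbl = duckdb.sql("""
    SELECT u.user_id, u.country, e.amount
    FROM 'users/*.parquet' u
    JOIN 'events/*.parquet' e USING (user_id)
""").arrow()

# 2) Polars: lazy chain over the Arrow table.
out = (
    pl.from_arrow(tbl)
    .lazy()
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)

# 3) pandas at the very edge, for plotting / sklearn / downstream tools.
pdf = out.to_pandas()
```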
What I no longer do
- Default to pandas for new data work above ~2 GB. Use Polars.
- Load everything just to count. DuckDB does `SELECT COUNT(*) FROM 'data/*.parquet'` without loading.
- Write nested for-loops over a DataFrame. Vectorise instead (any of the three handles this); `apply()` is a smell.
Closing
Three tools, three personalities. pandas is the lowest-cognitive-overhead option for known-small data. Polars is the right default for fresh work where speed and types matter. DuckDB is the SQL-shaped weapon for cross-file joins and analytical queries. The right answer is “the one that fits the shape of your problem”, and for any non-trivial data work, knowing all three often gets you 5-10x speedups essentially for free.
If your team is on pandas everywhere and considering a migration: start with one new pipeline in Polars. Don’t rewrite the world. Most teams end up using all three.