post · python · 2025-12-03 · 7 min read
pandas vs Polars vs DuckDB: when each one actually wins
The “should we still use pandas?” question came up three times in the last quarter alone. The truthful answer is “it depends, and pandas is fine for most things.” But the cases where it isn’t fine are increasingly important, and Polars and DuckDB have matured to the point where picking pandas by default is no longer obvious.
This post is the decision framework I use, with code samples and the actual properties of each tool that matter in production.
The thirty-second pitch for each
pandas: the API everyone knows. Eager evaluation, in-memory, NumPy-backed. Works great up to a few GB on a single machine. Has the largest ecosystem of compatible tools (matplotlib, scikit-learn, statsmodels, etc.).
Polars: built in Rust, multi-threaded, lazy by default, and stricter about types. Often substantially faster than pandas on aggregation-heavy work, with lower peak memory; Polars’s own benchmarks and the H2O.ai-style groupby suites are the right place for hard numbers. Streaming engine handles datasets larger than RAM. Different (better, IMO) API.
DuckDB: an in-process analytical database. Speaks SQL. Reads CSV, Parquet, JSON, pandas DataFrames, Polars DataFrames, Arrow tables — anything tabular. Does query optimisation. Often outperforms both on aggregations and joins.
The decision framework
```
                        Decision
                           │
      ┌────────────────────┼────────────────────┐
      │                    │                    │
Data fits            Data is bigger       Mixed sources or
comfortably          than RAM, or you     you need SQL,
in memory            need predictable     or you're joining
(<2 GB)              perf (>2 GB)         multiple files
      │                    │                    │
      ▼                    ▼                    ▼
Use pandas if        Use Polars           Use DuckDB
the team knows it    (lazy mode +
and ecosystem        streaming)
matters
      ↑
Use Polars instead if you
care about speed or strict types
```

Three axes that matter:
- Data size — pandas is great below a few GB, painful above.
- Ecosystem — pandas has the biggest; Polars is catching up; DuckDB is different (it’s a database).
- API style — pandas is imperative, Polars is expressive (lazy, fluent), DuckDB is SQL.
pandas: still the right pick for…
```python
import pandas as pd

df = pd.read_csv("data.csv")
df["price_per_unit"] = df["price"] / df["units"]
result = df.groupby("region")["price_per_unit"].mean()
```

Where pandas wins:
- Small data (under ~2 GB) where you want minimum cognitive overhead.
- Tutorials, notebooks, and shared analyses where teammates may not know Polars.
- Plotting: matplotlib / seaborn integration is still the gold standard.
- scikit-learn pipelines that expect DataFrame-shaped input.
- Anything Stack-Overflowable. The answer to most pandas questions has been written 4x already; Polars is younger.
Where pandas falls down:
- Memory. pandas materialises every intermediate, with no streaming. A `groupby().agg()` over 5 GB of data needs ~15 GB of RAM.
- Speed on large data. Single-threaded for most operations.
- Type coercion surprises. Mixed-type columns become `object`, and NaN handling is inconsistent across types (see the snippet after this list).
- Method chaining. Possible but ugly compared to Polars.
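To make the coercion point concrete, here is a minimal snippet (illustrative values, not from any real dataset):

```python
import pandas as pd

s = pd.Series([1, 2, "three"])    # mixed types silently collapse to object
print(s.dtype)                     # object

nums = pd.to_numeric(s, errors="coerce")
print(nums.tolist())               # [1.0, 2.0, nan] -- the string became NaN
print(nums.dtype)                  # float64, not int64, because NaN forces floats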
Polars: built different, often the right answer
The headline difference: Polars is lazy by default. You build a query plan, the engine optimises it, then .collect() executes.
```python
import polars as pl

result = (
    pl.scan_csv("data.csv")   # lazy: just plans the read
    .with_columns(price_per_unit=pl.col("price") / pl.col("units"))
    .group_by("region")
    .agg(pl.col("price_per_unit").mean())
    .collect()                # actually run
)
```

Three properties that matter:
1. Lazy + optimisation. The engine sees the whole query before executing. It can push filters down to the file read, skip columns you don’t use, parallelise group-bys across cores. On a multi-core machine, this typically translates into a meaningful multiple over pandas on the same workload.
2. Strict types. Columns have a single dtype. No object fallback. Means more upfront friction (you’ll see “cannot cast utf8 to int64” errors that pandas would silently coerce), and far fewer “why is this column suddenly all NaN?” mysteries downstream.
3. Streaming engine (`scan_csv` + `.collect(streaming=True)`). Reads chunks, processes, discards. Handles datasets larger than RAM in constant memory. A sketch of pushdown and streaming follows below.
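A minimal sketch of properties 1 and 3 together, assuming a hypothetical `events.parquet` with `country`, `user_id`, and `amount` columns (the exact streaming flag has shifted across Polars versions, so check yours):

```python
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")     # lazy: nothing is read yet
    .filter(pl.col("country") == "NL")    # pushed down into the scan
    .select("user_id", "amount")          # only these columns get read
)

print(lf.explain())   # inspect the optimised plan before running anything

result = lf.collect()                  # in-memory execution
# result = lf.collect(streaming=True)  # streaming engine for larger-than-RAM;
#                                      # flag name varies across Polars versions
```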
Polars idioms that pandas users find surprising:
```python
# Filter
df.filter(pl.col("age") > 30)   # not df[df["age"] > 30]

# Multiple aggregations in one pass
df.group_by("region").agg(
    pl.col("price").mean().alias("avg_price"),
    pl.col("price").quantile(0.95).alias("p95_price"),
    pl.len().alias("n_orders"),
)

# Window functions
df.with_columns(
    rank=pl.col("score").rank().over("region"),
)
```

Where Polars wins:
- Large data (1 GB to ~50 GB) on a single machine.
- Method-chained pipelines that read top-to-bottom.
- Type-strict workflows where silent coercion is a bug, not a feature.
- Multi-core machines (anything modern).
Where Polars loses:
- The ecosystem. scikit-learn wants pandas. matplotlib wants pandas. You’ll convert at boundaries (`df.to_pandas()`), which is fast but adds a layer (see the boundary sketch after this list).
- Stack Overflow coverage. Many Polars idioms still aren’t Googleable.
- API churn. Polars 0.x → 1.x had breaking changes; pin your version.
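What converting at the boundary looks like in practice, as a sketch with hypothetical toy data (`to_pandas()` goes through Arrow and is cheap for most dtypes):

```python
import polars as pl

df = pl.DataFrame({"region": ["NL", "DE", "NL"], "sales": [10, 20, 15]})

summary = df.group_by("region").agg(pl.col("sales").sum())

# Cross the boundary only at the end, for pandas-shaped consumers.
pdf = summary.to_pandas()
pdf.plot(x="region", y="sales", kind="bar")   # matplotlib via pandas, as usual
```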
DuckDB: the database that thinks it’s a library
DuckDB is different. It’s not a DataFrame library; it’s an analytical SQL engine that runs in-process. You don’t build up a DataFrame; you query the data where it lives.
```python
import duckdb

# Query a CSV directly, no load step
result = duckdb.sql("""
    SELECT region, AVG(price / units) AS avg_price_per_unit
    FROM 'data.csv'
    GROUP BY region
""").df()   # convert the result to pandas
```

DuckDB reads CSVs, Parquet files, JSON, even glob patterns (`SELECT * FROM 'data/*.parquet'`), and the query optimiser handles joins and aggregations smartly. On many workloads it beats Polars on aggregations and is competitive on filters.
Three things DuckDB does that nothing else does as well:
1. Multi-file federated queries.
duckdb.sql(""" SELECT u.user_id, e.event_count FROM 'users/*.parquet' u JOIN (SELECT user_id, COUNT(*) as event_count FROM 'events/*.parquet' GROUP BY user_id) e ON u.user_id = e.user_id WHERE u.country = 'NL'""").df()No load step. DuckDB scans the files, pushes the join through, returns the result. Often faster than reading both into Polars and joining.
2. SQL on pandas/Polars DataFrames.
```python
import pandas as pd
import duckdb

df = pd.read_csv("data.csv")
result = duckdb.sql("SELECT region, AVG(price) FROM df GROUP BY region").df()
```

When you have a DataFrame in memory but the query is more naturally expressed as SQL, DuckDB queries it in place. Useful for analysts who think in SQL.
3. Embedded analytical database. If your app needs an analytics DB but you don’t want to run a separate Postgres / ClickHouse, DuckDB lives in your Python process. No network hop. A minimal sketch follows below.
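A minimal sketch of the embedded use, assuming a hypothetical `analytics.duckdb` file and an `orders/` folder of Parquet (omit the path argument to `connect()` for a throwaway in-memory DB):

```python
import duckdb

con = duckdb.connect("analytics.duckdb")   # persistent file, no server process

# Hypothetical one-time load from a folder of Parquet files.
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM 'orders/*.parquet'")

top_regions = con.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    LIMIT 5
""").df()

con.close()
```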
Where DuckDB wins:
- Multi-file queries (a folder of Parquet files = a table).
- SQL-shaped problems (joins, window functions, aggregates).
- Analytical APIs where you want a DB without operating one.
- Data science workflows where SQL is faster to write than DataFrame-fluent code.
Where DuckDB loses:
- Imperative transforms. `df["new_col"] = df["a"] / df["b"] + custom_python(df["c"])` is not its style.
- Plotting doesn’t go through it; you `.df()` and plot from pandas.
- Streaming. DuckDB reads everything; it doesn’t have a Polars-style streaming engine for larger-than-RAM work (this is changing in newer releases).
What the shape of a comparison usually looks like
A typical aggregation workload (a few GB of Parquet, group by a couple of columns, take top-K per group) shows three predictable properties on a multi-core machine:
```python
# pandas
df = pd.read_parquet("data/")
result = (df.groupby(["country", "category"])["score"]
            .nlargest(10)
            .reset_index())
# Materialises everything in memory, single-core for most of the work.
# Memory and time scale roughly linearly with input size.

# Polars (lazy)
result = (pl.scan_parquet("data/")
            .group_by(["country", "category"])
            .agg(pl.col("score").top_k(10))
            .collect())
# Lazy plan + multi-core execution. Pushes column pruning into the read.
# Typically a large speedup over pandas on this shape; lower peak memory.

# DuckDB
result = duckdb.sql("""
    SELECT country, category, score
    FROM (
        SELECT country, category, score,
               ROW_NUMBER() OVER (PARTITION BY country, category
                                  ORDER BY score DESC) AS rn
        FROM 'data/*.parquet'
    )
    WHERE rn <= 10
""").df()
# Vectorised analytical engine + query planner. Often comparable to or faster
# than Polars on aggregation-heavy SQL shapes; lower peak memory.
```

The actual factor varies by workload, hardware, and Polars/DuckDB version. The ordering on aggregation-heavy work is what’s stable: pandas slowest, Polars and DuckDB much faster and lighter, with DuckDB usually edging Polars on SQL-shaped queries. The right way to get exact numbers for your size class is to run the workload yourself; published TPC-H and groupby benchmarks ballpark it.
What I actually do in production
- Quick analysis, ad-hoc, share with non-engineers: pandas. Default for anything below 2 GB.
- Pipelines I’m writing fresh, larger data, want speed: Polars.
- Cross-file joins, SQL is the right shape, or I want analytics-DB semantics: DuckDB.
- Production ETL on Spark cluster: PySpark, this isn’t the comparison.
I rarely use just one. A pipeline often reads with DuckDB (multi-file queries), processes the result with Polars (lazy chains), and emits a final pandas DataFrame for downstream tools. Each library is good at one thing.
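As a sketch of that hybrid shape (hypothetical paths and columns; the Arrow hand-off between DuckDB and Polars is typically cheap):

```python
import duckdb
import polars as pl

# 1) DuckDB: multi-file join, handed over as an Arrow table.
tbl = duckdb.sql("""
    SELECT u.user_id, u.country, e.amount
    FROM 'users/*.parquet' u
    JOIN 'events/*.parquet' e USING (user_id)
""").arrow()

# 2) Polars: lazy chain over the Arrow table.
out = (
    pl.from_arrow(tbl)
    .lazy()
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)

# 3) pandas at the very edge, for plotting / sklearn / downstream tools.
pdf = out.to_pandas()
```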
What I no longer do
- Default to pandas for new data work above ~2 GB. Use Polars.
- Load everything just to count. DuckDB does `SELECT COUNT(*) FROM 'data/*.parquet'` without loading.
- Write nested for-loops over a DataFrame. Vectorise instead (any of the three handles this); `apply()` is a smell.
Closing
Three tools, three personalities. pandas is the lowest-cognitive-overhead option for known-small data. Polars is the right default for fresh work where speed and types matter. DuckDB is the SQL-shaped weapon for cross-file joins and analytical queries. The right answer is “the one that fits the shape of your problem”, and for any non-trivial data work, knowing all three often gets you 5-10x speedups essentially for free.
If your team is on pandas everywhere and considering a migration: start with one new pipeline in Polars. Don’t rewrite the world. Most teams end up using all three.