Pandas vs Polars - A comparison on File I/O

Wes Poulsen
For more than a decade, Pandas has been the gold standard of DataFrame libraries in Python and the go-to for tasks such as manipulating tabular datasets. Most (if not all) connectors to third-party applications use the Pandas API when dealing with datasets, and that de facto status will probably continue for some time.

Enter Polars. Polars is a "Lightning-fast DataFrame library for Rust and Python" built on top of the Apache Arrow memory format. Polars' claim to fame is speed, as evidenced by the H2O.ai benchmarks, which mainly target database operations such as joins, group-bys, and aggregations. However, I am particularly interested in the I/O performance of Polars compared to Pandas. In this article, we will explore how Polars and Pandas fare at reading and writing a large dataset that fits in memory (~1GB) and one that does not (~50GB). Let's get started!

Datasets

The specifications for the datasets used in the analysis are provided here:

medium.csv
  • ~1GB
  • 2.2 Million Rows, 27 Columns
large.csv
  • ~50GB
  • 1.2 Billion Rows, 3 Columns

These datasets were generated by modifying the following bash script:

for i in {0..10000000}; do echo $(uuidgen),$RANDOM,$(head -c 12 /dev/urandom | base64); done >> file.csv
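If you would rather generate test data from Python, the sketch below is a rough, purely illustrative equivalent that produces the three-column layout (id, letter, number) used by the large dataset later on; the file name and row count here are placeholders, not the ones used in the benchmarks.

# Purely illustrative: a rough Python equivalent of the generation step,
# producing the three-column layout (id, letter, number) described later
# for large.csv; "sample.csv" and the row count are placeholders.
import csv
import random
import string
import uuid

with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "letter", "number"])
    for _ in range(1_000_000):
        writer.writerow([uuid.uuid4(),
                         random.choice(string.ascii_uppercase),
                         random.randint(0, 32767)])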

Setup

Run the following to install the libraries and the needed dependencies:

pip install pandas
pip install polars

In memory dataset

Let's compare how Pandas and Polars perform at reading and writing a DataFrame that fits in memory. To capture the execution time, I will be using the %%timeit Jupyter magic. If you don't wish to follow along in Jupyter, you could also put the code in a script and run the shell command time in front of the call, or time it in Python yourself as sketched below.
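For example, a minimal non-Jupyter timing sketch might look like this (note it measures a single run, unlike %%timeit's averaged loops):

# Minimal timing sketch for running outside Jupyter
import time

import pandas as pd

start = time.perf_counter()
pandas_df = pd.read_csv("medium.csv")
print(f"read_csv took {time.perf_counter() - start:.2f}s")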

Reading

import pandas as pd
import polars as pl
import numpy as np

%%timeit
pandas_df = pd.read_csv("medium.csv")

# Output: 13.5 s ± 528 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On my machine (MacBook Pro with an M1 chip), this took 13.5 seconds to read in. Let's see how Polars compares.

%%timeit
polars_df = pl.read_csv("medium.csv")
# Output: 350 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quite a dramatic difference when reading in the same dataset: 13.5 seconds versus 350 milliseconds!
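As an aside, if a downstream tool insists on the Pandas API (as most third-party connectors do, per the introduction), you can still take advantage of the faster Polars read and convert afterwards. A minimal sketch; note that to_pandas() copies the data and typically requires pyarrow, so the conversion is not free:

# Read quickly with Polars, then hand a Pandas DataFrame to code that expects one
import polars as pl

pandas_view = pl.read_csv("medium.csv").to_pandas()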

Writing

Now let's take that CSV that we read in and write it back out as a new file.

Pandas:

%%timeit
pandas_df.to_csv("pandas_data.csv")

# Output: 16.1 s ± 186 ms per loop

Polars:

%%timeit
polars_df.write_csv("polars_data.csv")

# Output: 951 ms ± 37.8 ms per loop

Another stark contrast: Pandas takes just over 16 seconds, while Polars takes just under 1 second.

Large file operations

Let's turn our focus to a dataset that cannot be read entirely into memory. I have a CSV file that is nearly 50 GB and consists of three columns:

  • id (a random UUID)
  • letter (a random letter from A-Z)
  • number (a random integer)

I'd like to filter this dataset down to only the rows where letter equals 'A' and number equals 1.

Executing this in Polars is fairly seamless, as the scan_csv function allows for lazy execution and is ideal for larger-than-memory datasets.

%%timeit
# Build a lazy query; nothing is read until collect() is called
q = (pl.scan_csv('large.csv')
     .filter((pl.col('letter') == 'A') &
             (pl.col('number') == 1))
    )
# collect() executes the plan; the filtered result is then written out
q.collect().write_csv("tmp/polars/polars_results.csv")
# Output: 1min 13s ± 658 ms per loop

Trimming down a 50 GB file and writing the output to a new CSV took roughly 1 minute and 13 seconds.
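One caveat: collect() materializes the filtered result in memory before writing. If the filtered output itself were too large for that, newer Polars versions can stream the result straight to disk. A rough sketch, assuming sink_csv is available in your Polars version; the output path is just an example:

# Stream the filtered rows to disk without collecting them in memory first
import polars as pl

(pl.scan_csv('large.csv')
   .filter((pl.col('letter') == 'A') &
           (pl.col('number') == 1))
   .sink_csv("tmp/polars/polars_results_streamed.csv"))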

If we were to use Pandas (without additional libraries such as Dask), we would have to resort to a chunking method to iteratively filter the DataFrame.

def filter_csv(csv_file: str, chunksize: int = 100000):
    file_name = "pandas_results.csv"
    return_path = f'tmp/pandas/{file_name}'

    for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunksize, dtype={'number': np.int32})):
        # Filter each chunk, then write the first chunk with a header
        # and append the rest without one
        filtered = chunk.loc[(chunk.letter == 'A') & (chunk.number == 1)]
        if i == 0:
            filtered.to_csv(return_path)
        else:
            filtered.to_csv(return_path, mode='a', header=False)

filter_csv('large.csv')

Since Pandas cannot lazily scan a large file the way Polars can, we had to read it in chunks and append the output to a target file. This took 9 minutes and 53 seconds on my machine.

Conclusion

To summarize the results for each:

Library   Read 1 GB File   Write 1 GB File   Read 50 GB File and Filter
Pandas    13.5 s           16.1 s            9 min 53 s
Polars    350 ms           951 ms            1 min 13 s

In each I/O operation, Polars fared better than Pandas, and by a significant margin. Granted, there are ways to improve Pandas' speed using other libraries that we did not demonstrate here, but it is nonetheless impressive that Polars is so fast right out of the box.

Will it be dethroning Pandas any time soon? Definitely not. Is it a legitimate alternative for some use cases with tremendous upside? You bet!