Pandas has no built-in multiprocessing support, and it gets slow with bigger datasets. There is a better tool that puts those CPU cores to work!
Pandas is one of the best tools when it comes to Exploratory Data Analysis. But this doesn’t mean that it is the best tool available for every task — like big data processing. I’ve spent so much time waiting for pandas to read a bunch of files or to aggregate them and calculate features.
Recently, I took the time and found a better tool, which made me update my data processing pipeline. I use this tool for heavy data processing — like reading multiple files with 10 gigabytes of data, applying filters to them and doing aggregations. When I am done with the heavy processing, I save the result to a smaller, “pandas friendly” CSV file and continue with Exploratory Data Analysis in pandas.
Download the Jupyter Notebook to follow the examples.
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. This includes numpy, pandas and sklearn. It is open-source and freely available. It uses existing Python APIs and data structures, which makes it easy to switch between them and their Dask-powered equivalents.
Dask makes simple things easy and complex things possible
Pandas vs Dask
I could go on and on describing Dask, because it has so many features, but instead, let’s look at a practical example. In my work, I usually get a bunch of files that I need to analyze. Let’s simulate my workday and create 10 files with 100K entries each (each file is about 196 MB).
```python
from sklearn.datasets import make_classification
import pandas as pd

for i in range(1, 11):
    print('Generating trainset %d' % i)
    x, y = make_classification(n_samples=100_000, n_features=100)
    df = pd.DataFrame(data=x)
    df['y'] = y
    df.to_csv('trainset_%d.csv' % i, index=False)
```
Now, let’s read those files with pandas and measure the time. Pandas doesn’t have native glob support, so we need to read the files in a loop.
```python
%%time
import glob

df_list = []
for filename in glob.glob('trainset_*.csv'):
    df_ = pd.read_csv(filename)
    df_list.append(df_)
df = pd.concat(df_list)
```
It took pandas 16 seconds to read the files.
```
CPU times: user 14.6 s, sys: 1.29 s, total: 15.9 s
Wall time: 16 s
```
Now, imagine if those files were 100 times bigger: you couldn’t even read them with pandas.
Dask can process data that doesn’t fit into memory by breaking it into blocks and specifying task chains. Let’s measure how long Dask needs to load those files.
```python
import dask.dataframe as dd
```

```python
%%time
df = dd.read_csv('trainset_*.csv')
```

```
CPU times: user 154 ms, sys: 58.6 ms, total: 212 ms
Wall time: 212 ms
```
Dask needed just 212 ms! How is that even possible? Well, it is not. Dask uses a delayed-execution paradigm: it only calculates things when they are needed. We define the execution graph, and Dask can then optimize the execution of the tasks. Let’s repeat the experiment and force Dask to actually read the data. Also notice that Dask’s read_csv function accepts glob patterns natively.
```python
%%time
df = dd.read_csv('trainset_*.csv').compute()
```

```
CPU times: user 39.5 s, sys: 5.3 s, total: 44.8 s
Wall time: 8.21 s
```
The compute function forces Dask to return the result. Dask read the files roughly twice as fast as pandas (8.21 s vs. 16 s).
Dask natively scales Python
Pandas vs Dask CPU usage
Does Dask use all of the cores you paid for? Let’s compare CPU usage between pandas and Dask when reading files — the code is the same as above.
In the screen recordings above, the difference between pandas and Dask when reading files is obvious: pandas works on a single core, while Dask keeps multiple cores busy.
Dask’s DataFrame is composed of multiple pandas DataFrames, which are split by index. When we execute read_csv with Dask, multiple processes read a single file.
We can even visualize the execution graph.
```python
exec_graph = dd.read_csv('trainset_*.csv')
exec_graph.visualize()
```
You might be thinking: if Dask is so great, why not ditch pandas altogether? Well, it is not that simple. Only certain functions from pandas are ported to Dask. Some of them are hard to parallelize, like sorting values or setting an index on an unsorted column. Dask is not a silver bullet; using it is recommended only for datasets that don’t fit in main memory. And since Dask is built on top of pandas, operations that were slow in pandas stay slow in Dask. As I mentioned before, Dask is a useful tool in the data pipeline process, but it doesn’t replace other libraries.
Dask is recommended only for datasets that don’t fit in the main memory
To install Dask simply run:
```shell
python -m pip install "dask[complete]"
```
This will install the whole Dask library.
I’ve only touched the surface of the Dask library in this blog post. If you would like to dive deeper, check the excellent Dask tutorials and Dask’s DataFrame documentation. Curious which DataFrame functions are supported in Dask? Check the DataFrame API.
I write extensively about data analysis with pandas. In case you’ve missed my other posts:
- 5 lesser-known pandas tricks
- Exploratory Data Analysis with pandas
- How NOT to write pandas code [paid]
- 5 Gotchas With Pandas
- Pandas tips that will save you hours of head-scratching
- Display Customizations for pandas Power Users
- 5 New Features in pandas 1.0 You Should Know About
- pandas analytics server
- Pandas presentation tips I wish I knew earlier
- Pandas analysis of coronavirus pandemic
- Sports analysis with Pandas
- 3 hidden mistakes with pandas
- Pandas Pivot — The Ultimate Guide