Using pandas with Python allows you to handle much more data than you could with Microsoft Excel or Google Sheets.
SQL databases are very popular for storing data, but the Python ecosystem has many advantages over SQL when it comes to expressiveness, testing, reproducibility, and the ability to quickly perform data analysis, statistics, and machine learning.
Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you’re working in the cloud, more memory costs more money.
Regardless of where your code is running, you want operations to happen quickly so you can GSD (Get Stuff Done)! 😀
If you’ve ever heard or seen advice on speeding up code, you’ve seen the warning: ⚠️ Don’t prematurely optimize! ⚠️
This is good advice. But it’s also smart to know techniques so you can write clean, fast code the first time. 🚀
The following are three good coding practices for any size dataset.
- Avoid nested loops whenever possible. Here’s a brief primer on Big-O notation and algorithm analysis. One for loop nested inside another for loop generally leads to polynomial-time calculations. If you have more than a few items to search through, you’ll be waiting for a while. See a nice chart and explanation here.
- Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than repeatedly loading the list’s append attribute and calling it as a function (hat tip to the Stack Overflow answer here). However, in general, don’t sacrifice clarity for speed, so be careful with nesting list comprehensions. ⚠️ There’s a quick sketch of this point right after the list.
- In pandas, use built-in vectorized functions. The principle is really the same as the reason for list and dict comprehensions: applying a function to a whole data structure at once is much faster than repeatedly calling a function.
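To make the comprehension point concrete, here’s a minimal sketch (the numbers are arbitrary):

```python
# Slower: look up and call the list's append method on every iteration
squares = []
for n in range(100_000):
    squares.append(n ** 2)

# Faster and cleaner: build the list in a single expression
squares = [n ** 2 for n in range(100_000)]
```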
If you find yourself reaching for apply, think about whether you really need it. Under the hood, apply loops over rows or columns. Vectorized methods are usually faster and take less code, so they’re a win-win. 🚀
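Here’s a minimal sketch of the difference, using a made-up DataFrame with price and qty columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(1_000_000),
    "qty": np.random.randint(1, 10, size=1_000_000),
})

# apply loops over every row in Python, so it's slow
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: operate on whole columns at once, much faster and less code
df["total"] = df["price"] * df["qty"]
```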
Avoid the other pandas Series and DataFrame methods that loop over your data, such as itertuples and iterrows. Use the replace method on a DataFrame instead of any of those options to save lots of time.
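For example, instead of looping to recode values, one replace call handles the whole column (toy data below):

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "A", "B"]})

# One vectorized call instead of an itertuples/iterrows loop
df["grade"] = df["grade"].replace({"A": 4, "B": 3, "C": 2})
print(df)
```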
Notice that these suggestions might not hold for very small amounts of data, but in those cases, the stakes are low, so who cares. 😉
If you can, stay in pandas. 🐼
It’s a happy place. 😀
Don’t worry about these issues if you aren’t having problems and you don’t expect your data to balloon. But at some point, you’ll encounter a big dataset and then you’ll want to know what to do. Let’s see some tips.
- Use a subset of your data to explore, clean, and make a baseline model if you’re doing machine learning. Solve 90% of your problems fast, and save time and resources. This technique can save you so much time!
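A minimal sketch, assuming a hypothetical big_file.csv:

```python
import pandas as pd

# Read only the first 100,000 rows while you explore and clean
df = pd.read_csv("big_file.csv", nrows=100_000)

# Or take a random 10% sample of data you've already loaded
small_df = df.sample(frac=0.1, random_state=42)
```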
- Load only the columns that you need with the usecols argument when reading in your DataFrame. Less data in = win!
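For example (the file and column names here are hypothetical):

```python
import pandas as pd

# Only the two columns you actually need ever make it into memory
df = pd.read_csv("big_file.csv", usecols=["user_id", "purchase_amount"])
```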
- Use dtypes efficiently. Downcast numeric columns to the smallest dtype that makes sense with pandas.to_numeric(). Convert columns with low cardinality (just a few values) to a categorical dtype. Here’s a pandas guide on efficient dtypes.
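A small sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [10, 20, 30],
    "purchase_amount": [9.99, 19.50, 3.25],
    "state": ["NY", "CA", "NY"],
})

df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["purchase_amount"] = pd.to_numeric(df["purchase_amount"], downcast="float")
df["state"] = df["state"].astype("category")  # low cardinality, so make it categorical
print(df.dtypes)
```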
- Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine’s cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument n_jobs=-1 when doing cross-validation with GridSearchCV and many other classes.
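For example, with a small built-in dataset standing in for your own:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,  # use every available core instead of just one
)
search.fit(X, y)
print(search.best_params_)
```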
- Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.
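A quick sketch (feather requires the pyarrow package):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

df.to_feather("df.feather")       # fast binary format, needs pyarrow
df = pd.read_feather("df.feather")

df.to_pickle("df.pkl")            # Python's native serialization
df = pd.read_pickle("df.pkl")
```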
- Use pd.eval to speed up pandas operations. Pass the function your usual code as a string, and it does the operation much faster. Here’s a chart from tests with a 100-column DataFrame.
df.query is basically the same as pd.eval, but as a DataFrame method instead of a top-level pandas function.
See the docs because there are some gotchas. ⚠️
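A minimal sketch of both:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=["a", "b", "c", "d"])

# pd.eval evaluates the string expression with numexpr under the hood
result = pd.eval("df.a + df.b * df.c")

# df.query works the same way, here used to filter rows
filtered = df.query("a > 0.5 and d < 0.25")
```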
pandas uses numexpr under the hood. Numexpr also works with NumPy. Hat tip to Chris Conlan in his book Fast Python for pointing me to numexpr. Chris’s book is an excellent read for learning how to speed up your Python code. 👍
- Use numba. Numba gives you a big speed boost if you’re doing mathematical calculations. Install numba and import it. Then use the @numba.jit decorator when you need to loop over NumPy arrays and can’t use vectorized methods. It works only with NumPy arrays, so use .to_numpy() on a pandas DataFrame to convert it to a NumPy array.
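A minimal sketch of the pattern; note that the first call is slower because numba compiles the function:

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def total_distance(xs, ys):
    # An explicit loop over NumPy arrays, compiled to machine code by numba
    total = 0.0
    for i in range(xs.shape[0] - 1):
        total += ((xs[i + 1] - xs[i]) ** 2 + (ys[i + 1] - ys[i]) ** 2) ** 0.5
    return total

xs = np.random.rand(1_000_000)
ys = np.random.rand(1_000_000)
print(total_distance(xs, ys))  # first call includes compile time; later calls are fast
```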
- Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.
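For example, converting a mostly-zero column to a sparse dtype:

```python
import numpy as np
import pandas as pd

dense = pd.Series(np.zeros(1_000_000))
dense.iloc[::1000] = 1.0  # only one value in a thousand is nonzero

sparse = dense.astype(pd.SparseDtype("float", fill_value=0.0))
print(dense.memory_usage(), sparse.memory_usage())  # the sparse version is far smaller
```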
- Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.
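A minimal sketch; the file glob and column name are hypothetical:

```python
import dask.dataframe as dd

# Reads the matching CSVs in parallel chunks
ddf = dd.read_csv("data/2020-*.csv")

# Looks like pandas, but nothing runs until you call .compute()
result = ddf["purchase_amount"].mean().compute()
print(result)
```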
- Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.
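A quick sketch of sorting on the GPU when one is available:

```python
import torch

x = torch.rand(10_000_000)
if torch.cuda.is_available():
    x = x.to("cuda")  # the big speedups come from running on the GPU

values, indices = torch.sort(x)
print(values[:5])
```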
The following three packages are bleeding edge as of mid-2020. Expect configuration issues and early stage APIs. If you are working locally on a CPU, these are unlikely to fit your needs. But they all look very promising and are worth keeping an eye on. 🔭
- Do you have access to lots of CPU cores? Does your data have more than 32 columns (necessary as of mid-2020)? Then consider Modin. It mimics a subset of the pandas library to speed up operations on large datasets. It uses Apache Arrow (via Ray) or Dask under the hood. The Dask backend is experimental. Some things weren’t fast in my tests: for example, reading in data from NumPy arrays was slow, and memory management was an issue.
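Using Modin is mostly a one-line change, assuming a hypothetical big_file.csv:

```python
import modin.pandas as pd  # the only change from a normal pandas workflow

df = pd.read_csv("big_file.csv")  # Modin spreads the work across your cores via Ray or Dask
print(df.describe())
```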
- You can use JAX in place of NumPy. JAX is an open source Google product that’s bleeding edge. It speeds up operations by using five things under the hood: autograd, XLA, JIT, a vectorizer, and a parallelizer. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. JAX is good for deep learning, too. It has a NumPy version but no pandas version yet. However, you could convert a DataFrame to TensorFlow or NumPy and then use JAX. Read more here.
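A tiny sketch of the NumPy-like API plus jit compilation:

```python
import jax
import jax.numpy as jnp

@jax.jit  # compile the function with XLA
def normalize(x):
    return (x - x.mean()) / x.std()

x = jnp.arange(1_000_000, dtype=jnp.float32)
print(normalize(x)[:5])
```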
- Rapids cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open source Python package from NVIDIA. Rapids plays nicely with Dask so you could get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.
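A minimal sketch, assuming you have an NVIDIA GPU, the RAPIDS libraries installed, and a hypothetical big_file.csv:

```python
import cudf

gdf = cudf.read_csv("big_file.csv")  # the DataFrame lives in GPU memory
print(gdf.head())
```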
If you want to time an operation in a Jupyter notebook, you can use the %time or %%timeit magic commands. They both work on a single line or an entire code cell. %time runs the code once, and %%timeit runs it multiple times (the default is seven runs). Do check out the docs to see some subtleties.
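For example, in one notebook cell:

```python
%time total = sum(i * i for i in range(10_000_000))
```

And in a separate cell, since %%timeit has to be the first line of its cell:

```python
%%timeit
total = sum(i * i for i in range(10_000_000))
```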
If you are in a script or notebook, you can import the time module, check the time before and after running some code, and find the difference.
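A minimal sketch with time.perf_counter:

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(10_000_000))  # the code you want to time
elapsed = time.perf_counter() - start
print(f"That took {elapsed:.3f} seconds")
```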
When timing code, note that different machines and software versions can cause variation. Caching can sometimes mislead you if you are doing repeated tests. As with all experimentation, hold everything you can constant. 👍
Storing big data
Make sure you aren’t auto-uploading files to Dropbox, iCloud, or some other auto-backup service, unless you want to be.
Want to learn more?
Have other tips? I’d love to hear them over on Twitter. 🎉
You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.
I hope you’ve found this guide to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀
I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍