
17 Strategies for Dealing with Data, Big Data, and Even Bigger Data | by Jeff Hale | Aug, 2020

If you’ve ever heard or seen advice on speeding up code you’ve seen the warning. ⚠️ Don’t prematurely optimize! ⚠️

  1. Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than repeatedly loading the list’s append attribute and calling it as a function — hat tip to the Stack Overflow answer here. However, in general, don’t sacrifice clarity for speed, so be careful with nesting list comprehensions. ⚠️
  2. In pandas, use built-in vectorized functions. The principle is the same as for comprehensions: applying a function to a whole data structure at once is much faster than repeatedly calling a function on each element.
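Both tips can be sketched in a few lines. The snippet below is a minimal example, using the standard library’s timeit module for the comparison (the exact numbers will vary by machine):

```python
import timeit

import pandas as pd


# Append in a loop: the list's append attribute is looked up
# and called on every single iteration.
def squares_loop(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out


# List comprehension: the list is built in one pass,
# with no repeated attribute lookup.
def squares_comp(n):
    return [i * i for i in range(n)]


assert squares_loop(1_000) == squares_comp(1_000)

loop_t = timeit.timeit(lambda: squares_loop(1_000), number=1_000)
comp_t = timeit.timeit(lambda: squares_comp(1_000), number=1_000)
print(f"loop: {loop_t:.3f}s  comprehension: {comp_t:.3f}s")

# Same idea in pandas: one vectorized call beats a
# Python-level function call per element.
s = pd.Series(range(1_000))
vectorized = s * 2                      # single call into optimized C code
applied = s.apply(lambda x: x * 2)      # Python function called 1,000 times
assert vectorized.equals(applied)
```

The comprehension is usually the faster of the two loops, and the vectorized multiply is usually faster still.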

If you can, stay in pandas. 🐼

It’s a happy place. 😀

  1. Load only the columns that you need with the usecols argument when reading in your DataFrame. Less data in = win!
  2. Use dtypes efficiently. Downcast numeric columns to the smallest dtype that makes sense with pandas.to_numeric(). Convert columns with low cardinality (just a few unique values) to a categorical dtype. Here’s a pandas guide on efficient dtypes.
  3. Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine’s cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument n_jobs=-1 when doing cross validation with GridSearchCV and many other classes.
  4. Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.
  5. Use pd.eval to speed up pandas operations. Pass your usual pandas code to the function as a string and it performs the operation much faster. Benchmarks on a 100-column DataFrame are charted in this good article on the topic by Tirthajyoti Sarkar.
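The first two tips and pd.eval can be combined in a short sketch. The in-memory CSV and its column names below are made up for illustration:

```python
import io

import pandas as pd

# A stand-in for a large CSV file on disk (invented data).
csv = io.StringIO(
    "id,price,city,notes\n"
    "1,9.5,Austin,foo\n"
    "2,3.25,Boston,bar\n"
    "3,7.0,Austin,baz\n"
)

# Tip 1: read only the columns you need.
df = pd.read_csv(csv, usecols=["id", "price", "city"])

# Tip 2: downcast numeric columns and convert low-cardinality
# strings to the categorical dtype.
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["city"] = df["city"].astype("category")
print(df.dtypes)

# Tip 5: pd.eval evaluates a string expression over whole columns
# (using the numexpr engine when it's installed).
a, b = df["id"], df["price"]
total = pd.eval("a + b")
assert (total == df["id"] + df["price"]).all()
```

The dtype savings are small on three rows but grow linearly with the data; pd.eval’s advantage shows up mainly on large DataFrames.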
  1. Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.
  2. Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.
  3. Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.
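A minimal sketch of the sparse idea, plus the plain-pandas chunked-reading pattern that Dask parallelizes (the file name in the comment is hypothetical):

```python
import numpy as np
import pandas as pd

# A column that is almost all zeros wastes memory as a dense array.
dense = pd.Series(np.zeros(10_000))
dense.iloc[::1000] = 1.0  # only 10 non-zero values

# Convert to a sparse dtype: only the non-fill values are stored.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))
assert sparse.memory_usage(deep=True) < dense.memory_usage(deep=True)

# Dask generalizes chunked reading across cores and machines;
# plain pandas can already stream one chunk at a time on one core:
# for chunk in pd.read_csv("big.csv", chunksize=100_000):
#     process(chunk)
```

Converting to sparse only pays off when the fill value dominates; for mostly-dense data, the index overhead makes it worse.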
  1. You can use JAX in place of NumPy. JAX is an open-source Google project on the bleeding edge. It speeds up operations using five things under the hood: autograd, XLA, JIT compilation, vectorization, and parallelization. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. JAX is good for deep learning, too. It has a NumPy-compatible API but no pandas equivalent yet; however, you can convert a DataFrame to a NumPy array or TensorFlow tensor and then use JAX. Read more here.
  2. RAPIDS cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open-source Python package from NVIDIA. RAPIDS plays nicely with Dask, so you can get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.

Timing operations

If you want to time an operation in a Jupyter notebook, you can use the %timeit magic command. With a single percent sign it times one line; as a cell magic (%%timeit) it times the entire cell. %time and %%time work the same way but run the code just once instead of averaging many runs.
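Outside a notebook, the standard library’s timeit module gives you the same measurement — a minimal sketch:

```python
import timeit

# Time a small expression, like %timeit would in a notebook.
runs = 10_000
elapsed = timeit.timeit("sum(range(1000))", number=runs)
print(f"{elapsed / runs * 1e6:.1f} µs per loop")
```

timeit disables garbage collection during the measurement by default, which keeps the numbers stable.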

Storing big data

GitHub’s maximum file size is 100 MB. You can use the Git Large File Storage (LFS) extension if you want to version large files with GitHub.

Want to learn more?

The pandas docs have sections on enhancing performance and scaling to large datasets. Some of these ideas are adapted from those sections.

You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.
