Using Scripts Helps Me Realize the Drawbacks of Jupyter Notebook
Like most people, the first tool I used when I started learning data science was Jupyter Notebook. Most online data science courses use Jupyter Notebook as the teaching medium. This makes sense because it is easier for beginners to start writing code in Jupyter Notebook’s cells than to write a script with classes and functions.
Another reason Jupyter Notebook is so common in data science is that it makes it easy to explore and plot data. When we press ‘Shift + Enter’, we immediately see the results of the code, which makes it easy to tell whether the code works.
However, I noticed several drawbacks of Jupyter Notebook as I worked on more data science projects:
- Unorganized: As my code gets bigger, it becomes increasingly difficult for me to keep track of what I have written. No matter how many Markdown cells I use to separate the notebook into sections, the disconnected cells make it difficult for me to concentrate on what the code does.
- Difficult to experiment: You may want to try different methods of processing your data, or choose different parameters for your machine learning algorithm to see if the accuracy increases. But every time you experiment with a new method, you need to rerun the entire notebook. This is time-consuming, especially when the processing or the training takes a long time to run.
- Not ideal for reproducibility: If you want to use new data with a slightly different structure, it is difficult to identify the source of an error in your notebook.
- Difficult to debug: When you get an error in your code, it is difficult to know whether the reason for the error is the code or the change in data. If the error is in the code, which part of the code is causing the problem?
- Not ideal for production: Jupyter Notebook does not play very well with other tools. It is not easy to run the code from Jupyter Notebook while using other tools.
I knew there must be a better way to handle my code, so I decided to give scripts a try. These are the benefits I found when using scripts:
The cells in Jupyter Notebook make it difficult to organize the code into different parts. With a script, we could create several small functions, with each function's name stating what its code does.
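As a minimal sketch of that idea, the script below splits a tiny pipeline into three small functions. The function names, the row layout (plain dictionaries), and the `price`/`quantity` fields are all hypothetical and chosen only for illustration.

```python
def load_data(rows):
    """Return a copy of the raw rows so later steps never mutate the input."""
    return [dict(row) for row in rows]


def drop_missing(rows):
    """Keep only the rows in which no value is None."""
    return [row for row in rows if all(v is not None for v in row.values())]


def add_total(rows):
    """Add a 'total' field equal to price * quantity."""
    return [{**row, "total": row["price"] * row["quantity"]} for row in rows]


# Hypothetical sample data: the second row has a missing price.
raw = [
    {"price": 2.0, "quantity": 3},
    {"price": None, "quantity": 1},
]

clean = add_total(drop_missing(load_data(raw)))
print(clean)  # [{'price': 2.0, 'quantity': 3, 'total': 6.0}]
```

Each function does one thing, so reading the last line is enough to see the order of the steps.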
Better yet, if these functions belong to the same category, such as functions to process the data, we could put them in the same class!
Whenever we want to process our data, we know the functions in the class Preprocess can be used for this purpose.
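One possible shape for such a class is sketched below. Only the class name Preprocess comes from the text; the methods `drop_missing` and `strip_text` and the sample rows are assumptions made up for this example.

```python
class Preprocess:
    """Group related preprocessing steps in one place (hypothetical methods)."""

    def __init__(self, rows):
        # Copy the rows so the caller's data is left untouched.
        self.rows = [dict(row) for row in rows]

    def drop_missing(self):
        """Remove rows that contain a None value."""
        self.rows = [r for r in self.rows if None not in r.values()]
        return self  # returning self lets the steps be chained

    def strip_text(self):
        """Strip surrounding whitespace from every string value."""
        self.rows = [
            {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in self.rows
        ]
        return self


rows = [{"name": " Ann ", "age": 30}, {"name": "Bob", "age": None}]
processed = Preprocess(rows).drop_missing().strip_text().rows
print(processed)  # [{'name': 'Ann', 'age': 30}]
```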
When we want to experiment with a different approach to preprocessing the data, we could just add a function or comment one out, without being afraid to break the code! Even if we happen to break the code, we know exactly where to fix it.
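Here is a small sketch of turning a step off by commenting it out. The three text-cleaning functions are hypothetical stand-ins for whatever steps a real pipeline would have.

```python
def strip_whitespace(texts):
    return [t.strip() for t in texts]


def lowercase(texts):
    return [t.lower() for t in texts]


def remove_empty(texts):
    return [t for t in texts if t]


def process(texts):
    """Run the preprocessing steps in order."""
    texts = strip_whitespace(texts)
    # texts = lowercase(texts)  # disabled for this experiment
    texts = remove_empty(texts)
    return texts


result = process(["  Hello ", "", " World"])
print(result)  # ['Hello', 'World']
```

Because each step is its own line, re-enabling `lowercase` is a one-line change and the other steps are unaffected.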
We could also experiment with different parameters by changing the input of the functions. For example, if we want to see how different methods of resampling our Pandas series affect the results, we could just switch the argument to method_of_resample='average'. How neat!
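The sketch below shows the idea with a dependency-free stand-in for a Pandas resample: a plain function that aggregates a list in fixed-size windows. Only the parameter name `method_of_resample` and the value `'average'` come from the text; the function itself, the `window` parameter, and the `'sum'` option are assumptions for illustration.

```python
from statistics import mean


def resample(values, window, method_of_resample="sum"):
    """Aggregate consecutive chunks of `window` values with the chosen method."""
    aggregators = {"sum": sum, "average": mean}
    if method_of_resample not in aggregators:
        raise ValueError(f"Unknown method: {method_of_resample!r}")
    aggregate = aggregators[method_of_resample]
    return [
        aggregate(values[i:i + window])
        for i in range(0, len(values), window)
    ]


daily = [1, 2, 3, 4, 5, 6]

# Switching the experiment is a one-argument change.
by_sum = resample(daily, window=3, method_of_resample="sum")
by_average = resample(daily, window=3, method_of_resample="average")
print(by_sum, by_average)
```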
With classes and functions, we could make the code general enough so that it will be able to work with other data.
For example, if we want to drop different columns in our new data, we just need to change columns_to_drop to a list of the columns we want to drop and the code will run smoothly!
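A minimal sketch of that kind of general function follows. The parameter name `columns_to_drop` is from the text; the function name, the dictionary-based rows, and the sample columns are hypothetical.

```python
def drop_columns(rows, columns_to_drop):
    """Return new rows without the listed columns; missing columns are ignored."""
    unwanted = set(columns_to_drop)
    return [
        {key: value for key, value in row.items() if key not in unwanted}
        for row in rows
    ]


# Two datasets with slightly different structures.
old_data = [{"id": 1, "name": "a", "score": 0.5}]
new_data = [{"id": 2, "email": "b@example.com", "score": 0.7}]

# The same function handles both; only the argument changes.
old_clean = drop_columns(old_data, columns_to_drop=["name"])
new_clean = drop_columns(new_data, columns_to_drop=["email"])
print(old_clean, new_clean)
```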
With functions, it is easier to test whether a function produces the output we expect. We can quickly spot where in the code we should make a change to produce the output we want.
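As an illustration, here is a small function with a test written as plain assertions. The function `fill_missing` is hypothetical, and a test runner such as pytest could collect `test_fill_missing` as well, though the sketch just calls it directly.

```python
def fill_missing(values, fill_value=0):
    """Replace every None in the list with fill_value."""
    return [fill_value if v is None else v for v in values]


def test_fill_missing():
    # Each assertion pins down one expected behaviour of the function.
    assert fill_missing([1, None, 3]) == [1, 0, 3]
    assert fill_missing([None], fill_value=-1) == [-1]
    assert fill_missing([]) == []


test_fill_missing()  # raises AssertionError if any expectation fails
```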
If all of the tests pass but there is still an error in running our code, we know the data is where we should look next.
For example, after passing the test above, I still got a TypeError when running the script, which suggested that my data had null values. I just needed to take care of those to run the code smoothly.
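The situation can be reproduced with a sketch like the one below, where the code is correct but the data contains a null. The function names and the sample values are assumptions, not the original script.

```python
def total_sales(values):
    """Sum the values; Python raises TypeError if any value is None."""
    return sum(values)


def drop_nulls(values):
    """Remove None entries before any arithmetic is done."""
    return [v for v in values if v is not None]


data = [10.0, None, 5.5]  # hypothetical data with one null value

try:
    total_sales(data)
    failed = False
except TypeError:
    # The function is fine; the None in the data caused the error.
    failed = True

fixed_total = total_sales(drop_nulls(data))
print(failed, fixed_total)
```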
We can reuse the same functions in multiple scripts, or add a config file to control the values of the variables. This prevents us from wasting time tracking down a specific variable in the code just to change its value.
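A config-driven script might look like the sketch below. To stay self-contained it parses a JSON string that stands in for the contents of a config file; the keys and the `drop_columns` helper are hypothetical.

```python
import json

# Stand-in for the contents of a config file such as a JSON document on disk.
CONFIG_TEXT = '{"columns_to_drop": ["name"], "fill_value": 0}'


def load_config(text):
    """Parse the configuration text into a dictionary."""
    return json.loads(text)


def drop_columns(rows, columns_to_drop):
    """Return new rows without the listed columns."""
    unwanted = set(columns_to_drop)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


config = load_config(CONFIG_TEXT)
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# The value comes from the config, so changing it never touches this code.
cleaned = drop_columns(rows, config["columns_to_drop"])
print(cleaned)  # [{'id': 1}, {'id': 2}]
```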
It might not be intuitive to write code in scripts if you have just switched from Jupyter Notebook, but trust me, you will get used to scripts eventually.
Once that happens, you will start to see the many benefits of scripts over a messy Jupyter Notebook and want to write most of your code in scripts.
With that being said, Jupyter Notebook is still useful for exploring and visualizing data. Just be careful not to overuse the notebook, especially when you want to put your code into production.
If you don’t feel comfortable with the big change, start small.
Big changes start with small steps