Learn how to automate Data Science code using Jenkins
Let’s paint a scenario: you’re working on a Data Science project, and at first you have a model with 80% accuracy, which you deploy to production as an API using Flask. A few days later you pick the project up again, and after tuning some parameters and adding more data, you get better accuracy than the previous model. Now you plan to deploy this new model, and you have to go through the trouble of building, testing and deploying it to production all over again, which is a lot of work. In this article, I will show you how we can use a powerful tool called Jenkins to automate this process.
Jenkins is a free and open-source automation server. It helps automate the parts of software development related to building, testing, and deploying, facilitating continuous integration and continuous delivery — Wikipedia
With Jenkins, you can automate and accelerate software delivery throughout the entire lifecycle using a vast array of plugins. For example, you can set up Jenkins to automatically detect a commit in a code repository and trigger commands in response, whether that’s building a Docker image from a Dockerfile, running unit tests, pushing an image to a container registry or deploying it to a production server, all without doing anything manually. I’ll be explaining some basic concepts we need to know in order to automate our Data Science project. Here are some reasons to use Jenkins:
- It is Open Source
- Easy to use and install
- A large number of plugins that fit into a DevOps environment
- Spend more time on your code and less time on deployment
- Massive community
Jenkins supports installation across platforms, whether you’re a Windows, Linux or Mac user. You can even install it on a cloud instance, whether it runs PowerShell (Windows) or Linux. To install Jenkins, you can refer to the documentation here.
Jenkins has a lot of amazing features, some of which are beyond the scope of this article; to get the hang of Jenkins, you can check the documentation.
Before we jump into the practical side of things, there are some important terms I want to explain:
A Jenkins job simply refers to a runnable task that is controlled by Jenkins. For instance, you can assign a job to Jenkins to perform a certain operation, like printing “Hello World” or running unit and integration tests. Creating a job is very easy in Jenkins, but in a software environment you usually won’t build a single job; instead, you’ll be building what is referred to as a pipeline.
A pipeline is a collection of jobs run in a particular order or sequence. Let me explain this with an example. Suppose I am developing an application on Jenkins and I want to pull the code from a code repository, build the application, test it and deploy it to a server. To do this, I will create four jobs to perform each of those processes. The first job (Job 1) will pull the code from the repository, the second job (Job 2) will build the application, the third job (Job 3) will perform unit and integration tests, and the fourth job (Job 4) will deploy the code to production. I can use the Jenkins Build Pipeline plugin to perform this task: after creating the jobs and chaining them in a sequence, the plugin will run them as a pipeline.
Types of Jenkins Pipeline:
- Declarative pipeline: This is a newer feature that supports the pipeline-as-code concept. It makes pipeline code easier to read and write. The code is written in a Jenkinsfile, which can be checked into a source control management system such as Git.
- Scripted pipeline: This is the older way of writing pipeline code. Using this method, the pipeline code is written directly in the Jenkins user interface instead of in a file. Both types of pipeline perform the same function, and both use the same scripting language (Groovy).
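As a quick illustration, here is a minimal declarative Jenkinsfile sketch; the stage names and echo messages are arbitrary placeholders, not part of this project:

```groovy
// Minimal declarative pipeline: each stage is a named group of steps.
pipeline {
    agent any  // run on any available Jenkins agent
    stages {
        stage('Build') {
            steps {
                echo 'Building the application...'
            }
        }
        stage('Test') {
            steps {
                echo 'Running unit and integration tests...'
            }
        }
    }
}
```

The whole definition lives in a file called Jenkinsfile at the root of the repository, which is what lets you version your pipeline alongside your code.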
After talking about the major concepts, let’s build a simple mini project and automate it with Jenkins.
This project contains a trained Machine Learning model that detects sentiments relating to suicidal tweets from Twitter, which I deployed as an API using Flask. I structured my Jenkins pipeline to:
Pull changes from the repository when a commit is made >>> Build Docker Image >>> Push Built Image to DockerHub >>> Remove Unused Docker Images.
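A declarative Jenkinsfile for those four stages could look roughly like the sketch below. The image name (`yourname/sentiment-api`) and the `dockerhub` credentials ID are placeholders you would replace with your own values, not the actual names from the project:

```groovy
// Sketch of the four-stage pipeline described above.
// "yourname/sentiment-api" and the "dockerhub" credentials ID
// are placeholders, not the project's actual values.
pipeline {
    agent any
    stages {
        stage('Clone Repository') {
            steps {
                checkout scm  // pull the code that triggered this build
            }
        }
        stage('Build Docker Image') {
            steps {
                sh 'docker build -t yourname/sentiment-api:latest .'
            }
        }
        stage('Push to DockerHub') {
            steps {
                withCredentials([usernamePassword(credentialsId: 'dockerhub',
                        usernameVariable: 'USER', passwordVariable: 'PASS')]) {
                    sh 'echo $PASS | docker login -u $USER --password-stdin'
                    sh 'docker push yourname/sentiment-api:latest'
                }
            }
        }
        stage('Remove Unused Images') {
            steps {
                sh 'docker image prune -f'  // clean up dangling images
            }
        }
    }
}
```

The `withCredentials` step comes from the Credentials Binding plugin and keeps the Docker Hub password out of the pipeline code and the build logs.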
Start up a Jenkins server and install the Git, Docker and Pipeline plugins, and also install Git and Docker on the instance itself. For this article, I used Jenkins on an AWS EC2 instance.
Push the code to a repository; in this article I used GitHub. You can find the code for this article here.
My working directory:

```
│ └── data_preprocessing.py
│ └── model.pkl
```
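For the “Build Docker Image” stage to work, the repository also needs a Dockerfile. A minimal sketch might look like the following; `app.py`, `requirements.txt` and the Python version are assumptions for illustration, not the project’s actual files:

```dockerfile
# Hypothetical Dockerfile for the Flask API (file names are assumptions).
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```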
Next, we need to tell Jenkins to start building the pipeline whenever a change is made in the code repository. To do this, you need to add a Jenkins webhook to the GitHub repository so that GitHub can notify Jenkins when the code changes:
Click on settings in your code repository
Then navigate to webhooks and click on Add Webhook
Then use the public DNS or public IP of your Jenkins server with “/github-webhook/” appended at the end, and select application/json as the content type. Under “Which events would you like to trigger this webhook?”, choose “Let me select individual events”, select pushes only, and click on Add webhook.
Now, head over to Jenkins and click on New Item
Enter an Item name, Select Pipeline and click OK
Scroll down from General to Build Triggers and select “GitHub hook trigger for GITScm polling”. This simply makes Jenkins build our pipeline whenever there’s a commit in our code repository.
Then scroll to Pipeline, and under Definition select “Pipeline script from SCM” from the dropdown. Under SCM (Source Code Management), select Git and enter the Repository URL; if your repository is private, you’ll need to create a credential using your username and password. In the Branch Specifier, select master or whichever branch your code is on, and for the Script Path enter “Jenkinsfile” (the Jenkinsfile is where I wrote the pipeline for this project). Then select Save.
Now let’s build the pipeline…
It ran through all the stages and the build was successful
And if you check your Docker Hub repository you’ll find the docker image there.
Now let us make a little change in our code repository and push
Now if you go back to Jenkins, you’ll see that it has already detected the changes and automatically triggered another build.
And that’s how you can use Jenkins to automate these processes.
Thanks for reading 😃