Can You Package Your Machine Learning Models and Pick the Best One as You Need?
In machine learning, there is no single best model, so you may need to train different models and try out different parameters until you find the one that gives the highest accuracy. The more experiments you run, the more likely you are to find a high-performing model.
But you may feel discouraged to create many experiments because:
- There are too many changes in parameters and results for you to keep track of
- Even if you can keep track of the changes by carefully logging the inputs and outputs, how do you make sure you use the same model on future data?
The questions above are the motivation for reproducibility. In machine learning, reproducibility means being able to recreate a machine learning workflow and reach the same outputs as the original work. That is where we need tools to efficiently log the inputs and outputs as well as save our models.
Wouldn't it be nice if you could have a log of all your experiments, with the dates, the information about the data, parameters, results, and models, all in one place?
My favorite tools to achieve this goal are Hydra and MLflow. Let’s discover how to:
- Create powerful configuration files to keep track of inputs with Hydra.
- Keep a log of outputs and serve models easily with MLflow.
First of all, what is a configuration file and why do we need it? A configuration file contains plain-text parameters that define settings or preferences for building or running a program. In data science, a config file can be used to define the data and the parameters for our experiments.
For example, we could use a config file to record the data, hyperparameters, model, and metrics for each experiment.
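As a concrete illustration, such a config file might look like the YAML below. The file name, keys, and values here are made-up examples, not a required schema:

```yaml
# conf/config.yaml - example experiment configuration
data:
  path: data/train.csv
model:
  name: logistic_regression
params:
  C: 1.0
  max_iter: 100
metrics:
  - accuracy
  - f1
```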
To read the config file, we will use Hydra. Why Hydra? With Hydra, we can compose our configuration dynamically, making it easy to get the right configuration for each run. To install Hydra, simply run
pip install hydra-core --upgrade
Add the Hydra decorator above our main function, with the path to our configuration file as its parameter. Then use config, or any other name, as the parameter of your main function. Let's say we want to read our training data:
Want to reuse the code but change the data? We just need to change the data path in our configuration file!
MLflow is an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment. I covered how to use MLflow as one of the tools for tuning hyperparameters here:
MLflow also allows us to serve our models easily and efficiently. Install MLflow with
pip install mlflow
Now our parameters, metrics, and model are saved. To access them, simply run
mlflow ui
You will see a list of runs. Click on a run to see the information related to it.
In the artifacts section at the end of the page, we can see the config files that record our inputs, as well as the model we trained.
To use the model, find its full path by clicking the model's name listed in the artifacts, then load the model from that path.
Finally, we can combine these two powerful tools like this:
To put everything together: we use a config file to save the inputs and Hydra to read them, and we use MLflow to log the outputs and the model. Now we can freely create many experiments while still being able to keep track of, compare, and reproduce the results!
If you still don’t see the complete picture of how to incorporate MLflow and Hydra into your data science project, you can find an example of the workflow described in this article in my NLP project.
Congratulations! You have learned how to use Hydra and MLflow to make your machine learning workflow reproducible. I encourage you to explore these two tools further to create an efficient pipeline that fits your purpose. Remember:
For every minute spent in organizing, an hour is earned
By organizing the structure of your code, you will save hours of debugging in your machine learning workflow. With these two tools, I find my workflow much less cluttered, and I hope the same for you.
I like to write about basic data science concepts and play with different data science tools. Follow me on Medium to get updated about my latest articles. You could also connect with me on LinkedIn and Twitter.
Check out my other blogs on math and data science topics: