The need for product managers to drive business impact with machine learning is ever growing. At the time of writing this article, I was launching ML-driven products and features as a Sr. Product Manager at Amazon. During this time, I’ve spent a lot of time learning and using ML concepts in my day-to-day job. In the hope that my experiences may inform your learning, I’m recording my takeaways in this series, “A Product Manager’s Guide to Machine Learning”.
This article is for product managers who want to go a little deeper than the core ML process concepts: Set Objective, Get Data, Split Data, Train, Validate, Test, Evaluate, and Launch.
With the above-mentioned ML process in the background, I’ve found 3 core concepts to be the trunk and big branches of ML.
The 3 core concepts are 1) Loss 2) Optimization and 3) Evaluation.
You don’t have to know the twigs and leaves of these concepts. I will use simple visuals and language to communicate them, with linear regression as the running example.
Disclaimer: each of these topics is so expansive and detailed that one could write a book on each. This article is a simple introduction to the main ideas.
The ‘Jobs to be Done’ theory can be applied to machine learning too. Before we jump into the 3 core concepts, let’s take a minute to ask: what is the job of a (linear regression) algorithm? Why are we hiring the algorithm?
The purpose of a linear regression algorithm is to position a line among data points (Figure 1: blue dots). The goal is to learn what information X (call it the feature) carries about y (call it the target or label), so that we can predict y for a new or unknown X. Here X and y are quantitative in nature.
For example: how accurately can we estimate the impact of X on y? Think of advertising (X) on sales (y), number of rooms (X) on house price (y), height (X) on weight (y), etc. If there is only one feature, it is called Simple Linear Regression and we fit a line between X and y. If there are several Xs, it’s called Multiple Linear Regression and we fit a plane (or, with more than two features, a hyperplane) between the Xs and y.
The algorithm goes through many variations of lines, as shown in Figure 2, to give us the best model. Even a kid can tell that the far right fit is the best fit.
The output of a linear regression model is an equation that captures the information the Xs have about y. This happens when the model accurately learns the parameters β, something like the equation in Figure 3. The business needs those parameters to decide how to allocate limited resources and where to drive impact.
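To make the "output is an equation" idea concrete, here is a minimal Python sketch that fits a single-feature line with the standard closed-form least-squares formulas. The data is made up (it lies exactly on y = 1 + 2X), and the function name `fit_simple_linear_regression` is a hypothetical helper for illustration, not part of any library.

```python
# Made-up data for illustration: advertising spend (X) and sales (y).
# The points lie exactly on the line y = 1 + 2 * X.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


beta0, beta1 = fit_simple_linear_regression(X, y)
print(f"y = {beta0:.2f} + {beta1:.2f} * X")  # y = 1.00 + 2.00 * X
```

The two learned numbers are the β parameters: here, each extra unit of X is associated with two extra units of y.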
Earlier I said a kid can tell that the far right fit in Figure 2 is the best fit. But how? What is the intuition behind such a conclusion? That intuition can be captured by looking at the distance between the actual value (blue dot) and the predicted value (a point on the line).
The greater the distance between the actual and predicted values, the worse the prediction. Are you saying “duh”? This is called loss, a penalty for a poor prediction. An algorithm can be paired with one or more types of loss, and each is defined by a loss function. The linear regression models we’ll examine here use a loss function called squared loss. The squared loss for a single example is the square of the difference between the actual value and the predicted value: (y − ŷ)².
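A minimal Python sketch of this penalty, using made-up numbers. The function name `squared_loss` and the sample values are illustrative assumptions, not taken from any library.

```python
def squared_loss(actual, predicted):
    """Penalty for one example: the squared gap between actual and predicted."""
    return (actual - predicted) ** 2


# Made-up example: a house sold for 300 (thousand) and the model predicted 280.
print(squared_loss(300, 280))  # 400

# Over a data set, the per-example losses are added up.
# This total is the sum of squared residuals.
actuals = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]
total_loss = sum(squared_loss(a, p) for a, p in zip(actuals, predictions))
print(total_loss)  # 1.25
```

Because the gap is squared, a miss of 20 costs four times as much as a miss of 10, so large errors dominate the total.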
We also want the model to predict well on data it has not seen, so we split the data into train and test sets. Working with different sets of data introduces the concept of variance (the model producing a different fit for different data sets), and with it over-fitting and its counterpart, under-fitting. To keep the model from picking up the peculiarities of the training set, we turn to yet another concept: regularization. Regularization builds on the sum of squared residuals, our original loss function.
- Ordinary (least squares) regression minimizes the sum of squared residuals. Call this (1).
- Ridge regression, also called L2 regularization, reduces the complexity of the model by penalizing large weights: (1) + λ * sum of squared weights. Call the added penalty (2).
- Lasso regression, also called L1 regularization, reduces the complexity of the model by shrinking uninformative coefficients all the way to zero: (1) + λ * sum of absolute weights. Call the added penalty (3).
- Elastic Net combines both penalties: (1) + (2) + (3), with a separate λ for each penalty.
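The loss variants in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names, the data, and the λ values are made up, the weights are assumed to exclude the intercept, and elastic net is taken to be the residual term plus both penalties, each with its own λ.

```python
def sum_squared_residuals(actuals, predictions):
    """Ordinary regression loss: squared gaps added over all examples."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))


def ridge_loss(actuals, predictions, weights, lam):
    """Residual loss plus the L2 penalty: lam * sum of squared weights."""
    penalty = lam * sum(w ** 2 for w in weights)
    return sum_squared_residuals(actuals, predictions) + penalty


def lasso_loss(actuals, predictions, weights, lam):
    """Residual loss plus the L1 penalty: lam * sum of absolute weights."""
    penalty = lam * sum(abs(w) for w in weights)
    return sum_squared_residuals(actuals, predictions) + penalty


def elastic_net_loss(actuals, predictions, weights, lam_l1, lam_l2):
    """Residual loss plus both penalties, each scaled by its own lambda."""
    l1_penalty = lam_l1 * sum(abs(w) for w in weights)
    l2_penalty = lam_l2 * sum(w ** 2 for w in weights)
    return sum_squared_residuals(actuals, predictions) + l1_penalty + l2_penalty


# Made-up values for illustration only.
actuals = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]
weights = [2.0, -0.5]  # assumed: model coefficients, intercept not included

plain = sum_squared_residuals(actuals, predictions)
ridge = ridge_loss(actuals, predictions, weights, lam=0.1)
lasso = lasso_loss(actuals, predictions, weights, lam=0.1)
enet = elastic_net_loss(actuals, predictions, weights, lam_l1=0.1, lam_l2=0.1)
print(plain, round(ridge, 3), round(lasso, 3), round(enet, 3))
```

With the same predictions, every regularized loss is larger than the plain one. The only way for the model to bring it back down is to use smaller weights, which is how the penalty discourages overly complex fits.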
You might be asking: there could be thousands or millions of ways to place the line among the data points, so how is the best one found? To visualize it, look at Figure 5, which is meant to show what could actually be happening on the way to the right fit.
You can find hordes of books on this topic, enough to fill a small library. As a Product Manager, I do not need to know, nor am I expected to know, the leaves and branches of the optimization described in these books. The big idea is simple: optimize. If you own a paper route, you optimize the route, i.e. more papers delivered in the least amount of time. If you have kids, you optimize for toys that create the least amount of mess. If you are an ML model, you optimize the fit to minimize the loss function.
In our example, we are minimizing the squared distance between the actual y and the predicted y. This process of minimizing the loss can take anywhere from milliseconds to days. There are different ways to go about finding the least sum of squares. That is to say, there are various optimization algorithms that accomplish the objective: 1) Gradient Descent, 2) Stochastic Gradient Descent, 3) Adagrad, and 4) RMSProp, to name a few. By convention, most optimization algorithms are concerned with minimization.
For example, in Figure 6, we can follow the gradient downhill to the lowest point of the loss function; the parameter values at that point become the intercept and other parameters of the output equation in Figure 3.
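Here is a minimal Python sketch of plain (batch) gradient descent for a single-feature line, minimizing the mean of the squared losses. The data, the learning rate, the step count, and the function name `gradient_descent` are all assumptions chosen for illustration; real projects typically rely on a library optimizer rather than a hand-written loop.

```python
# Made-up data that lies exactly on the line y = 1 + 2 * X.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]


def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Walk downhill on the mean squared loss and return (intercept, slope)."""
    intercept, slope = 0.0, 0.0  # arbitrary starting point on the loss surface
    n = len(xs)
    for _ in range(steps):
        # Signed error of the current line on every example.
        errors = [(intercept + slope * x) - yi for x, yi in zip(xs, ys)]
        # Gradient of the mean squared loss with respect to each parameter.
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step in the direction that reduces the loss.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


intercept, slope = gradient_descent(X, y)
print(f"y = {intercept:.3f} + {slope:.3f} * X")
```

The learning rate controls the size of each downhill step. Too small and the walk takes a long time; too large and the steps overshoot the bottom and the loss grows instead of shrinking.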
Once the loss is identified and reduced, we arrive at the final core concept: evaluation. Often, this is where business owners meet the model and familiarize themselves with how well it performs against the business objective.
In linear regression, you can evaluate the model with Mean Squared Error (the loss function averaged over the examples), where smaller is better, and with R-squared and Adjusted R-squared, where higher is better.
- R-squared measures the proportion of the variation in your dependent variable (y) explained by your independent variables (X) for a linear regression model.
- Adjusted R-squared adjusts that statistic for the number of independent variables in the model, so adding a variable that explains little does not inflate the score.
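These three metrics can be computed by hand, which helps when reading a model report. The sketch below uses the standard textbook formulas; the data and the function names are made up for illustration.

```python
def mean_squared_error(actuals, predictions):
    """Average squared gap between actual and predicted values."""
    n = len(actuals)
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / n


def r_squared(actuals, predictions):
    """Share of the variation in the actual values explained by the model."""
    mean_actual = sum(actuals) / len(actuals)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    ss_total = sum((a - mean_actual) ** 2 for a in actuals)
    return 1 - ss_residual / ss_total


def adjusted_r_squared(actuals, predictions, num_features):
    """R-squared discounted for the number of features used by the model."""
    n = len(actuals)
    r2 = r_squared(actuals, predictions)
    return 1 - (1 - r2) * (n - 1) / (n - num_features - 1)


# Made-up actual values and model predictions.
actuals = [3.0, 5.0, 7.0, 9.0, 11.0]
predictions = [2.8, 5.3, 6.9, 9.2, 10.8]

mse = mean_squared_error(actuals, predictions)
r2 = r_squared(actuals, predictions)
adj_r2 = adjusted_r_squared(actuals, predictions, num_features=1)
print(round(mse, 4), round(r2, 4), round(adj_r2, 4))
```

On this toy data the model explains over 99% of the variation, and the adjusted score comes out slightly lower than R-squared because it charges a small price for the one feature used.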
Applications of machine learning are awe-inspiring. Don’t let the math and vocabulary deter you from pursuing machine learning. As you can see, the core concepts are familiar and rudimentary. As much as product managers need machine learning, machine learning needs product managers who make the best use of it. I hope these core ideas will help you think of the right questions as you collaborate with your ML team.