Skip to content
Generic filters
Exact matches only

An Introduction to Gradient Descent and Backpropagation

Abhijit Roy

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.

We, as humans can study data to find the behavior and predict something based on the behavior, but a machine can’t really operate like us. So, in most cases, it tries to learn from already established examples. Say, for a classic classification problem, we have a lot of examples from which the machine learns. Each example is a particular circumstance or description which is depicted by a combination of features and their corresponding labels. In our real-world, we have a different description for every different object and, we know these different objects by different names. For example, cars and bikes are just two object names or two labels. They have different descriptions like the number of wheels is two for a bike and four for a car. So, the number of wheels can be used to differentiate between a car and a bike. It can be a feature to differentiate between these two labels.

Every common aspect of the description of different objects which can be used to differentiate it from one another is fit to be used as a feature for the unique identification of a particular object among the others. Similarly, we can assume, the age of a house, the number of rooms and the position of the house will play a major role in deciding the costing of a house. This is also very common in the real world. So, these aspects of the description of the house can be really useful for predicting the house price, as a result, they can be really good features for such a problem.

In machine learning, we have mainly two types of problems, classification, and regression. The identification between a car and a bike is an example of a classification problem and the prediction of the house price is a regression problem.

We have seen for any type of problem, we basically depend upon the different features corresponding to an object to reach a conclusion. The machine does a similar thing to learn. It also depends on the different features of objects to reach a conclusion. Now, in order to differentiate between a car and a bike, which feature will you value more, the number of wheels or the maximum speed or the color? The answer is obviously first the number of wheels, then the maximum speed, and then the color. The machine does the same thing to understand which feature to value most, it assigns some weights to each feature, which helps it understand which feature is most important among the given ones.

Now, it tries to devise a formula, like say for a regression problem,

Equation 1

Here w1,w2, w3 are the weights of there corresponding features like x1,x2, x3 and b is a constant called the bias. Its importance is that it gives flexibility. So, using such an equation the machine tries to predict a value y which may be a value we need like the price of the house. Now, the machine tries to perfect its prediction by tweaking these weights. It does so, by comparing the predicted value y with the actual value of the example in our training set and using a function of their differences. This function is called a loss function.

Equation 2

The machine tries to decrease this loss function or the error, i.e tries to get the prediction value close to the actual value.

Gradient descent for MSE

In this diagram, above we see our loss function graph. If we observe we will see it is basically a parabolic shape or a convex shape, it has a specific global minimum which we need to find in order to find the minimum loss function value. So, we always try to use a loss function which is convex in shape in order to get a proper minimum. Now, we see the predicted results depend on the weights from the equation. If we replace equation 1 in equation 2 we obtain this graph, with weights in X-axis and Loss on Y-axis.

Initially, the model assigns random weights to the features. So, say it initializes the weight=a. So, we can see it generates a loss which is far from the minimum point L-min.

Now, we can see that if we move the weights more towards the positive x-axis we can optimize the loss function and achieve minimum value. But, how will the machine know? We need to optimize weight to minimize error, so, obviously, we need to check how the error varies with the weights. To do this we need to find the derivative of the Error with respect to the weight. This derivative is called Gradient.

Gradient = dE/dw

Where E is the error and w is the weight.

Let’s see how this works. Say, if the loss increases with an increase in weight so Gradient will be positive, So we are basically at the point C, where we can see this statement is true. If loss decreases with an increase in weight so gradient will be negative. We can see point A, corresponds to such a situation. Now, from point A we need to move towards positive x-axis and the gradient is negative. From point C, we need to move towards negative x-axis but the gradient is positive. So, always the negative of the Gradient shows the directions along which the weights should be moved in order to optimize the loss function. So, this way the gradient guides the model whether to increase or decrease weights in order to optimize the loss function.

The model found which way to move, now the model needs to find by how much it should move the weights. This is decided by a parameter called Learning Rate denoted by Alpha. the diagram we see, the weights are moved from point A to point B which are at a distance of dx.

dx = alpha * |dE/dw|

So, the distance to move is the product of learning rate parameter alpha and the magnitude of change in error with a change in weight at that point.

Now, we need to decide the Learning Rate very carefully. If it is very large the values of weights will be changed with a great amount and it would overstep the optimal value. If it is very low it takes tiny steps and takes a lot of steps to optimize. The updated weights are changed according to the following formula.

w=w — alpha * |dE/dw|

where w is the previous weight.

With each epoch, the model moves the weights according to the gradient to find the best weights.

Now, this is a loss optimization for a particular example in our training dataset. Our dataset contains thousands of such examples, so it will take a huge time to find optimal weights for all. Experiments have shown that if we optimize on only one sample of our training set, the weight optimization is good enough for the whole dataset. So, depending upon the methods we have different types of gradient descent mechanisms.

Gradient Descent Methods

  1. Batch Gradient Descent: When we train the model to optimize the loss function using the mean of all the individual losses in our whole dataset, it is called Batch Gradient Descent.
  2. Mini-Batch Gradient Descent: Now, as we discussed batch gradient descent takes a lot of time and is therefore somewhat inefficient. If we look at SGD, it is trained using only 1 example. So, how good do you think a baby will learn if it is shown only one bike and told to learn about all other bikes? It’s simple its decision will be somewhat biased to the peculiarities of the shown example. So, it is the same for the SGD, there is a possibility that the model may get too biased with the peculiarity of that particular example. So, we use the mean of a batch of 10–1000 examples to check the optimize the loss in order to deal with the problems.

The Maths

This equation shows the change in error with a change output prediction for E= MSE.


So, this is pretty clear from basic maths. Here ‘i’ can be any integer from 0 to the number of features-1.

According to the problem, we need to find the dE/dwi0, i.e the change in error with the change in the weights.

Now, from chain rule, we can tell the following,

So, we know both the values from the above equations.

This is the final change in Error with the weights. Now, let’s look for updating the new weights.

I missed a few notations here, Y output and Y pred are the same.

These are the new weight values updated.

We also need to update the bias value also. It is done in a similar manner.

Linear classification

In other words, a problem like this where the two classes, can easily be separated using drawing a straight line which we can easily devise using equation 1.

Now, imagine doing so, for the following graph.

Non-linearly related data

Will it be possible to classify the points using a normal linear model? The answer is no. Well, one thing to note is we can solve these types of problems using feature crossing and creating linear features from these non-linear ones. These are used in the kernel methods of machine learning. We won’t be talking about it though as it is out of scope for this blog. But, in all those cases we need to tell the machine how to devise that feature that can be easily used to convert the non-linear problem to a linear one.

In our daily lives, we usually face non-linear problems only, so each time it is hard for us to devise the feature crossing for the classification of the following classes. Here is where the neural networks are used. Neural networks are capable of coming up with a non-linear equation that is fit to serve as a boundary between non-linear classes.

The way the Neural Network achieve such non-linear equations is through activation functions. These activation functions are the units of non-linearity. They are used at every layer in a Neural Network. Now, in neural networks, we stack such layers one over the others. Finally, we obtain complex functions using cascaded functions like f(f(x)).

The common types of activation function are:

  1. ReLU: f(x)= max(0,x)
  2. Sigmoid:

3. Tanh

Why is Backpropagation Required?

The loss for Neural Networks.

Now, as we see in the graph the loss function may look something like this. As we can see it has two minima, a local one and a global one. So, if we somehow end up in the local one we will end up in a suboptimal state. So, here the point where the weights initialize matters. For example, if the weights initialize to somewhere near x1 and there is a high chance we will get stuck at the local minima, which is not the same with normal MSE.

Secondly, Neural networks are of different structures.

Neural Network diagram.

So, in neural nets the result Y-output is dependent on all the weights of all the edges. So, the error is obtained at the last output node and then we need to change w-12 and w-13 accordingly. So, we need to backpropagate the error all the way to the input node from the output node.

So, let’s say,

Wij is the weight of the edge from the output of the ith node to the input of the jth node. Now, here the x is the input to every node. y is the output from every node. Except for the input node, for all nodes,


where F is the activation function.

For Input node,


Now, we can see, the hidden layer nodes have a function F1 but in the output layer, it is F2. The F1 is usually ReLU and F2 is usually a Sigmoid.

So for optimization of weights, we need to know the dE /dWij for every Wij in the network.

For this, we also need to, find the dE/dXi and dE/dYi for every node in the network.

Forward Propagation

For example,

So, generalize,

We can use these formulas.

We will try to trace a few nodes.

Here, we can trace the paths or routes of the inputs and outputs of every node very clearly. Now, we will see the cascading functions building.

In the Y4 and Y5, we can see the cascading of the non-linear activation function, to create the classifier equation. The more we stack up the layers, the more cascading occurs, the more our classifier function becomes complex.

Maths for Backpropagation

The error generated is backpropagated from the through the same nodes and the same edges through which forward propagation takes place and reaches from input edges from the output node.

The first step, the output node,

This is the derivative of the error with respect to the Y output at the final node.

From chain rule.

Now considering F2 as sigmoid function,

For sigmoid, f ‘ (x) has that property.

Now, we need to find dE/dW56

So, here also we use the chain rule. We obtain the values:

We will try this for two more layers and try to generalize a formula.

We try to calculate dE/ dY5 so that we could move to the next level.

We obtain both dE/dY5 and dE/dY4. Now we go for the change in error for a change in input for node 5 and node 4.

Once we obtain the change with the input we can easily calculate the change in error with the change in weights of the edges incident on that input using the same method we used for W56. Here I am directly writing the result.

These are the changes of error with a change in the weights of edges. Now, calculate for Y2 and Y3. But, one thing to notice is, when we are going to calculate the change in error with a change in Y2 and Y3 from backpropagation, they will be affected by both the edges from Y5 and Y4.

So, the change will be a sum of the effect of change in node 4 and node 5.

We can calculate the effects in a similar way we calculated dE/dY5

So, we can generalize it to:

where the ith node is in the Lth layer and the jth node is at the (L+1)th layer.

Now, once we find, the change in error with a change in weight for all the edges. We can update the weights and start learning for the next epoch using the formula.

where alpha is the learning rate. This is how the backpropagation algorithm actually works.


Most used optimizers are:

  1. Adam
  2. Adagrad
  3. RMSProp
  4. SGD.

Adam is the most commonly used optimizer.