Let’s start the discussion on forward propagation in an artificial neural network (ANN).
I will try to explain it in a very simple way. If anything is difficult to understand, please let me know in the comment section below.
First of all, we decide how many layers our neural network has. Let’s say 2.
Input layer -> hidden layer -> output layer. By convention, the input layer is not counted as a layer. So, we have 2 layers (1 hidden layer and the output layer).
Suppose we have an input layer X and the transfer function Z = W.X + B, where W holds the weights and B the bias.
Before going further, I want to clarify the above picture.
If we have 1 single example with 1 input variable, then the input is a scalar. If we have 1 example but 2 input variables, then it is a vector, and if we have 2 examples with 2 input variables, then it is a matrix, as shown in the above picture. I will not discuss tensors today; I will cover them later.
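To make the three cases concrete, here is a small NumPy sketch. The numbers are made up, and laying out the matrix with input variables as rows and examples as columns is my assumption, not something fixed by the picture.

```python
import numpy as np

# One example, one input variable: a scalar.
x_scalar = np.array(3.0)

# One example, two input variables: a vector.
x_vector = np.array([3.0, 5.0])

# Two examples, two input variables each: a matrix.
# Assumption: rows are input variables, columns are examples.
x_matrix = np.array([[3.0, 4.0],
                     [5.0, 6.0]])

# ndim tells us how many axes each input has.
print(x_scalar.ndim, x_vector.ndim, x_matrix.ndim)
```

The number of axes grows from 0 (scalar) to 1 (vector) to 2 (matrix).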
Forward propagation means we construct the flow of functions from the input up to the output layer to get the predicted value Yhat.
So the flow is:
X -> Z = W.X + B -> Activation function(Z) -> Yhat.
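The flow above can be sketched in a few lines of NumPy. This is only an illustration for a single layer: the sigmoid activation, the helper name forward and all the numbers are my assumptions. In a 2-layer network the same step would simply be applied twice, with the hidden layer's output used as the input of the output layer.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation function: squashes any value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, B):
    # Transfer function, then activation, as in the flow above.
    Z = W @ X + B
    Yhat = sigmoid(Z)
    return Z, Yhat

# Hypothetical values: 2 input variables (rows) x 2 examples (columns).
X = np.array([[0.5, 1.5],
              [1.0, -2.0]])
W = np.array([[0.2, -0.4]])  # 1 output unit x 2 input variables
B = np.array([[0.1]])        # bias, broadcast over the examples

Z, Yhat = forward(X, W, B)
print(Z)
print(Yhat)
```

Each column of Yhat is the prediction for one example.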
Now the interesting part! We want to minimize the error, that is, the gap between the actual value Y and the predicted value Yhat.
This gap is measured by the loss function
loss = -y log(yhat) - (1-y) log(1-yhat).
How to interpret it?
Log 1 = 0, and the log of a value close to 0 is a very large negative number. That means when a value is near 1, minus the log of that value is near 0, and when a value is near 0, minus the log of that value is very large.
Now if Y is 0, then the loss function is -log(1-yhat).
To minimize the loss function, the value of (1-yhat) must be as close to 1 as possible. If (1-yhat) = 1, then yhat = 1-1 = 0.
Now if Y is 1, then the loss function is -log(yhat).
To minimize the loss function, yhat must be as close to 1 as possible.
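A quick numerical check of this intuition. The function name loss and the sample predictions are only for illustration.

```python
import math

def loss(y, yhat):
    # Loss for one example; yhat must be strictly between 0 and 1.
    return -y * math.log(yhat) - (1 - y) * math.log(1 - yhat)

# (actual value, predicted value) pairs chosen for illustration.
for y, yhat in [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]:
    print(y, yhat, round(loss(y, yhat), 3))
```

The loss is small when yhat is close to the actual value and large when it is far from it.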
I hope that the intuition is clear to you.
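Now the next question is how to minimize the loss function. Before starting the discussion on backward propagation, I want to remind you of the cost function, which is nothing but the loss function taken over all our training examples together. It is usually written as the average loss over the m training examples: J(W, B) = (1/m) * sum of loss(yhat_i, y_i) for i = 1 to m.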
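As a sketch, assuming the cost is the average of the per-example loss over all training examples (a common convention), it can be computed like this. The arrays good and bad are made-up predictions.

```python
import numpy as np

def cost(Y, Yhat):
    # Per-example loss, then the average over all examples (assumed).
    losses = -Y * np.log(Yhat) - (1 - Y) * np.log(1 - Yhat)
    return losses.mean()

Y = np.array([1, 0, 1, 0])               # actual values
good = np.array([0.9, 0.1, 0.8, 0.2])    # predictions close to Y
bad = np.array([0.4, 0.6, 0.3, 0.7])     # predictions far from Y

print(cost(Y, good), cost(Y, bad))
```

Predictions that are closer to the actual values give a lower cost.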
To minimize the loss or cost function, we apply the bowl idea behind the gradient descent algorithm.
Suppose you are on top of a hill and need to go down to reach home. What is the safest way? Choose the path that goes down most directly from where you stand and step down carefully, in baby steps. This way you reach the bottom of the hill through many tiny steps.
Our goal is the same. If we draw our loss function, it looks like the above bowl. Our intention is to reach the minimum loss no matter where on the hill we start. So we do the same thing: we gradually step down according to the slope, in tiny steps.
The size of each tiny step is controlled by the learning rate, which is denoted by alpha. The slope is the derivative of the function. I recommend reading up on the basics of calculus to understand derivatives.
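Now we are interested in d(loss)/dW and d(loss)/dB.
By the chain rule of calculus, d(loss)/dW = (d(loss)/dZ) * (dZ/dW).
To calculate d(loss)/dZ, we use the loss function together with the activation Yhat = g(Z). When g is the sigmoid function, this works out to
g(Z) - Y, that is, Yhat - Y. To calculate dZ/dW, we use the function Z = W.X + B, which gives X. In the same way, dZ/dB = 1, so d(loss)/dB = d(loss)/dZ.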
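Here is a minimal sketch of the chain rule for one example with one input variable. It assumes g is the sigmoid function, in which case d(loss)/dZ simplifies to yhat - y. The helper names and the numbers are hypothetical, and the result is compared against a numerical slope as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    yhat = sigmoid(w * x + b)
    return -y * np.log(yhat) - (1 - y) * np.log(1 - yhat)

def gradients(w, b, x, y):
    yhat = sigmoid(w * x + b)
    dZ = yhat - y   # d(loss)/dZ, assuming a sigmoid activation
    dW = dZ * x     # chain rule: dZ/dW = x
    dB = dZ         # chain rule: dZ/dB = 1
    return dW, dB

# Hypothetical values for one example.
w, b, x, y = 0.5, -0.2, 2.0, 1.0
dW, dB = gradients(w, b, x, y)

# Numerical slope: nudge the parameter a little and see how the loss changes.
eps = 1e-6
dW_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
dB_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dW, dW_numeric)
print(dB, dB_numeric)
```

The analytical and numerical values should agree closely, which is a handy way to check a derivative.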
Now, the algorithm is, for each iteration of training:
W = W - (alpha * d(loss)/dW).
Alpha is a hyperparameter, and its value is decided before the process starts. A common starting value is 0.01, and you can change it based on how your algorithm performs. If we make it very big, the updates overshoot and the parameter never converges to the value that minimizes the loss function. If the value is too small, it takes a very long time to converge.
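The effect of alpha is easy to see on the simplest possible bowl, loss = w * w, whose slope is 2 * w. This toy function, the starting point and the helper name descend are assumptions chosen only for illustration.

```python
def descend(alpha, steps=50, w=1.0):
    # Gradient descent on the toy bowl loss = w * w.
    for _ in range(steps):
        slope = 2 * w           # derivative of w * w
        w = w - alpha * slope   # the update rule from the text
    return w

for alpha in (0.01, 0.1, 1.1):
    print(alpha, descend(alpha))
```

With the small alpha, w is still far from the minimum at 0 after 50 steps. With the medium alpha it is very close to 0, and with the large alpha it overshoots on every step and moves further and further away.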
Now, after multiple iterations, the value of W moves down the slope, and at the minimum point the derivative is zero. If you remember, the slope is always zero at the minimum point of the curve.
At this point, the updates become zero, so the algorithm stops changing the parameter and you get the final value of W. In practice, we usually stop after a fixed number of iterations or when the updates become very small.
The same thing applies to B.
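Putting the pieces together, a complete training loop could look like the sketch below. The toy data, the sigmoid activation, alpha = 0.1, the 1000 iterations and averaging the gradients over the m examples are all my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 1 input variable, 4 examples (columns); Y is 1 when x > 0.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])

W = np.zeros((1, 1))
B = np.zeros((1, 1))
alpha = 0.1
m = X.shape[1]   # number of training examples

costs = []
for _ in range(1000):
    # Forward propagation.
    Yhat = sigmoid(W @ X + B)
    costs.append(np.mean(-Y * np.log(Yhat) - (1 - Y) * np.log(1 - Yhat)))

    # Gradients by the chain rule, averaged over the examples.
    dZ = Yhat - Y
    dW = dZ @ X.T / m
    dB = dZ.sum(axis=1, keepdims=True) / m

    # Gradient descent update for both parameters.
    W = W - alpha * dW
    B = B - alpha * dB

print(costs[0], costs[-1])
print(sigmoid(W @ X + B))
```

The cost at the end is lower than at the start, and the predictions move toward the actual values.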
At the end, you will have the final formula Z = W.X + B with well-fitted values of W and B, so that the predicted value is close to the actual value.
I hope the explanation is clear to you.