In this lecture we are going to understand the concept behind how neural networks actually learn. Till this lecture we have covered what a neural network is; now we are starting with how a neural network learns.

Here is a quick recap. A neural network is a network of cells. In our network we are going to use sigmoid neurons, because these learn in a more controllable manner. In a sigmoid neuron, two things happen. First, we multiply the input features with the weights, which are represented by w, and then add a bias b. This value we name z. The second step is the application of the sigmoid, or logistic, function; that is, the cell calculates 1 / (1 + e^(-z)). This value is the output of the cell and is always between 0 and 1. This output becomes the new input for the next layer, and this continues till the last layer, till we get the final output of our network.

Now, the problem we are solving is this: we want to find the weights and biases of all the cells in the network so that the final output of the network is as close as possible to the actual value of the variable to be predicted.

For better understanding, let us calculate the number of variables we need to estimate for this small network. Here we have two neurons in the hidden layer and one neuron in the output layer, and there are two input features, x1 and x2. So for the first neuron we calculate w1·x1 + w2·x2 + b1 = z1. This z1 is put into the activation function, that is the sigmoid function, and that gives the output of this neuron. Let us say the output of this neuron is represented by a1. For the second neuron we have two new weights, w3 and w4. We calculate w3·x1 + w4·x2 + b2 = z2, where b2 is the bias of this neuron. We apply the activation function on z2 to get a2. These a1 and a2 are the inputs to the final output neuron. For these two inputs we need two new weights, w5 and w6, so the equation at the output neuron is w5·a1 + w6·a2 + b3 = z3. When we apply the activation function on z3, we get the predicted output from this output neuron.

So if we look at the variables that we need to estimate: for the weights we have w1, w2, w3, w4, w5 and w6, so we are estimating 6 weights, and for the biases we have b1, b2 and b3, that is 3 biases.
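To make this forward calculation concrete, here is a minimal Python sketch of the same small network. The numeric values of x1, x2, the weights w1 to w6 and the biases b1 to b3 are assumptions chosen only for illustration; in practice they would be learned, as the rest of the lecture explains.

import math

def sigmoid(z):
    # Logistic function: output is always between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# Two input features (assumed example values)
x1, x2 = 0.5, -1.0

# Six weights and three biases (assumed values, normally learned)
w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.4, 0.3, -0.5, 0.6
b1, b2, b3 = 0.0, 0.1, -0.1

# Hidden neuron 1: weighted sum z1, then sigmoid gives a1
z1 = w1 * x1 + w2 * x2 + b1
a1 = sigmoid(z1)

# Hidden neuron 2: weighted sum z2, then sigmoid gives a2
z2 = w3 * x1 + w4 * x2 + b2
a2 = sigmoid(z2)

# Output neuron takes a1 and a2 as its inputs
z3 = w5 * a1 + w6 * a2 + b3
y_pred = sigmoid(z3)

print(y_pred)  # predicted output, always between 0 and 1

Counting the names in this sketch confirms the tally in the lecture: six weights and three biases, nine variables in total.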
So for this small network we need to establish the values of nine variables to make this neural network ready for predictions.

Now, how do we find out the values of these weights and biases? The technique followed for this is called gradient descent. Gradient descent is just another optimization technique to find the minimum of a function. There are other optimization techniques as well, such as ordinary least squares, which is used in linear regression, but for a large number of features and complex relationships, gradient descent shows much better computational performance than the other techniques. This means that if you have a large number of input variables and a very complex relationship between input and output, gradient descent will train the model much faster compared to other optimization techniques.

So let's first discuss the process followed in gradient descent in a stepwise manner. We start by assigning random weight and bias values to all the cells in our network. Since all the weight and bias values are now available, that is, we have randomly assigned all the weights and biases, our model is ready to give an output.

The second step is to input one training example. We use the x values of the training example and calculate the final output of the network using these weight and bias values.

The third step is that we compare the predicted values with the actual values and note the difference between the two using some error function. We will come back to this error function later. Remember that we have the actual y value because this is a training observation, so these actual values are being used to give feedback to our network about how badly it is performing.

The fourth step is that we try to find all those weights and biases, changing which we can reduce this error.

Lastly, we update the values of the weights and biases and repeat the process from step two. This loop goes on till no further reduction in the error function can be achieved.

Among these steps, the first step is called initialization. Here we just give some random initial values to the weights and biases. The second step is called forward propagation. This is because in this step we start with the input values, process them in layer 1, then take the output of layer 1 and process it in layer 2, and so on, till we get one final predicted output. We are simply moving forward through the layers of the network, so this is forward propagation.
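As a rough sketch of this five-step loop, here is a minimal Python example for a single sigmoid neuron with one weight and one bias. The tiny data set, the squared-error function and the numerical estimate of the slope are my own simplifications for illustration; the lecture has not yet specified the error function, and the proper way to find the update direction, backpropagation, is discussed next.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Step 2: forward propagation for one training example
    return sigmoid(w * x + b)

def error(w, b, x, y):
    # Step 3: compare the prediction with the actual y (squared error, assumed)
    return (predict(w, b, x) - y) ** 2

# Assumed toy training data: one input feature x and one target y
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

# Step 1: initialization with random weight and bias values
w, b = random.uniform(-1, 1), random.uniform(-1, 1)

eps, lr = 1e-4, 0.5   # nudge size for the slope estimate, and the update step size
for _ in range(1000):                  # Step 5: repeat from step 2
    for x, y in data:
        # Step 4: estimate how the error changes when w or b is changed slightly
        dw = (error(w + eps, b, x, y) - error(w - eps, b, x, y)) / (2 * eps)
        db = (error(w, b + eps, x, y) - error(w, b - eps, x, y)) / (2 * eps)
        # Move w and b slightly in the direction that reduces the error
        w -= lr * dw
        b -= lr * db

print(round(w, 2), round(b, 2))

The loop here stops after a fixed number of passes; in the lecture's description it would stop once no further reduction in the error function can be achieved.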
The third and fourth steps are called backward propagation. In these steps we already have the final error value, and we look backwards in our network to find out which weights and biases have the maximum impact on this error function. Once we establish which weights and biases have the maximum impact, we update these weights and biases slightly to reduce the error.

So this is the process we follow to implement gradient descent in neural networks, but we have still not discussed the concept behind gradient descent.

Gradient descent is a mathematical technique which is used to find the minimum of a function. Let's see this example. In the graph on the left, on the x axis I have a variable x, and on the y axis I have a function applied on this variable, f(x). This is the plot of this function. Now, if you want to find out the value of x at which the function has its minimum value, there are two ways to do it. One is, if you know the exact relationship between x and the function, you can use calculus to find the minimum of this function. But as you know, in our machine learning problems we do not have this exact relationship. So we use a second technique, which is an iterative technique. In this technique we start at a random point on this plot; say we have this value of x and f(x). Now, instead of focusing on the whole graph, we focus only on this small part of the graph and try to find out what happens if we slightly increase or decrease the value of x. In other words, we are trying to find out which way the slope goes. If the slope is negative, we increase the value of x a little bit, and then we will see that f(x) also decreases. Similarly, if the slope is positive, we decrease the value of x, which will slightly decrease the value of f(x).

We continue taking these small steps till we reach the final minimum point. When we are at this point, moving to either side only increases the value of the function, so we stop the process here. This iterative technique of finding the instantaneous slope, also known as the gradient, and slightly moving down that slope, that is the descent, is called gradient descent.

If you want to picture this, you can think of yourself standing on top of a hill. You cannot see anything around you because it is dark and foggy. Now you want to come down the hill as fast as possible. What do you do? Ideally, if you could see, you would spot the closest downhill point and run to it.
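To see this iterative idea in code, here is a minimal Python sketch of gradient descent on a single variable. The function f(x) = (x - 3)^2, the starting point and the step size are assumptions for illustration only; the function's minimum is at x = 3, so the loop should walk towards that value.

def f(x):
    # Assumed example function, with its minimum at x = 3
    return (x - 3.0) ** 2

def slope(x, eps=1e-5):
    # Instantaneous slope (the gradient), estimated from a tiny change in x
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 10.0       # start at an arbitrary point on the plot
step = 0.1     # size of each small step down the slope
for _ in range(200):
    g = slope(x)
    # Positive slope: move x down a little; negative slope: move x up a little
    x -= step * g

print(x, f(x))  # x ends up very close to 3, where f(x) is minimal

Each pass looks only at the local slope, not at the whole graph, which is exactly the idea described above.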
But since you cannot see, and you know the gradient descent technique, you take a step in each direction, see which direction goes more steeply downhill, and then you move from your current position to that new position. Then you again take a small step in each direction, check which direction has the steepest slope, and move there. You keep doing this, and eventually you will come down the hill. This is the concept behind gradient descent.

In the next lecture we will merge the two ideas: the first is the process that neural networks use to implement gradient descent, and the second is what gradient descent is mathematically. We will merge these two and understand how gradient descent helps us achieve the minimum in neural networks.