In this lecture, we are going to understand the concept behind how neural networks actually learn. Till this lecture, we have covered what a neural network is. Now we are starting with how a neural network learns.

Here's a quick recap. A neural network is a network of cells. In our network, we are going to use sigmoid neurons, because these learn in a more controllable manner. In a sigmoid neuron, two things happen. We first multiply the input features with the weights, which are represented by W, and then add a bias, b, to this value. We name this value z. The second step is the application of the sigmoid activation function. That is, the cell calculates 1 / (1 + e^(-z)). This value is the output of the cell, and it is always between zero and one.

This output becomes the new input for the next layer. This continues till the last layer, till we get the final output of our network.

Now, the problem we are solving is this. We want to find the weights and biases of all the cells in the system so that the final output of this network is as close as possible to the actual value of the variable to be predicted.

For better understanding, let us calculate the number of variables we need to estimate.
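The two steps of a sigmoid neuron can be sketched in plain Python (a minimal illustration; the input, weight, and bias values below are made up for demonstration):

```python
import math

def sigmoid(z):
    # Sigmoid activation: output is always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    # Step 1: multiply inputs by weights and add the bias to get z.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Step 2: apply the sigmoid activation function to z.
    return sigmoid(z)

# Example with made-up values: two input features, two weights, one bias.
output = sigmoid_neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

The output of one such neuron then serves as an input to the neurons of the next layer.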
For this small network here, we have two neurons in the hidden layer and one neuron in the output layer, and there are two input features, x1 and x2.

So for the first neuron, we calculate w1 times x1, plus w2 times x2, plus b1, which is equal to z1. This z1 will be put into the activation function, that is, the sigmoid function, and that will be the output of this neuron. Let's say that the output of this neuron is represented by a1.

For the second neuron, we have two new weights, w3 and w4. We calculate w3 times x1, plus w4 times x2, plus b2, where b2 is the bias of this neuron, and this is equal to z2. We apply the activation function on z2 to get a2.

These a1 and a2 are the inputs to the final output neuron. For these two inputs, we need two new weights, w5 and w6. So the equation at this output neuron is w5 times a1, plus w6 times a2, plus b3, which gives z3. When we apply the activation function on this z3, we get the predicted output from this output neuron.

So if you look at the variables that we need to estimate: for weights, we have w1, w2, w3, w4, w5, and w6, so we are estimating six weights. For biases, we have b1, b2, and b3, that is, three biases.
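The forward calculation just described can be written out directly. This is only a sketch: the weight and bias values below are made up, since in a real network they would be learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two input features (made-up values).
x1, x2 = 0.5, 0.8

# Six weights and three biases, the nine variables to be estimated
# (illustrative values only).
w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
b1, b2, b3 = 0.01, 0.02, 0.03

# Hidden layer: two sigmoid neurons.
z1 = w1 * x1 + w2 * x2 + b1
a1 = sigmoid(z1)
z2 = w3 * x1 + w4 * x2 + b2
a2 = sigmoid(z2)

# Output neuron takes a1 and a2 as its inputs.
z3 = w5 * a1 + w6 * a2 + b3
y_pred = sigmoid(z3)
```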
So for this small network, we need to establish the values of nine variables to make this neural network ready for predictions.

Now, how do we find out the values of these weights and biases? The technique followed for this is called gradient descent.

Gradient descent is just another optimization technique to find the minimum of a function. There are other optimization techniques also, such as ordinary least squares, which is used in linear regression. But for a large number of features and complex relationships, gradient descent shows much better computational performance than other techniques. This means that if you have a large number of input variables and a very complex relationship between input and output, gradient descent will train the model in a much faster way as compared to other optimization techniques.

So let's first discuss the process followed in gradient descent in a stepwise manner.

We start by assigning random weight and bias values to all the cells in our network. Since all the weight and bias values are now available, that is, we have randomly assigned all the weights and biases, our model is ready to give an output.

The second step is that we input one training example. We use the x values of the training example and calculate the final output of the network using these weight and bias values.
The third step is that we compare the predicted values with the actual values, and we quantify the difference between these two using some error function, E. We will come back to this error function later. Remember that we have the actual y value because this was a training observation. So these actual values are being used to give feedback to our network about how badly it is performing.

The fourth step is that we try to find out which weights and biases, when changed, can reduce this error. Lastly, we update the values of the weights and biases, and we repeat this process from step two. This loop goes on as long as a further reduction in the error function can be achieved.

Among these steps, the first step is called initialization. Here, we just give some random initial values to the weights and biases.

The second step is called forward propagation. This is because in this step, we start with the input values, process them in layer one, then take the output of layer one and process it in layer two, and so on. Then we get one final predicted output. We are simply moving forward in terms of the layers of the network. So this is forward propagation.

The third and fourth steps are called backward propagation.
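The five steps above can be sketched as a training loop. This is a deliberately tiny illustration with a single sigmoid neuron and one made-up training example; it estimates the slope of the error numerically rather than using the backpropagation formulas, and the learning rate is an arbitrary choice.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Forward propagation for a single sigmoid neuron.
    return sigmoid(w * x + b)

def error(w, b, x, y):
    # Squared-error function E for one training example.
    return (predict(w, b, x) - y) ** 2

# Step 1: initialization with random weight and bias values.
random.seed(0)
w, b = random.random(), random.random()

x, y = 0.7, 1.0      # one made-up training example
lr, eps = 0.5, 1e-6  # learning rate and tiny step for the numerical slope

for _ in range(200):
    # Steps 2-3: forward propagation, then measure the error.
    e = error(w, b, x, y)
    # Step 4: find which way changing w and b moves the error.
    dw = (error(w + eps, b, x, y) - e) / eps
    db = (error(w, b + eps, x, y) - e) / eps
    # Step 5: update the weight and bias slightly, then repeat from step 2.
    w -= lr * dw
    b -= lr * db
```

Each pass through the loop performs forward propagation, compares prediction against the actual y, and nudges the parameters in the direction that reduces the error.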
In these steps, we already have the final error value, and we look backwards in our network to find out which weights and biases have the maximum impact on this error function. Once we establish which weights and biases have the maximum impact, we update these weights and biases slightly to reduce the error.

So this is the process we follow to implement gradient descent in neural networks. But we have still not discussed the concept behind gradient descent.

Gradient descent is a mathematical technique which is used to find the minimum of a function. Let's see this example. In the graph on the left, on the x axis I have a variable, and on the y axis I have a function applied on this variable. This is the plot of this function.

Now, if you want to find the value of x at which the function has its minimum value, there are two ways to do it. One is that if you know the exact relationship between x and the function, you can use calculus to find the minimum of this function. But as you know, in our machine learning problems, we do not have this exact relationship. So we use the second technique, which is an iterative technique. In this technique, we start at a random point on this plot.
So we have this value of x and f(x). Now, instead of focusing on the whole graph, we focus only on this small part of the graph and try to find out what happens if we slightly increase the value of x or decrease the value of x. In other words, we are trying to find out which way the slope is.

If the slope is negative, that is, like this, we increase the value of x a little bit, and then we will see that f(x) will decrease. Similarly, if the slope is positive, we decrease the value of x, which will slightly decrease the value of f(x).

We continue taking these small steps till we reach the final minimum point. When we are at this point, moving to either side only increases the value of the function. So we stop the process here.

This iterative technique of finding the instantaneous slope, also known as the gradient, and slightly moving down that slope, that is, the descent, is called gradient descent.

If you want to picture this, you can think of yourself being on top of a hill. You cannot see anything around you because it is dark and foggy. Now, you want to come down the hill as fast as possible. What do you do? Ideally, if you could see, you would spot the closest downhill path and run to it.
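The iterative idea can be sketched for a simple one-variable function. The function, the starting point, and the step size below are all made up for illustration; the slope is estimated from a tiny step, since, as in the lecture, we pretend not to know the exact relationship.

```python
def f(x):
    # A made-up function to minimize; its minimum is at x = 3.
    return (x - 3.0) ** 2

def slope(x, eps=1e-6):
    # Instantaneous slope (the gradient), estimated numerically
    # by nudging x slightly and seeing how f changes.
    return (f(x + eps) - f(x)) / eps

x = -4.0    # random starting point on the plot
step = 0.1  # size of each small step down the slope

for _ in range(100):
    # If the slope is negative, this increases x; if positive,
    # it decreases x. Either way, f(x) goes down.
    x -= step * slope(x)
```

After enough of these small steps, x settles near the minimum point, where moving to either side would only increase the function.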
But since you cannot see, and you know the gradient descent technique, you take a small step in each direction, see which direction goes more steeply down, and then you move from your current position to that new position. Then you again check which direction has the steepest slope and move. You keep doing this, and eventually you will come down the hill.

This is the concept behind gradient descent.

In the next lecture, we will merge these two ideas: the first is the process that neural networks use to implement gradient descent, and the second is what gradient descent is mathematically. We will merge these two and understand how gradient descent is helping us achieve the minimum in neural networks.