In this lecture, we are going to understand the concept behind how neural networks actually learn. Till this lecture, we have covered what a neural network is. Now we are starting with how a neural network learns.

Here's a quick recap. A neural network is a network of cells. In our network, we are going to use sigmoid neurons, because these learn in a more controllable manner. In a sigmoid neuron, two things happen. We first multiply the input features with the weights, which are represented by W, and then add a bias, b, to this value. We name this value z. The second step is the application of the sigmoid activation function. That is, the cell calculates 1 / (1 + e^(-z)). This value is the output of the cell, and it is always between zero and one.

This output becomes the new input for the next layer. This continues till the last layer, till we get the final output of our network.

Now, the problem we are solving is this. We want to find the weights and biases of all the cells in the system so that the final output of this network is as close as possible to the actual value of the variable to be predicted.

For better understanding, let us calculate the number of variables we need to estimate.
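The two steps of a sigmoid neuron can be sketched in plain Python (a minimal illustration; the input, weight, and bias values below are made up for demonstration):

```python
import math

def sigmoid(z):
    # Sigmoid activation: output is always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    # Step 1: multiply inputs by weights and add the bias to get z.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Step 2: apply the sigmoid activation function to z.
    return sigmoid(z)

# Example with made-up values: two input features, two weights, one bias.
output = sigmoid_neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

The output of one such neuron then serves as an input to the neurons of the next layer.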
For this small network here, we have two neurons in the hidden layer and one neuron in the output layer, and there are two input features, x1 and x2.

So for the first neuron, we calculate w1 times x1, plus w2 times x2, plus b1, which is equal to z1. This z1 will be put into the activation function, that is, the sigmoid function, and that will be the output of this neuron. Let's say that the output of this neuron is represented by a1.

For the second neuron, we have two new weights, w3 and w4. We calculate w3 times x1, plus w4 times x2, plus b2, where b2 is the bias of this neuron, and this is equal to z2. We apply the activation function on z2 to get a2.

These a1 and a2 are the inputs to the final output neuron. For these two inputs, we need two new weights, w5 and w6. So the equation at this output neuron is w5 times a1, plus w6 times a2, plus b3, which gives z3. When we apply the activation function on this z3, we get the predicted output from this output neuron.

So if you look at the variables that we need to estimate: for weights, we have w1, w2, w3, w4, w5, and w6, so we are estimating six weights. For biases, we have b1, b2, and b3, that is, three biases.
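The forward calculation just described can be written out directly. This is only a sketch: the weight and bias values below are made up, since in a real network they would be learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two input features (made-up values).
x1, x2 = 0.5, 0.8

# Six weights and three biases, the nine variables to be estimated
# (illustrative values only).
w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
b1, b2, b3 = 0.01, 0.02, 0.03

# Hidden layer: two sigmoid neurons.
z1 = w1 * x1 + w2 * x2 + b1
a1 = sigmoid(z1)
z2 = w3 * x1 + w4 * x2 + b2
a2 = sigmoid(z2)

# Output neuron takes a1 and a2 as its inputs.
z3 = w5 * a1 + w6 * a2 + b3
y_pred = sigmoid(z3)
```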
So for this small network, we need to establish the values of nine variables to make this neural network ready for predictions.

Now, how do we find out the values of these weights and biases? The technique followed for this is called gradient descent.

Gradient descent is just another optimization technique to find the minimum of a function. There are other optimization techniques also, such as ordinary least squares, which is used in linear regression. But for a large number of features and complex relationships, gradient descent shows much better computational performance than other techniques. This means that if you have a large number of input variables and a very complex relationship between input and output, gradient descent will train the model in a much faster way as compared to other optimization techniques.

So let's first discuss the process followed in gradient descent in a stepwise manner.

We start by assigning random weight and bias values to all the cells in our network. Since all the weight and bias values are now available, that is, we have randomly assigned all the weights and biases, our model is ready to give an output.

The second step is that we input one training example. We use the x values of the training example and calculate the final output of the network using these weight and bias values.
The third step is that we compare the predicted values with the actual values, and we quantify the difference between these two using some error function, E. We will come back to this error function later. Remember that we have the actual y value because this was a training observation. So these actual values are being used to give feedback to our network about how badly it is performing.

The fourth step is that we try to find out which weights and biases, when changed, can reduce this error. Lastly, we update the values of the weights and biases, and we repeat this process from step two. This loop goes on as long as a further reduction in the error function can be achieved.

Among these steps, the first step is called initialization. Here, we just give some random initial values to the weights and biases.

The second step is called forward propagation. This is because in this step, we start with the input values, process them in layer one, then take the output of layer one and process it in layer two, and so on. Then we get one final predicted output. We are simply moving forward in terms of the layers of the network. So this is forward propagation.

The third and fourth steps are called backward propagation.
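The five steps above can be sketched as a training loop. This is a deliberately tiny illustration with a single sigmoid neuron and one made-up training example; it estimates the slope of the error numerically rather than using the backpropagation formulas, and the learning rate is an arbitrary choice.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Forward propagation for a single sigmoid neuron.
    return sigmoid(w * x + b)

def error(w, b, x, y):
    # Squared-error function E for one training example.
    return (predict(w, b, x) - y) ** 2

# Step 1: initialization with random weight and bias values.
random.seed(0)
w, b = random.random(), random.random()

x, y = 0.7, 1.0      # one made-up training example
lr, eps = 0.5, 1e-6  # learning rate and tiny step for the numerical slope

for _ in range(200):
    # Steps 2-3: forward propagation, then measure the error.
    e = error(w, b, x, y)
    # Step 4: find which way changing w and b moves the error.
    dw = (error(w + eps, b, x, y) - e) / eps
    db = (error(w, b + eps, x, y) - e) / eps
    # Step 5: update the weight and bias slightly, then repeat from step 2.
    w -= lr * dw
    b -= lr * db
```

Each pass through the loop performs forward propagation, compares prediction against the actual y, and nudges the parameters in the direction that reduces the error.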
In these steps, we already have the final error value, and we look backwards in our network to find out which weights and biases have the maximum impact on this error function. Once we establish which weights and biases have the maximum impact, we update these weights and biases slightly to reduce the error.

So this is the process we follow to implement gradient descent in neural networks. But we have still not discussed the concept behind gradient descent.

Gradient descent is a mathematical technique which is used to find the minimum of a function. Let's see this example. In the graph on the left, on the x axis I have a variable, and on the y axis I have a function applied on this variable. This is the plot of this function.

Now, if you want to find the value of x at which the function has its minimum value, there are two ways to do it. One is that if you know the exact relationship between x and the function, you can use calculus to find the minimum of this function. But as you know, in our machine learning problems, we do not have this exact relationship. So we use the second technique, which is an iterative technique. In this technique, we start at a random point on this plot.
So we have this value of x and f(x). Now, instead of focusing on the whole graph, we focus only on this small part of the graph and try to find out what happens if we slightly increase the value of x or decrease the value of x. In other words, we are trying to find out which way the slope is.

If the slope is negative, that is, like this, we increase the value of x a little bit, and then we will see that f(x) will decrease. Similarly, if the slope is positive, we decrease the value of x, which will slightly decrease the value of f(x).

We continue taking these small steps till we reach the final minimum point. When we are at this point, moving to either side only increases the value of the function. So we stop the process here.

This iterative technique of finding the instantaneous slope, also known as the gradient, and slightly moving down that slope, that is, the descent, is called gradient descent.

If you want to picture this, you can think of yourself being on top of a hill. You cannot see anything around you because it is dark and foggy. Now, you want to come down the hill as fast as possible. What do you do? Ideally, if you could see, you would spot the closest downhill path and run to it.
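The iterative idea can be sketched for a simple one-variable function. The function, the starting point, and the step size below are all made up for illustration; the slope is estimated from a tiny step, since, as in the lecture, we pretend not to know the exact relationship.

```python
def f(x):
    # A made-up function to minimize; its minimum is at x = 3.
    return (x - 3.0) ** 2

def slope(x, eps=1e-6):
    # Instantaneous slope (the gradient), estimated numerically
    # by nudging x slightly and seeing how f changes.
    return (f(x + eps) - f(x)) / eps

x = -4.0    # random starting point on the plot
step = 0.1  # size of each small step down the slope

for _ in range(100):
    # If the slope is negative, this increases x; if positive,
    # it decreases x. Either way, f(x) goes down.
    x -= step * slope(x)
```

After enough of these small steps, x settles near the minimum point, where moving to either side would only increase the function.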
But since you cannot see, and you know the gradient descent technique, you take a small step in each direction, see which direction goes more steeply down, and then you move from your current position to that new position. Then you again check which direction has the steepest slope and move. You keep doing this, and eventually you will come down the hill.

This is the concept behind gradient descent.

In the next lecture, we will merge these two ideas: the first is the process that neural networks use to implement gradient descent, and the second is what gradient descent is mathematically. We will merge these two and understand how gradient descent is helping us achieve the minimum in neural networks.