OK. In the last lecture, we learned what gradient descent is. In this lecture, we are going to see how to use this mathematical technique to find the optimum W's and B's.

For this, we first need to understand the error function, which we discussed in the last lecture.

Here are the five steps that we use to implement gradient descent. The first step is to give random values to all W's and B's in the system. Then we take one training example and put its X values as input to our system. We process through the entire network to get one predicted value.

Now, in the third step, I told you that we measure the distance between the predicted and the actual value using an error function. Let's see what this means.

Suppose we predicted an output of 0.3, while the actual value is 0. One way of calculating the error of the prediction could be just to subtract these two, that is, finding actual minus predicted, which will be 0 minus 0.3, giving us -0.3.

To remove this negative sign in the error and focus only on the magnitude of the error, we can simply put an absolute function or a square function on top of it, meaning that -0.3 would become 0.3, or it would be squared and become 0.09.
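As a quick sketch, not from the lecture's slides, the two magnitude-only error measures just described look like this in code, using the same numbers (actual = 0, predicted = 0.3):

```python
# Two simple ways to strip the sign from a prediction error,
# using the lecture's numbers: actual = 0, predicted = 0.3.
actual = 0.0
predicted = 0.3

raw_error = actual - predicted   # -0.3: the sign only shows direction
abs_error = abs(raw_error)       # 0.3: magnitude via the absolute function
squared_error = raw_error ** 2   # 0.09: magnitude via squaring

print(raw_error, abs_error, squared_error)
```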
These two are good measures of error, but they do not work well when we are doing classification with neural networks. For this purpose, we use a different function.

This one is called the cross-entropy loss function. It is represented by this formula: E = -(y × log(y-hat) + (1 - y) × log(1 - y-hat)). Here y represents the actual value and y-hat represents the predicted output value.

I know this looks complex, much more complex than the error functions that we saw in the last slide. But the reason for using it is that this function does not have local minima. That is, the graph of this function looks like the one on the left and not like the one on the right. If a function has local minima, gradient descent won't work properly: it might stop here instead of finding the global minimum, which is here.

If you don't understand the last comment, don't worry about it. The simple takeaway is: for classification problems, the error function to be used is this cross-entropy error function.

We can take a look at this error function to build some intuition around it. As you know, in classification problems, the output value is either 0 or 1.
So if the actual output value is 1, the second part of this function, the one with the (1 - y) factor, will become zero, because 1 minus 1 is 0. If the actual output is 0, then the first term of this equation will become zero, and only the second term will remain.

So let's see: if the actual output is 1, for this error function to be at its minimum, its value should be as close to zero as possible. If y equals 1, that is -(1 × log(y-hat) + (1 - 1) × log(1 - y-hat)), the second term becomes zero. So we are left with only -log(y-hat).

For this error to be small, -log(y-hat) should be small. This implies that log(y-hat) should be large, which further implies that y-hat should be large. Since our predicted output is between 0 and 1, y-hat being large simply means that y-hat should be as close to 1 as possible.

Similarly, if the actual value of the output is 0, the first term of this equation will be zero. So the remaining error function would be -log(1 - y-hat). For this error to be small, log(1 - y-hat) has to be large, implying that 1 - y-hat has to be large, implying that y-hat should be as small as possible.
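The two cases just walked through can be checked numerically. This is a minimal sketch using the natural logarithm; the direction of the argument is the same for any log base:

```python
import math

def cross_entropy(y, y_hat):
    # E = -(y*log(y_hat) + (1 - y)*log(1 - y_hat))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Actual output 1: the loss shrinks as y-hat approaches 1.
assert cross_entropy(1, 0.99) < cross_entropy(1, 0.5) < cross_entropy(1, 0.01)

# Actual output 0: the loss shrinks as y-hat approaches 0.
assert cross_entropy(0, 0.01) < cross_entropy(0, 0.5) < cross_entropy(0, 0.99)

print("intuition confirmed")
```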
Although I have not given you the mathematical justification for using this function, I guess that with these particular examples you are getting a feel for how minimizing the error or loss function is trying to match the predicted output value to the actual value.

So now you may have guessed: the job of gradient descent is to find the minimum of this error function. That is, we will make small changes to the values of the weights and biases in the direction where we get the maximum decrease in error. We will continue changing W's and B's as long as a further decrease in error is possible.

This is how the process looks graphically. For ease of understanding, I have represented all the weights on one axis and all the biases on another axis, and on the vertical axis we have the corresponding value of the error. These values of error are calculated using the error function.

OK. So now let's revisit our steps to implement gradient descent. Again, the first step is setting random initial values of W and B. Then we go forward to get the predicted output value. Then we put this predicted output value into our loss function to get the error of the prediction.

Now we have the error for these W's and B's. Say we have a W value between 1 and 2, a bias value between 0 and -1, and an error value near 1.
So we are nearly here on this graph.

Now, in the fourth step, we do backward propagation to find the direction of movement on this graph. This means we find delta W and delta B, that is, the change in W's and B's that will take us toward the minimum point.

If you look at this graph, you can probably see that by decreasing the weight and increasing the bias values, we will be moving closer to the lowest point.

So basically, we have initial W's and B's. We will be updating our W to W minus alpha times delta W, and we will be updating our B to B minus alpha times delta B. Here alpha is called the learning rate.

Basically, delta W and delta B are unit steps that we calculate using calculus. Alpha controls the number of those steps we take in that direction.

You can imagine the impact of large versus small values of alpha. If alpha is large, we are taking multiple steps in the direction of gradient descent. This means that we can reach the bottom faster. But the problem with a large alpha is that we can overshoot the minimum. Imagine you are very near the bottom, but on the next move you take 50 steps instead of just one. In such a situation, you will plainly go to the other side.
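The overshooting effect is easy to reproduce on a toy error function. This is a hypothetical demo, not from the lecture: gradient descent on E(w) = w squared, whose gradient is 2w, with a moderate versus a large learning rate:

```python
def descend(alpha, steps=10):
    # Gradient descent on the toy error E(w) = w**2, gradient dE/dw = 2*w.
    w = 1.0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return abs(w)  # distance from the minimum at w = 0

print(descend(alpha=0.1))  # moderate alpha: moves toward 0
print(descend(alpha=1.1))  # large alpha: overshoots and diverges
```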
So a large learning rate can help in faster descent, but can face issues in the final stages of convergence. Therefore, a moderate value of the learning rate is to be used. You will see what value of the learning rate is to be used in the practical section of this course.

Very well. So the steps to be taken in the direction of the descent are alpha times delta W and alpha times delta B.

Now, how do we find delta W and delta B here? Delta W is the change in weight and delta B is the change in bias. Basically, we will change the initially set W's and B's in an effort to reduce the error.

Now, let us see how to find delta W and delta B. These values are found by doing backward propagation, which means we will look back through the network to find the instantaneous slope of the error with respect to each W and B.

Let me take an example with a single neuron to show you how this happens. Otherwise, the mathematics and calculus involved can get quite messy and is often overwhelming for some students.

If you are comfortable with calculus, you can look at the complete backpropagation theory in the link shared in the description of this lecture. However, I think that with this simple example you will get a solid intuition of how backpropagation works.
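Before the single-neuron example, the update rule just stated can be written in one line per parameter. The numbers here are hypothetical, only to show the mechanics:

```python
# The gradient-descent update rule: parameter minus alpha times its delta.
# alpha, delta_w and delta_b are hypothetical values for illustration.
alpha = 0.5
w, b = 1.5, -0.5
delta_w, delta_b = 0.8, -0.4   # slopes found by backward propagation

w = w - alpha * delta_w   # 1.5 - 0.5*0.8     = 1.1
b = b - alpha * delta_b   # -0.5 - 0.5*(-0.4) = -0.3
print(w, b)
```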
Here is a single neuron with two inputs, x1 and x2. It first calculates the linear part, that is, it will calculate the value of z, which is equal to w1·x1 + w2·x2 + b1. It then applies a sigmoid to this value of z. This sigmoid of z is the predicted output of this neuron. We use this predicted output together with the actual output to get the error for this particular training example.

So let's start with step one. Step one is that we have to randomly initialize the values of the weights and the bias. We have two weights and one bias. We randomly initialize w1 to be 2, w2 to be 3, and the bias value to be -4.

Now, the second step is forward propagation. That is, we will take one training example and put in the input values of that training example to get a predicted output. We have taken this training example in which the x1 value is 10, the x2 value is -4, and the output is 1. This y is the actual output, and it is equal to 1.

So we have the w1 value, we have x1, we have w2, x2 and b1, so we can calculate z. We put in all these values to get a z value of 4.

We apply the activation function.
That is, we apply the sigmoid function on this value of z to get the predicted output of this neuron. The sigmoid of z, that is, the sigmoid of 4, gives a predicted output of 0.982. This predicted output value is the y-hat value that we will use in the error function.

You can see that this value is already very close to the actual output, which is 1. But let's see how we can improve this value.

Now, the third step is the error calculation. We have the error function with us, we have the predicted output value, that is, y-hat as 0.982, and we have the actual output value for the training example as 1. We put these two values into the error function to get a final error value of 0.0079.

Now comes the fourth step, which is backpropagation. The next few minutes are going to be a little heavy on mathematics; we will cover some basics of calculus here. If you are not comfortable with this part, it is still OK. This is happening in the background and your software is handling it. But if you have some understanding of calculus, looking at this example will tell you how a neuron does backpropagation.
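Steps one to three for this single neuron can be sketched in a few lines. The base-10 logarithm here is an assumption, chosen because it reproduces the lecture's error figure of 0.0079; most libraries use the natural log, which only rescales the loss:

```python
import math

# Step 1: initial weights and bias from the lecture.
w1, w2, b = 2.0, 3.0, -4.0
# Training example: x1 = 10, x2 = -4, actual output y = 1.
x1, x2, y = 10.0, -4.0, 1.0

# Step 2: forward propagation.
z = w1 * x1 + w2 * x2 + b         # 20 - 12 - 4 = 4
y_hat = 1 / (1 + math.exp(-z))    # sigmoid(4), approximately 0.982

# Step 3: cross-entropy error (base-10 log to match the lecture's 0.0079).
error = -(y * math.log10(y_hat) + (1 - y) * math.log10(1 - y_hat))

print(z, round(y_hat, 3), round(error, 4))  # 4.0 0.982 0.0079
```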
Do not worry if you do not understand this, because this is happening in the background and your software is handling it. It is good to have this intuition if you know a little bit of mathematics.

So let's see how to do backward propagation. We are at the end; we have calculated the error. The first step is finding the slope of the error with respect to the predicted output, that is, y-hat.

This symbol here, dE/dy-hat, simply means that we are finding the instantaneous slope of the error with respect to y-hat, keeping everything else constant. So if you know calculus, you can find the derivative of this function with respect to y-hat, with y equal to 1. This gives us an output of -1/y-hat.

We go further back in the network and we find the slope of our output function with respect to z. The output function is the sigmoid function. The slope of the sigmoid function with respect to z is this value: e^(-z) divided by (1 + e^(-z)) squared. If you know differentiation, you can differentiate this function with respect to z and you will get this value of the slope.

Lastly, we find the differential of z with respect to w1, w2 and b.
So z was equal to w1 times x1 plus w2 times x2 plus b1. So when we find the differential of this with respect to w1, we get x1, which is equal to 10 at this current point. For w2, we get x2, which is equal to -4, and for b, we get a slope of 1.

Next comes the process of combining all of this. We moved back through our network to find all these slopes, but the slopes we are actually interested in are: how does the error function change with respect to w1? How does it change with respect to w2? And how does it change with respect to b?

To find the differential of E with respect to w1, we apply the chain rule, which means that instead of finding it directly, you can find the differential of E with respect to y-hat, multiplied by the differential of y-hat with respect to z, multiplied by the differential of z with respect to w1.

We have calculated all three of these values on our last slide. You can see at the top here that we know the value of y-hat, and we know the value of z for this particular training example. We can put in all these values and calculate this differential, and it comes out to be -0.186.

We can do the same exercise for w2 and b.
The differential of E with respect to w2 comes out to be 0.0746, and the differential of E with respect to b comes out to be -0.0186.

Now, these three differentials are the unit steps that we are going to take in the direction of our descent. These are the delta w1, delta w2 and delta b values. We are going to use these delta values to update our weights and biases, so that we move in a direction where the loss will be less than the loss that we had earlier.

This brings us to the last step. The last step is that we have to update W and B. The new w1 will be the previous w1 minus alpha times delta w1. The previous w1 was 2, alpha we have taken as 5, that is, we have taken a learning rate of 5 here, and we calculated delta w1 as -0.186. This updates our w1 value to 2.93.

Similarly, we calculate the w2 value, and it comes out to be 2.63. And we update the b value, and it is now -3.91.

You can compare the previous and new w1, w2 and b values. Earlier, w1 was 2; now it is 2.93. Earlier, w2 was 3; now it is 2.63. Earlier, b was -4; now it is -3.91.
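The chain-rule calculation and the update step can be sketched together as follows. Note that this sketch assumes the natural-log cross-entropy, so the gradients come out slightly different from the lecture's rounded figures (about -0.18 for w1 rather than -0.186); the mechanics are the same:

```python
import math

w1, w2, b = 2.0, 3.0, -4.0
x1, x2, y = 10.0, -4.0, 1.0
alpha = 5.0                     # learning rate used in the lecture

# Forward pass.
z = w1 * x1 + w2 * x2 + b
y_hat = 1 / (1 + math.exp(-z))

# Backward pass: chain rule dE/dw = dE/dy_hat * dy_hat/dz * dz/dw.
dE_dyhat = -1 / y_hat                               # from E = -log(y_hat), y = 1
dyhat_dz = math.exp(-z) / (1 + math.exp(-z)) ** 2   # slope of the sigmoid
dE_dw1 = dE_dyhat * dyhat_dz * x1                   # dz/dw1 = x1
dE_dw2 = dE_dyhat * dyhat_dz * x2                   # dz/dw2 = x2
dE_db = dE_dyhat * dyhat_dz * 1.0                   # dz/db = 1

# Last step: update the weights and the bias.
w1 -= alpha * dE_dw1
w2 -= alpha * dE_dw2
b -= alpha * dE_db

print(round(w1, 2), round(w2, 2), round(b, 2))
```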
Now, since we have updated our W and B values, we have to go back to step two. We have to iterate: we have to do forward propagation again, and we will calculate the predicted output again.

So this is the training example: x1 is 10, x2 is -4, y is 1. We put these values in with our updated weights and bias. This time, the z value comes out to be about 14.7. When we apply our activation function on this z value, we get the predicted output value, that is, y-hat, as 0.999.

If you remember, last time we got a predicted value of 0.982. So clearly this is an improvement over the last values of W and B.

This process is repeated several times until we get the minimum error.

If we have a lot of neurons in our network, the same process is followed. In forward propagation, we go to the end to find our predicted output value. We use that predicted output value to find the loss. Then we step back through the network, compute these differentials, find the individual differential values of the error function, and then we update our weights and biases so that the final error is reduced.

Again, I will repeat that I understand this lecture was a little mathematics heavy, but if you have some background in calculus, I am sure you will have understood it.
But if you do not have any background in calculus, I understand that you may be facing some difficulty in following all the things that I said. Try listening to this lecture again if you are facing difficulty. If you are still unable to follow the concept here, do not worry. You can still implement a neural network in a software tool. All this mathematical calculation will be done by the software tool, and you do not have to do anything on your own.

That is the beauty of neural networks. If you had to do it by hand, it would take a lot of pain. But with computers, you can have millions of neurons and millions of features, and your computer will still be able to solve it.

So do focus on the practical lecture. That is where you will learn how to implement these neural networks in the software tool.
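As a final sketch, the repeat-until-minimum loop described earlier can be put together for the single-neuron example. Natural-log loss is assumed here; the point is simply that the error keeps falling on each forward-backward pass:

```python
import math

w1, w2, b = 2.0, 3.0, -4.0
x1, x2, y = 10.0, -4.0, 1.0
alpha = 5.0

errors = []
for _ in range(5):
    # Forward propagation.
    z = w1 * x1 + w2 * x2 + b
    y_hat = 1 / (1 + math.exp(-z))
    errors.append(-math.log(y_hat))   # cross-entropy for actual output y = 1

    # Backward propagation: for this loss, dE/dz simplifies to -(1 - y_hat).
    dE_dz = -(1 - y_hat)
    w1 -= alpha * dE_dz * x1
    w2 -= alpha * dE_dz * x2
    b -= alpha * dE_dz

print(errors)  # the error shrinks with every pass
```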