1 00:00:00,266 --> 00:00:02,966 Hello and welcome back to the course on Deep Learning. 2 00:00:02,966 --> 00:00:05,566 This is an additional tutorial 3 00:00:05,566 --> 00:00:08,533 to talk about the softmax and cross entropy functions. 4 00:00:08,533 --> 00:00:09,766 It is not 100 5 00:00:09,766 --> 00:00:12,300 percent necessary in order for you to 6 00:00:12,300 --> 00:00:13,300 go through 7 00:00:13,300 --> 00:00:17,200 all of the parts that we've been through in the main 8 00:00:17,200 --> 00:00:21,033 part of this section, where we're talking about convolutional neural networks. 9 00:00:21,166 --> 00:00:21,900 But at the same time, 10 00:00:21,900 --> 00:00:26,466 I thought it would be a good addition to your bag of knowledge and skill sets. 11 00:00:26,466 --> 00:00:28,800 So let's go ahead and 12 00:00:28,800 --> 00:00:30,666 dig into these functions. 13 00:00:30,666 --> 00:00:32,400 So, to start off with, 14 00:00:32,400 --> 00:00:33,166 what we have here 15 00:00:33,166 --> 00:00:37,800 is the convolutional neural network that we built in the main part of the 16 00:00:37,800 --> 00:00:38,666 section. 17 00:00:38,666 --> 00:00:42,533 And then at the end it pops out some probabilities: 18 00:00:42,766 --> 00:00:47,700 0.95 for a dog and 0.05 for a cat, 19 00:00:47,900 --> 00:00:50,900 given that photo on the left as an input. 20 00:00:50,900 --> 00:00:52,500 This is after the training has been conducted. 21 00:00:52,500 --> 00:00:54,766 This is when it's actually running and it's 22 00:00:55,766 --> 00:00:57,233 classifying a certain image. 23 00:00:57,233 --> 00:00:58,766 And so the question here is: how 24 00:00:58,766 --> 00:01:00,766 come these two values add up to one?
25 00:01:00,766 --> 00:01:03,500 Because as far as we know, from everything that we've learned 26 00:01:03,500 --> 00:01:08,166 about artificial neural networks, there is nothing to say that these two 27 00:01:08,733 --> 00:01:11,533 final neurons are connected to each other. 28 00:01:11,533 --> 00:01:15,066 So how would they know, how would 29 00:01:15,066 --> 00:01:17,233 each one of them know what the value of the other one is, 30 00:01:17,233 --> 00:01:20,200 and how would they know to add their values up to one? 31 00:01:20,200 --> 00:01:22,200 Well, the answer is they wouldn't, 32 00:01:22,200 --> 00:01:25,900 in the classic version of an artificial neural network, 33 00:01:26,033 --> 00:01:28,500 and the only way that they do is because we 34 00:01:28,500 --> 00:01:29,900 introduce a special function 35 00:01:29,900 --> 00:01:33,566 called the softmax function in order to help us out of the situation. 36 00:01:33,800 --> 00:01:37,500 So normally what would happen is the dog and cat 37 00:01:37,500 --> 00:01:40,700 neurons would have any kind of real values; 38 00:01:41,400 --> 00:01:44,633 they don't have to add up to one. 39 00:01:45,033 --> 00:01:46,733 But then we would apply 40 00:01:46,733 --> 00:01:50,466 the softmax function, which is written up over there at the top, 41 00:01:50,766 --> 00:01:54,300 and that would bring these values to be between 0 and 1, 42 00:01:54,300 --> 00:01:56,233 and it would make them add up to one. 43 00:01:56,233 --> 00:01:57,733 And to quote Wikipedia:
44 00:01:59,100 --> 00:02:00,000 the softmax 45 00:02:00,000 --> 00:02:03,166 function, or the normalized exponential function, is a generalization 46 00:02:03,166 --> 00:02:06,866 of the logistic function that, quote unquote, squashes 47 00:02:07,133 --> 00:02:11,166 a K-dimensional vector of arbitrary real values to a K-dimensional 48 00:02:11,166 --> 00:02:15,266 vector of real values in the range of 0 to 1 that add up to one. 49 00:02:15,266 --> 00:02:17,533 So basically it does exactly what we want. 50 00:02:17,533 --> 00:02:18,633 It brings these values 51 00:02:18,633 --> 00:02:22,366 to be between 0 and 1, and makes sure that they add up to one. 52 00:02:22,800 --> 00:02:26,433 And the way it works, the way that this is possible, 53 00:02:26,433 --> 00:02:29,800 is that at the bottom over here, you can see that there's a summation. 54 00:02:29,800 --> 00:02:32,800 So it takes the exponential, 55 00:02:32,800 --> 00:02:36,466 e raised to the power of z, and adds it up: 56 00:02:36,466 --> 00:02:39,500 z1, z2, across all of your classes, all of these values, 57 00:02:39,833 --> 00:02:43,266 and so that's your normalization happening right there. 58 00:02:44,233 --> 00:02:47,300 So that's how the softmax function works. 59 00:02:47,300 --> 00:02:49,066 And it makes sense to 60 00:02:49,066 --> 00:02:53,000 introduce the softmax function into convolutional neural networks, 61 00:02:53,000 --> 00:02:55,566 because how strange would it 62 00:02:55,566 --> 00:02:59,933 be if you had possible classes of a dog and a cat, 63 00:02:59,933 --> 00:03:05,066 and for the dog class you had a probability of 80%, 64 00:03:05,066 --> 00:03:08,066 and for the cat class you had a probability of 45%? 65 00:03:08,300 --> 00:03:11,266 Right? It just doesn't make sense like that. 66 00:03:11,266 --> 00:03:14,700 And therefore it's much better when you introduce a softmax function.
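[Editor's sketch] The softmax description above (exponentiate each score, then divide by the sum of the exponentials over all classes) can be written in a few lines of Python. This is only a minimal illustration, not code from the course: the raw scores 3.0 and 0.0 are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick that the lecture does not mention.

```python
import math

def softmax(z):
    """Map a list of real scores to values in (0, 1) that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)  # the summation in the denominator of the formula
    return [e / total for e in exps]

# Hypothetical raw outputs of the dog and cat neurons (not from the lecture).
probs = softmax([3.0, 0.0])
print(probs, sum(probs))
```

Whatever real values go in, the outputs lie between 0 and 1 and add up to one, which is why the 80% / 45% situation cannot occur once softmax is applied.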
67 00:03:14,700 --> 00:03:16,333 And that's what you will find happening 68 00:03:16,333 --> 00:03:19,333 most of the time in convolutional neural networks. 69 00:03:19,600 --> 00:03:21,500 Now the other thing is that the 70 00:03:21,500 --> 00:03:23,266 softmax function 71 00:03:23,266 --> 00:03:27,000 comes hand in hand with something called the cross entropy function. 72 00:03:27,400 --> 00:03:28,933 And it's a very handy thing for us. 73 00:03:28,933 --> 00:03:30,466 So let's first look at the formula. 74 00:03:30,466 --> 00:03:33,033 This is what the cross entropy function looks like. 75 00:03:33,033 --> 00:03:36,933 We're actually going to be using a different calculation. 76 00:03:36,933 --> 00:03:39,333 I'm going to be using this representation of the cross entropy. 77 00:03:39,333 --> 00:03:40,566 But the results are basically the same. 78 00:03:40,566 --> 00:03:42,433 This is just easier to calculate. 79 00:03:42,433 --> 00:03:44,433 And I know this 80 00:03:44,433 --> 00:03:47,766 might sound very unrelated to anything right now, 81 00:03:47,766 --> 00:03:51,766 just formulas on your screen, but there will be some additional recommended 82 00:03:51,766 --> 00:03:55,433 reading at the end of this section, so don't worry if you're not picking up 83 00:03:55,433 --> 00:03:58,466 on the math, if we haven't explained the math right now. 84 00:03:59,066 --> 00:04:01,700 The point here is: what is the cross entropy? 85 00:04:01,700 --> 00:04:03,566 Well, the cross entropy function: 86 00:04:03,566 --> 00:04:05,400 remember how previously, 87 00:04:05,400 --> 00:04:08,766 in artificial neural networks, we had a function 88 00:04:09,100 --> 00:04:12,366 called the mean squared error function, 89 00:04:12,366 --> 00:04:15,433 which we used as the cost function for 90 00:04:15,566 --> 00:04:17,700 assessing our network performance?
91 00:04:17,700 --> 00:04:23,466 And our goal was to minimize the MSE in order to optimize our network performance. 92 00:04:23,766 --> 00:04:25,366 Well, that was our cost function 93 00:04:25,366 --> 00:04:26,700 then, there. 94 00:04:26,700 --> 00:04:30,500 And in convolutional neural networks, you can 95 00:04:30,500 --> 00:04:30,766 still 96 00:04:30,766 --> 00:04:34,200 use MSE, but a better option in convolutional 97 00:04:34,200 --> 00:04:37,366 neural networks, after you apply the softmax function, 98 00:04:37,500 --> 00:04:39,666 turns out to be the cross entropy function. 99 00:04:39,666 --> 00:04:43,933 And in convolutional neural networks, when you apply the cross 100 00:04:43,933 --> 00:04:44,833 entropy function, it's not 101 00:04:44,833 --> 00:04:46,500 called the cost function anymore. 102 00:04:46,500 --> 00:04:48,166 It's called the loss function. 103 00:04:48,166 --> 00:04:49,366 And they're very similar. 104 00:04:49,366 --> 00:04:52,133 There are just little terminological differences, 105 00:04:52,133 --> 00:04:55,433 and they're a little bit different in what they mean. 106 00:04:55,433 --> 00:04:57,800 But for our purposes, it's pretty much 107 00:04:57,800 --> 00:04:59,533 the same thing. And 108 00:04:59,533 --> 00:05:02,266 what happens is the loss function 109 00:05:02,266 --> 00:05:03,866 is, again, 110 00:05:03,866 --> 00:05:06,266 something that we want to minimize in order 111 00:05:06,266 --> 00:05:09,366 to maximize the performance of our network. 112 00:05:09,533 --> 00:05:10,766 So let's have a look 113 00:05:10,766 --> 00:05:12,266 at a quick example 114 00:05:12,266 --> 00:05:15,166 of how this function can be applied. 115 00:05:15,166 --> 00:05:16,800 So let's say we've 116 00:05:16,800 --> 00:05:19,400 put an image of a dog into our network. 117 00:05:19,400 --> 00:05:24,400 The predicted value for dog is 0.9, and this is during the training.
118 00:05:24,400 --> 00:05:27,166 So we know the label: it is a dog. 119 00:05:27,166 --> 00:05:29,333 So the predicted value for dog is 0.9. 120 00:05:29,333 --> 00:05:32,233 The predicted value for cat is 0.1. 121 00:05:32,233 --> 00:05:34,066 Then here we have the label. So we know 122 00:05:34,066 --> 00:05:37,566 it's a dog because this is training: one for dog, zero for cat. 123 00:05:37,800 --> 00:05:39,533 And so in this case 124 00:05:39,533 --> 00:05:42,200 you need to use, 125 00:05:43,300 --> 00:05:43,900 you need to plug 126 00:05:43,900 --> 00:05:47,300 these numbers into your formula for the cross entropy. 127 00:05:47,666 --> 00:05:51,633 So how you do it is, the values on the left 128 00:05:51,633 --> 00:05:52,766 go into the variable 129 00:05:52,766 --> 00:05:56,700 q, the one that is under the logarithm on the right side, 130 00:05:56,700 --> 00:05:59,366 and the values from the right would go into p. 131 00:05:59,366 --> 00:06:00,966 And so it's important to remember 132 00:06:00,966 --> 00:06:02,100 which one goes where, 133 00:06:02,100 --> 00:06:05,400 because if you get them wrong... you don't want to be taking a logarithm 134 00:06:05,400 --> 00:06:09,466 of a zero value or a logarithm of a one. 135 00:06:09,466 --> 00:06:11,666 So you just want to plug them in, 136 00:06:11,666 --> 00:06:13,633 make sure you plug them into the correct 137 00:06:13,633 --> 00:06:14,700 places. 138 00:06:14,700 --> 00:06:16,933 And then you basically add that up. 139 00:06:16,933 --> 00:06:19,366 So that's how the cross entropy works. 140 00:06:19,366 --> 00:06:21,933 And right now we're just going to look 141 00:06:21,933 --> 00:06:26,633 at a specific step by step example of applying this function in real life. 142 00:06:26,633 --> 00:06:30,200 And it'll kind of make more sense what cross entropy is. 143 00:06:30,200 --> 00:06:32,266 And it'll be less like...
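[Editor's sketch] The plug-in step just described can be tried in Python. The on-screen formula is not part of this transcript, so the sketch assumes the standard form H(p, q) = -sum of p times log(q), with the natural logarithm; the function name cross_entropy is made up for the illustration. The predictions (0.9 dog, 0.1 cat) go into q, and the label (1 dog, 0 cat) goes into p.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * log(q_i)); terms where p_i is 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.9, 0.1]  # predicted values: dog, cat (go under the logarithm)
p = [1.0, 0.0]  # label: dog = 1, cat = 0

loss = cross_entropy(p, q)
print(round(loss, 4))  # 0.1054
```

Swapping the two, so that the label ends up under the logarithm, would ask for the logarithm of zero, which Python's math.log rejects with a ValueError. That is the mix-up the lecture warns about.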
144 00:06:32,266 --> 00:06:36,333 My goal in this tutorial is to make you more comfortable with cross entropy, 145 00:06:36,333 --> 00:06:41,733 because it can sound very convoluted, no pun intended. 146 00:06:42,800 --> 00:06:44,033 It can, like 147 00:06:44,033 --> 00:06:45,566 convolutional neural networks, 148 00:06:45,566 --> 00:06:48,566 it can sound very complex, right? 149 00:06:48,800 --> 00:06:50,700 Scary. But it's not. 150 00:06:50,700 --> 00:06:51,566 That's the point. 151 00:06:51,566 --> 00:06:54,000 So let's go ahead and apply it just so we know that it's not scary. 152 00:06:54,000 --> 00:06:57,200 So here's a neural net. And also this will 153 00:06:57,200 --> 00:06:59,233 explain why we're doing this, 154 00:06:59,233 --> 00:07:01,600 why we're looking at different cost functions. 155 00:07:01,600 --> 00:07:02,900 So, neural network one, 156 00:07:02,900 --> 00:07:05,633 neural network two: let's say we have two neural networks. 157 00:07:05,633 --> 00:07:07,766 And then we pass an image of a dog, 158 00:07:07,766 --> 00:07:11,700 and we know that this is a dog and not a cat. 159 00:07:12,033 --> 00:07:16,833 And then we have another image, this time of another animal, 160 00:07:16,833 --> 00:07:17,833 and it's a cat, not a dog. 161 00:07:17,833 --> 00:07:21,833 And here we have a weird looking animal, which is in fact a dog, 162 00:07:21,866 --> 00:07:23,900 not a cat, if you look very closely. 163 00:07:23,900 --> 00:07:28,300 So we want to see what our neural networks will predict. In the first case, 164 00:07:28,333 --> 00:07:31,066 neural network 1: 90% dog, 165 00:07:31,066 --> 00:07:33,200 10% cat. Correct. 166 00:07:33,200 --> 00:07:36,433 Neural network number 2: 60% dog, 40% cat. 167 00:07:36,600 --> 00:07:38,800 Still correct. Worse, but correct. 168 00:07:40,133 --> 00:07:41,800 Second option: 169 00:07:41,800 --> 00:07:44,533 first neural network, 10% dog, 170 00:07:44,533 --> 00:07:47,233 90% cat. Correct.
171 00:07:47,233 --> 00:07:49,100 Neural network number 2: 30% 172 00:07:49,100 --> 00:07:51,400 dog, 70% cat. 173 00:07:51,400 --> 00:07:53,400 Worse, but still correct. 174 00:07:53,400 --> 00:07:55,266 And then finally, 175 00:07:55,266 --> 00:08:00,266 in image three, neural network one: 40% dog, 60% cat. 176 00:08:00,566 --> 00:08:01,766 Incorrect. 177 00:08:01,766 --> 00:08:04,100 Neural network number 2: 10% 178 00:08:04,100 --> 00:08:06,966 dog, 90% cat. Incorrect, 179 00:08:06,966 --> 00:08:08,100 and worse. 180 00:08:08,100 --> 00:08:10,633 So the key here is that even though 181 00:08:10,633 --> 00:08:12,900 both networks got it wrong in the last one, 182 00:08:12,900 --> 00:08:15,800 throughout all three images, neural 183 00:08:15,800 --> 00:08:18,800 network one was outperforming neural network two. 184 00:08:18,800 --> 00:08:22,533 So even in the last case, 185 00:08:23,200 --> 00:08:27,300 it gave dog like a 40% chance, as opposed to neural network 186 00:08:27,300 --> 00:08:29,033 two, which only gave dog a 10% chance. 187 00:08:29,033 --> 00:08:32,266 So neural network one is outperforming across the board 188 00:08:32,700 --> 00:08:35,466 when compared to neural network two. 189 00:08:35,466 --> 00:08:37,633 And so now we're going to look at 190 00:08:37,633 --> 00:08:39,266 the functions 191 00:08:39,266 --> 00:08:42,566 that can measure performance, which we've kind of talked about already. 192 00:08:42,900 --> 00:08:44,733 So let's put these into a table. 193 00:08:44,733 --> 00:08:46,300 So this is neural network one. 194 00:08:46,300 --> 00:08:48,200 You have the row number, 195 00:08:48,200 --> 00:08:49,400 so that's the image number. 196 00:08:49,400 --> 00:08:53,700 And then for image one you have what it predicted: 90% dog, 10% cat. 197 00:08:53,700 --> 00:08:55,433 So those are the hat variables. 198 00:08:55,433 --> 00:08:57,266 And then you have the actual values.
199 00:08:57,266 --> 00:09:00,266 So dog, correct; cat, incorrect. 200 00:09:00,366 --> 00:09:04,633 Same thing for image number two and same thing for image number three. 201 00:09:05,100 --> 00:09:07,600 And same for neural network number two. 202 00:09:07,600 --> 00:09:10,966 So dog 60%, cat 40% in the first image. 203 00:09:10,966 --> 00:09:12,033 That's what it predicted. 204 00:09:12,033 --> 00:09:15,033 The correct answer is a dog, not a cat. And so on. 205 00:09:15,033 --> 00:09:17,233 And so now let's see what errors we 206 00:09:17,233 --> 00:09:17,966 can actually get, 207 00:09:17,966 --> 00:09:21,333 so what errors we can calculate to estimate 208 00:09:21,333 --> 00:09:24,333 the performance and monitor the performance of our networks. 209 00:09:24,800 --> 00:09:26,966 So one type of error is 210 00:09:26,966 --> 00:09:28,500 called the classification error. 211 00:09:28,500 --> 00:09:30,966 And that is basically just 212 00:09:30,966 --> 00:09:33,900 asking: did you get it right or not? 213 00:09:33,900 --> 00:09:37,166 Regardless of the probabilities, it's just: did you get it right or did 214 00:09:37,166 --> 00:09:37,833 you not get it right? 215 00:09:37,833 --> 00:09:40,400 So in both cases, for both 216 00:09:40,400 --> 00:09:41,500 neural networks, 217 00:09:41,500 --> 00:09:46,200 each of them got one... so this is how many they got wrong. 218 00:09:46,200 --> 00:09:48,366 So they got one out of three wrong. 219 00:09:48,366 --> 00:09:51,933 So a 33% error rate for neural network 220 00:09:51,933 --> 00:09:54,933 one and a 33% error rate for neural network two. 221 00:09:54,933 --> 00:09:56,866 And so basically, from this standpoint, 222 00:09:56,866 --> 00:09:59,066 both neural networks perform at the same level. 223 00:09:59,066 --> 00:10:00,066 But we know that's not true. 224 00:10:00,066 --> 00:10:03,966 We know that neural network one is outperforming neural network two.
225 00:10:05,000 --> 00:10:07,833 That's why a classification error is not a good 226 00:10:07,833 --> 00:10:10,833 measure, especially for the purposes of backpropagation. 227 00:10:11,500 --> 00:10:13,366 Mean squared error is 228 00:10:13,366 --> 00:10:13,700 different. 229 00:10:13,700 --> 00:10:16,700 And by the way, I did these calculations in Excel. 230 00:10:16,833 --> 00:10:18,333 I just didn't want to bore you with them, 231 00:10:18,333 --> 00:10:21,900 but you can totally just sit down and do them on paper or in Excel. 232 00:10:21,900 --> 00:10:23,600 These are very straightforward calculations. 233 00:10:23,600 --> 00:10:28,000 You just basically take the sum of squared errors 234 00:10:28,000 --> 00:10:32,800 and then just take the average across your observations. 235 00:10:32,800 --> 00:10:33,966 And that's pretty much it. 236 00:10:33,966 --> 00:10:38,700 So for neural network one you get 25%, 237 00:10:38,966 --> 00:10:43,233 and for neural network two you get a 71% error rate. 238 00:10:43,233 --> 00:10:45,866 So as you can see, this one is more accurate. 239 00:10:45,866 --> 00:10:48,866 It's telling us that neural network one has a much lower error 240 00:10:48,866 --> 00:10:50,000 rate than neural network two. 241 00:10:51,000 --> 00:10:52,866 And then cross entropy. Again, 242 00:10:52,866 --> 00:10:54,866 we've seen the formula. You can also calculate this. 243 00:10:54,866 --> 00:10:56,633 This is actually even easier to calculate 244 00:10:56,633 --> 00:10:57,966 than the mean squared error. 245 00:10:57,966 --> 00:11:02,200 Cross entropy gives you 0.38 for neural network one 246 00:11:02,400 --> 00:11:05,366 and 1.06 for neural network two. 247 00:11:05,366 --> 00:11:08,133 So you can see the results are a bit different.
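[Editor's sketch] The three error measures can be reproduced from the predictions read out earlier (network one: 90/10, 10/90, 40/60; network two: 60/40, 30/70, 10/90; true labels dog, cat, dog). The lecture did these in Excel and does not show the working, so the conventions here are assumptions picked because they match the quoted figures: squared errors are summed over both classes and averaged over the three images, and cross entropy uses the natural logarithm, also averaged over the images. All function names are made up for the illustration.

```python
import math

# Each row: (predicted [dog, cat], actual [dog, cat]) for one image.
nn1 = [([0.9, 0.1], [1, 0]), ([0.1, 0.9], [0, 1]), ([0.4, 0.6], [1, 0])]
nn2 = [([0.6, 0.4], [1, 0]), ([0.3, 0.7], [0, 1]), ([0.1, 0.9], [1, 0])]

def classification_error(rows):
    # Fraction of images where the most probable class is not the true class.
    wrong = sum(1 for pred, actual in rows
                if pred.index(max(pred)) != actual.index(max(actual)))
    return wrong / len(rows)

def mean_squared_error(rows):
    # Sum of squared errors per image, averaged over the images.
    return sum(sum((a - p) ** 2 for p, a in zip(pred, actual))
               for pred, actual in rows) / len(rows)

def mean_cross_entropy(rows):
    # -sum(actual * ln(predicted)) per image, averaged over the images.
    return sum(-sum(a * math.log(p) for p, a in zip(pred, actual) if a > 0)
               for pred, actual in rows) / len(rows)

for name, rows in (("NN1", nn1), ("NN2", nn2)):
    print(name,
          round(classification_error(rows), 2),
          round(mean_squared_error(rows), 2),
          round(mean_cross_entropy(rows), 2))
# NN1 0.33 0.25 0.38
# NN2 0.33 0.71 1.06
```

Classification error cannot tell the two networks apart, while mean squared error and cross entropy both rank network one ahead.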
248 00:11:08,133 --> 00:11:11,133 When you look at them like that, when you look at 249 00:11:11,600 --> 00:11:17,200 the mean squared error and cross entropy, the question of 250 00:11:17,200 --> 00:11:20,900 why you would use cross entropy over 251 00:11:21,600 --> 00:11:25,733 mean squared error isn't just about 252 00:11:25,733 --> 00:11:28,600 the numbers that they spit out. These calculations 253 00:11:28,600 --> 00:11:30,633 were just to show you that this 254 00:11:30,633 --> 00:11:33,600 is all doable. You can just do it on paper. 255 00:11:33,600 --> 00:11:37,800 These are not very intense mathematics. 256 00:11:37,800 --> 00:11:38,366 These are 257 00:11:38,366 --> 00:11:41,100 pretty simple, straightforward things. 258 00:11:41,100 --> 00:11:44,466 But the question of why you would use cross 259 00:11:44,466 --> 00:11:46,166 entropy over mean squared error, 260 00:11:46,166 --> 00:11:48,133 it's a very, very good question to ask. 261 00:11:48,133 --> 00:11:49,166 I'm glad you asked it. 262 00:11:49,166 --> 00:11:52,166 The answer to that is that 263 00:11:52,166 --> 00:11:57,066 there are several advantages of 264 00:11:57,066 --> 00:12:01,300 cross entropy over mean squared error, which are not obvious. 265 00:12:01,300 --> 00:12:05,366 And so I'll mention a couple, but then 266 00:12:05,366 --> 00:12:07,066 I'll let you know where you can find out more. 267 00:12:07,066 --> 00:12:13,533 So one of them is this: say, for instance, you're at the very start 268 00:12:13,533 --> 00:12:16,600 of your backpropagation, and 269 00:12:16,900 --> 00:12:21,133 your output value is very, very, very, very tiny. 270 00:12:21,133 --> 00:12:22,233 Very tiny. 271 00:12:22,233 --> 00:12:23,533 So it's much smaller 272 00:12:23,533 --> 00:12:25,566 than the actual value that you want.
273 00:12:25,566 --> 00:12:28,400 Then at the very start, the gradient 274 00:12:28,400 --> 00:12:31,266 in your gradient descent will be very, very low. 275 00:12:31,266 --> 00:12:35,433 And it won't be enough; it'll be very hard for 276 00:12:35,466 --> 00:12:38,933 the neural network to actually start doing 277 00:12:38,933 --> 00:12:41,700 something and start moving around and start adjusting those weights, 278 00:12:41,700 --> 00:12:43,666 so that you start actually 279 00:12:43,666 --> 00:12:45,000 moving in the right direction. 280 00:12:45,000 --> 00:12:46,966 Whereas when you use something like the 281 00:12:46,966 --> 00:12:50,133 cross entropy, because it's got that logarithm in it, it 282 00:12:50,133 --> 00:12:54,533 actually helps the network assess even 283 00:12:54,533 --> 00:12:57,433 a small error like that and do something about it. 284 00:12:57,433 --> 00:12:58,433 Here's how to think about it. 285 00:12:58,433 --> 00:13:03,166 And again, this is a very intuitive approach. 286 00:13:03,166 --> 00:13:04,833 There's going to be 287 00:13:04,833 --> 00:13:07,833 a link to the mathematics, and you can derive these things 288 00:13:07,833 --> 00:13:09,400 through the mathematics in more detail. 289 00:13:09,400 --> 00:13:12,400 But as a very intuitive approach, let's say 290 00:13:12,633 --> 00:13:13,366 your, 291 00:13:14,400 --> 00:13:14,833 like, your 292 00:13:14,833 --> 00:13:17,566 outcome that you want is one. 293 00:13:17,566 --> 00:13:22,666 And right now you are at one millionth of one. 294 00:13:22,666 --> 00:13:25,166 Right? So 0.000001. 295 00:13:25,166 --> 00:13:28,000 And then next 296 00:13:28,000 --> 00:13:32,400 time you improve your outcome from one millionth to one thousandth. 297 00:13:32,700 --> 00:13:37,566 And if you calculate the squared error, you're just
298 00:13:37,566 --> 00:13:40,800 subtracting one from the other; basically, in each case 299 00:13:40,800 --> 00:13:43,800 you're calculating the squared error, and you'll see that the squared error, 300 00:13:43,800 --> 00:13:48,033 when you compare one case versus the other, didn't change that much. 301 00:13:48,033 --> 00:13:49,266 You didn't improve your 302 00:13:49,266 --> 00:13:51,966 network that much when you're looking at the mean squared error. 303 00:13:51,966 --> 00:13:55,233 But if you're looking at cross entropy, 304 00:13:55,233 --> 00:13:58,800 because you're taking a logarithm and then you're comparing the 305 00:13:58,800 --> 00:14:02,433 two, dividing one by the other, you will see 306 00:14:02,433 --> 00:14:06,066 that you have actually improved your network significantly. 307 00:14:06,066 --> 00:14:10,966 So that jump from one millionth to one thousandth, in mean 308 00:14:10,966 --> 00:14:15,233 squared error terms, will be very low. It will be insignificant, and it won't, 309 00:14:15,733 --> 00:14:18,300 it won't guide your gradient 310 00:14:18,300 --> 00:14:21,966 descent process, or your backpropagation, in the right direction. 311 00:14:21,966 --> 00:14:24,133 Well, it will guide it in the right direction, 312 00:14:24,133 --> 00:14:26,666 but it'll be like a very slow guidance. 313 00:14:26,666 --> 00:14:29,466 It won't have enough power. 314 00:14:29,466 --> 00:14:30,066 Whereas 315 00:14:30,066 --> 00:14:32,933 if you do it through cross entropy, cross entropy will 316 00:14:32,933 --> 00:14:35,400 understand that, oh, even though these are very small 317 00:14:35,400 --> 00:14:38,400 adjustments that are just, you know, making 318 00:14:38,400 --> 00:14:43,500 a tiny change in absolute terms, in relative terms it's a huge improvement. 319 00:14:43,733 --> 00:14:46,000 And we are definitely going in the right direction. 320 00:14:46,000 --> 00:14:47,133 Let's keep going that way.
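[Editor's sketch] The one-millionth to one-thousandth example can be checked with a few lines of Python. This only compares the two loss values for a single output whose target is one; it assumes the natural logarithm and is not a derivation of the gradients, which the recommended reading covers.

```python
import math

target = 1.0
before, after = 1e-6, 1e-3  # output before and after the improvement

# Squared error barely moves: both values are almost exactly 1.
se_before = (target - before) ** 2
se_after = (target - after) ** 2

# Cross entropy for a target of 1 reduces to -ln(output).
ce_before = -math.log(before)
ce_after = -math.log(after)

print(f"squared error: {se_before:.6f} -> {se_after:.6f}")
print(f"cross entropy: {ce_before:.2f} -> {ce_after:.2f}")
# squared error: 0.999998 -> 0.998001
# cross entropy: 13.82 -> 6.91
```

The squared error drops by about 0.2 percent, while the cross entropy is cut in half, which is the "tiny in absolute terms, huge in relative terms" point made above.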
321 00:14:47,133 --> 00:14:50,700 So cross entropy will help your neural network 322 00:14:52,666 --> 00:14:55,800 get to the optimal state. 323 00:14:56,700 --> 00:15:01,000 It's a better way for the neural network to get to an optimal state. 324 00:15:01,000 --> 00:15:02,100 But bear in mind 325 00:15:02,100 --> 00:15:06,433 that cross entropy is the preferred 326 00:15:06,533 --> 00:15:08,166 method only for classification. 327 00:15:08,166 --> 00:15:09,133 So if 328 00:15:09,133 --> 00:15:11,266 you're talking about things like regression, 329 00:15:11,266 --> 00:15:13,800 which we had in artificial neural networks, 330 00:15:13,800 --> 00:15:15,733 then you would rather 331 00:15:15,733 --> 00:15:17,400 go with mean squared error, 332 00:15:17,400 --> 00:15:18,000 whereas cross 333 00:15:18,000 --> 00:15:20,533 entropy is better for classification. 334 00:15:20,533 --> 00:15:23,600 And again, it has to do with the fact that we're using the softmax function. 335 00:15:23,600 --> 00:15:26,600 So that's a kind of intuitive explanation of that. 336 00:15:26,900 --> 00:15:29,533 A good place to learn a bit more about that, if you're 337 00:15:29,533 --> 00:15:33,466 really interested in why we are using cross entropy versus 338 00:15:33,466 --> 00:15:34,233 mean squared error: 339 00:15:34,233 --> 00:15:38,266 Google a video by Geoffrey Hinton called 340 00:15:38,266 --> 00:15:40,533 The Softmax Output Function. 341 00:15:40,533 --> 00:15:42,800 He explains it very well. 342 00:15:42,800 --> 00:15:43,666 And, you know, being 343 00:15:43,666 --> 00:15:47,600 the godfather of deep learning, who can explain it better anyway? 344 00:15:47,900 --> 00:15:50,033 And by the way, any video 345 00:15:50,033 --> 00:15:51,600 by Geoffrey Hinton is golden. 346 00:15:51,600 --> 00:15:54,000 He's just got a huge talent for explaining things.
347 00:15:55,166 --> 00:15:57,233 Anyway, so that's 348 00:15:57,233 --> 00:15:58,533 softmax and cross entropy. 349 00:15:58,533 --> 00:16:00,766 I hope that gives you kind of an intuitive 350 00:16:00,766 --> 00:16:02,200 understanding of what's going on here, but 351 00:16:02,200 --> 00:16:06,300 more importantly, that you're not put off by the term cross entropy, 352 00:16:06,400 --> 00:16:08,966 because Hadelin will mention it in the practical tutorials, 353 00:16:08,966 --> 00:16:11,133 and I wanted to make sure that you're prepared for that. 354 00:16:11,133 --> 00:16:12,866 And it's just another 355 00:16:12,866 --> 00:16:16,266 way of calculating your loss function and another way 356 00:16:16,266 --> 00:16:19,733 of optimizing your network, which is specifically tailored to 357 00:16:20,266 --> 00:16:23,533 classification problems and therefore convolutional neural 358 00:16:23,533 --> 00:16:27,533 networks, and comes hand in hand with the softmax function. 359 00:16:28,133 --> 00:16:31,700 So, additional reading: if you'd like a light introduction 360 00:16:31,700 --> 00:16:35,233 into cross entropy, if you're interested 361 00:16:35,233 --> 00:16:36,366 in cross entropy a bit more, 362 00:16:36,366 --> 00:16:40,266 a good article to check out is called A Friendly Introduction 363 00:16:40,266 --> 00:16:45,266 to Cross Entropy Loss by Rob DiPietro, 2016. 364 00:16:45,266 --> 00:16:47,033 Here's the link below. 365 00:16:47,033 --> 00:16:48,100 Very, very nice. 366 00:16:48,100 --> 00:16:50,400 Very soft. 367 00:16:50,400 --> 00:16:52,000 No 368 00:16:52,000 --> 00:16:53,933 super complex math. 369 00:16:53,933 --> 00:16:56,100 Good analogies, good examples. 370 00:16:56,100 --> 00:16:57,433 He uses analogies of cars. 371 00:16:57,433 --> 00:17:00,066 You look at cars, and he talks about information and bits 372 00:17:00,066 --> 00:17:03,233 and restrictions, and, you know, how would you encode this?
373 00:17:03,233 --> 00:17:05,300 How would you encode that? It's a good article to 374 00:17:05,300 --> 00:17:05,766 have a look at, 375 00:17:05,766 --> 00:17:08,766 and it will give you a good overview of cross entropy 376 00:17:09,000 --> 00:17:11,766 from an introductory standpoint. 377 00:17:11,766 --> 00:17:12,800 If you want to dig 378 00:17:12,800 --> 00:17:17,466 into the heavy math, like what you see here, then check out an article, 379 00:17:17,500 --> 00:17:22,466 or a blog post, called How to Implement a Neural Network, Intermezzo 2. 380 00:17:22,466 --> 00:17:25,600 So an intermezzo is like an intermediate thing, like an 381 00:17:26,800 --> 00:17:28,333 interim, an intermission, 382 00:17:28,333 --> 00:17:32,066 you know, like when you go to a theater and you have a break 383 00:17:32,633 --> 00:17:35,966 between the first part and the second part. 384 00:17:36,133 --> 00:17:38,933 So he's going through all these steps, 385 00:17:38,933 --> 00:17:41,733 and then he says, I've got to explain this first. 386 00:17:41,733 --> 00:17:44,000 And yeah, so that's why it's called Intermezzo. 387 00:17:44,000 --> 00:17:46,733 That's the reason, as far as I understand. The 388 00:17:46,733 --> 00:17:49,133 article is by Peter Roelants, 389 00:17:49,133 --> 00:17:50,633 2016 as well. 390 00:17:50,633 --> 00:17:53,633 So both are quite recent. And yeah, 391 00:17:53,700 --> 00:17:57,433 check this out if you would like to dig into the mathematics behind 392 00:17:57,833 --> 00:18:02,200 softmax and cross entropy; it's all in this article. 393 00:18:02,733 --> 00:18:03,700 So there we go. 394 00:18:03,700 --> 00:18:07,200 That's all there is to these two. 395 00:18:07,200 --> 00:18:11,966 Hopefully I was able to add some additional clarity, and good luck 396 00:18:11,966 --> 00:18:12,633 with that.
397 00:18:12,633 --> 00:18:16,833 It's going to be fun, and enjoy the practical tutorials. 398 00:18:16,833 --> 00:18:19,633 I'll see you next time. Until then, enjoy deep learning.