In this lesson we're going to talk about compiling our model. Remember how we talked about the TensorFlow three-step process for working with models in the last lesson? We've defined our model. This meant specifying the number of layers, the number of neurons, and the type of activation functions inside those neurons. In other words, we laid out the structure of our neural network.

Now it's time for step 2, and in step 2 we compile the model. This means telling TensorFlow about the kind of calculations it will have to do down the line. Why do we have to do this? We have to do this because TensorFlow needs to create its graph behind the scenes. We briefly touched upon the graph in the previous module, where we were working with a pre-trained model. The graph is important because TensorFlow needs to know how to organize its calculations.

What kind of calculations am I talking about? Well, TensorFlow needs to calculate the loss, for example, right? It needs to calculate how far away the model's prediction was from the true value. TensorFlow will also need to update the weights as the model is being trained. And there might be all sorts of other calculations that you're doing along the way. For example, you might want to track the accuracy of the model during the training process; that way you can see how the accuracy improves over time. All of these calculations are things that we have to tell TensorFlow about beforehand, and this is what it means to compile a model using Keras.

One of the most important kinds of calculations that TensorFlow needs to know about is the kind of loss it's calculating. When we are compiling our model, we have to specify which loss, or cost, function we want to use. In our modules on regression and on gradient descent, we've already talked in detail about one particular kind of loss function, namely the mean squared error. It had a formula that looked something like this, and we even plotted our loss on a chart along with the steps that we took to minimize the loss. However, the mean squared error is not the only loss function out there. Just like there were multiple types of activation functions, there are also different kinds of loss functions for different kinds of problems. In this case we're not doing a regression. We're doing a classification with multiple classes, and we're getting a probability for each class.
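For reference, the mean squared error formula mentioned a moment ago can be written in standard notation as follows. The lecture's slide isn't reproduced in this transcript, so take this as the usual textbook definition, assuming n training examples with true values y_i and predictions ŷ_i:

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]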
Remember that softmax function in our output layer? That's where that per-class probability is coming from. The loss function that best matches this kind of task is called categorical cross-entropy. The cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.

This is probably best illustrated with a very quick example. Let me show you what the cross-entropy loss looks like so that we can get a better feel for it. Imagine we've got this neural network and we have a picture of a cat. We want to know if this picture contains a cat or if it does not contain a cat. The loss in this case can be graphed: on the y-axis we've got the cost, or the loss, and on the x-axis we've got the predicted probability. That's going to be between 0 and 1, anywhere from the model predicting a 0 percent probability of there being a cat in the picture to a 100 percent probability.

Now, since we gave the model a picture of a cat, we know that the true value is equal to 1. Our y has the value of 1. The categorical cross-entropy loss, if we were to plot it on this chart, would look something like this. What we see is that the cross-entropy loss increases as the predicted probability diverges from the actual label. In other words, the left-hand side of this chart is where the model is predicting an almost zero percent probability of there being a cat in the picture, so this is a long way from the true value. If our model predicts there's only a 1 percent chance of there being a cat in the picture and the actual label is equal to 1, then this would be a very bad result and would lead to a very high loss value. On the other hand, if the model gets it right, the ideal loss should be equal to zero. This is where the model predicts a 100 percent probability that there is a cat in the picture, and there is indeed a cat. This is why, in this chart, the loss slowly decreases as the predicted probability approaches 1.

So what this function is telling us is that the categorical cross-entropy loss really penalizes predictions that are both confident and wrong. This is how you can understand this loss function. Now let me show you the formula for cross-entropy. It's a sum, across all the categories, of the actual value for the label.
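Written out, the formula being described here is the standard categorical cross-entropy. The slide itself isn't reproduced in the transcript, so this is the usual definition; note the minus sign out front, which is part of the standard formula and is what makes the loss come out non-negative, even though it isn't called out explicitly in the narration:

\[
L = -\sum_{c=1}^{C} y_c \, \log\!\left( \hat{y}_c \right)
\]

Here C is the number of categories, y_c is 1 for the true category and 0 for every other category, and ŷ_c is the model's predicted probability for category c.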
This is the y, which will be either 1 or 0, and this y is multiplied by the log of the predicted probability, our y-hat. We have a sum in this formula because we've got more than one category. If there are 10 categories, then we're summing 10 different terms, and if there are two categories, then we're only summing two terms.

So say the actual value is equal to 1, like we've got in this example: we've got a picture of a cat, and the model predicted a probability of 1 of there being a cat in the picture. What would the loss be in this case? Looking at this formula, y is equal to 1, so it's 1 times the log of 1, and because y-hat in this case is also equal to 1, the log of 1 is equal to zero. But then we've got the other category, right? No cat. For that category the true label y is equal to zero, because there is indeed no cat, so that whole term contributes nothing to the sum. The total loss therefore comes out to zero, which is exactly what we want for a perfect prediction. This is how you would substitute the values into this formula. So even though the name of this loss function looks very long and very scary, the actual formula isn't so complicated.

Now that we've talked about the loss function and decided on one, it's time to figure out how the weights are actually adjusted, how we move down the loss function to minimize the loss as the algorithm is training on the data. In the module on gradient descent, we saw how the gradient descent algorithm adjusts the weights and minimizes the loss. Now, gradient descent is not bad, but there has been a lot of research into how this algorithm could be improved or made more efficient, and how its shortcomings could be addressed. What all these researchers were looking for was essentially a better way to optimize the cost, and the result of this research is that we now have a variety of optimizers to choose from. We are spoiled for choice once again.

So what do I mean by optimizer? Well, you can think of an optimizer as an algorithm that calculates the loss and adjusts the weights. One of the questions you might ask at this point is: what kind of shortcomings does gradient descent have that incentivized all these researchers to come up with a better solution? Well, speed, for example, would be one criterion to look at, especially if you've got lots and lots of training data and a big, complex model.
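To make the idea of "an algorithm that adjusts the weights" a bit more concrete, here is a minimal sketch of a single plain gradient-descent step written with TensorFlow's GradientTape. This is my own toy illustration on random data, not code from the course, and it is the kind of manual update that the optimizers discussed next automate and improve on:

```python
import tensorflow as tf

# Toy data: 4 samples with 3 features each, and a regression target.
X = tf.random.normal((4, 3))
y_true = tf.random.normal((4, 1))

# Parameters of a single linear layer: weights and a bias.
w = tf.Variable(tf.random.normal((3, 1)))
b = tf.Variable(tf.zeros((1,)))

learning_rate = 0.01

# One plain gradient-descent step.
with tf.GradientTape() as tape:
    y_pred = tf.matmul(X, w) + b
    loss = tf.reduce_mean(tf.square(y_true - y_pred))  # mean squared error

grad_w, grad_b = tape.gradient(loss, [w, b])
w.assign_sub(learning_rate * grad_w)  # w <- w - lr * dLoss/dw
b.assign_sub(learning_rate * grad_b)  # b <- b - lr * dLoss/db
```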
With lots of data and a big, complex model, the optimizer can make a huge difference, and it might be the difference between waiting a few hours for training to finish or waiting a couple of days. To look at our menu of optimizers, we can once again go to the Keras documentation, scroll down, and take a look at what we've got available to us. There we can see we've got stochastic gradient descent (SGD), RMSprop, Adagrad, Adadelta, and more. These are all rather mysterious-sounding names, and there are definitely quite a few different optimizers out there. So which one do you choose?

Well, I can tell you this: an incredibly popular, state-of-the-art optimizer is called Adam, and Adam is also the optimizer that we will use as part of this course. You see, Adam was presented in a research paper back in 2015, and the reason I really like the Adam optimizer is that it is very computationally efficient and has very low memory requirements. This means that even a less powerful computer can train a model without waiting too long. Another really nice feature of Adam is that it requires very, very little configuration, or, to quote the late Steve Jobs, it just works.

Now that we've talked about our loss function and our optimizer, let's head back into the Jupyter notebook and compile our model. I'm going to do this in the cell right here, just below the code where we defined our model and laid out its structure. The way we compile our model using Keras is by typing the name of the model, so model_1, then a dot, and then compile. When we open the parentheses, we have to specify three things: the optimizer we want to use, the loss function we want to use, and which metrics to calculate.

For the optimizer, we said we would use Adam, so that's optimizer equals, in single quotes, adam. For the loss, we're going to use the categorical cross-entropy. Keras actually gives us a variation of the categorical cross-entropy called sparse_categorical_crossentropy, which works directly with integer class labels and is slightly more computationally efficient; it otherwise behaves pretty much the same way as the regular categorical cross-entropy, so that's the one we'll use. Finally, we're going to specify the metrics that we want Keras to calculate as we're training our model, so metrics is going to be equal to square brackets, single quotes, accuracy. And that's it: we've compiled our model in a single line of code.
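Here is a rough, self-contained sketch of what that might look like in the notebook. The variable name model_1 and the layer sizes match how the lecture describes the model, but the hidden-layer activations and the flattened 32 x 32 x 3 input shape are my assumptions, inferred from the parameter counts discussed later in this lesson, not code copied from the course:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed architecture: flattened 32x32x3 inputs, hidden layers of
# 128, 64 and 16 units, and a 10-class softmax output layer.
model_1 = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(32 * 32 * 3,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(10, activation='softmax'),
])

# The compile step: optimizer, loss function, and the metrics to track.
model_1.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```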
I'm going to split this up a little bit so it's a bit easier to read, and just make sure there really are no typos in this line. But of course the proof is in the pudding, so let me hit Shift+Enter and see if I've made an error. Looks like we're all good.

So now that we've outlined the structure of our model and compiled it successfully, let's actually take a quick look at it in our Jupyter notebook. There's a very neat little function from Keras called summary, so model_1.summary() and Shift+Enter will show us what our model actually looks like. What we see here are three columns. We see the layers that we've got: one, two, three, four layers. We also see the output shape of each layer; this should correspond to the number of units you outlined when you were defining your model. So our first layer has 128 neurons, the second layer 64, the third 16, and the last layer has 10 neurons, 10 different outputs. And over here we see the number of parameters in each layer and the total number of parameters. In this case we've got about 400,000 parameters.

Now, one thing that might seem a little strange is the names of these layers, right? dense_5, dense_6. This has to do with the fact that if I go to the cell here, hit Shift+Enter again, and then refresh my summary, the names of the layers change. They update because they're auto-generated. Given that this is a little bit confusing, one of the things we can do is give these layers a name. So we can come back up to where we defined our model and say name equals, in single quotes, something like m1_hidden_1, and I'm going to give my other layers a name as well: m1_hidden_2, m1_hidden_3, and m1_output. If I hit Shift+Enter on that cell again and come down to my summary and refresh it, you'll see that these names are now updated to the names we specified, as in the sketch below. So that's quite handy, right?

But there's one other big conceptual topic that I want to talk about now, and it has to do with the structure of our neural network, because so far we've looked at it in a slightly simplistic way, and now is a good chance to appreciate it more fully. Let's work out where this total number of parameters actually comes from.
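Before we do that arithmetic, here for reference is a sketch of the model definition with named layers and the summary call. The exact names (m1_hidden_1 and so on) are my reading of what is typed in the video, so treat them as placeholders you can change:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Same assumed architecture as before, but each layer now has an explicit
# name, so model_1.summary() stops showing auto-generated names like dense_5.
model_1 = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(32 * 32 * 3,),
                 name='m1_hidden_1'),
    layers.Dense(64, activation='relu', name='m1_hidden_2'),
    layers.Dense(16, activation='relu', name='m1_hidden_3'),
    layers.Dense(10, activation='softmax', name='m1_output'),
])

model_1.summary()  # prints layer names, output shapes and parameter counts
```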
Why is that total so high, and can we calculate it manually, in order to better understand all the things that have to be estimated for our neural network? In our introductory lectures, we talked a little bit about how important the connection weights were for the neurons, and about how the number of connections really grew with the size and complexity of the model. This in turn had an impact on how much data was required to accurately estimate all of these parameters. The very simple model on that slide had about 90 different connections between the neurons. But just because there are 90 different connections, 90 different weights to estimate, doesn't mean that's the total number of parameters. You see, individual neurons don't just have weights; individual neurons also have a bias.

Now, at this point I'd like you to make another mental leap. The key thing when talking about neurons in a neural network is actually their activation functions. The activation function determines how strongly a neuron will fire; as a matter of fact, a neuron kind of is its activation function. So when we talk about learning and changing the connection weights, what we're actually doing is changing the activation function. For example, an activation function might become steeper or flatter. This is what changing the weights in the learning step actually translates into: it affects the shape of the activation function, and as the activation function changes, it of course affects how strongly that neuron fires. So changing the weights makes an activation function steeper or flatter. What about shifting the activation function from, say, left to right? Well, that's what the bias does. The bias is what can shift the entire curve.

So with these two things, the weights and the bias, an individual activation function can be manipulated. It can be stretched, made steeper or flatter, moved up or down, or moved left or right. All of these are just parameters that the neural network can update as it learns. So let's come back into the Jupyter notebook and make sense of the total number of parameters in this summary, working it out for each individual layer. That very first layer has 393,344 different parameters. Why is that? Well, the size of our inputs, we said, was 32 times 32 times 3, right?
And then we had 128 neurons in this layer, so all of that times 128. That gives us the number of connection weights, which is equal to 393,216. So what's missing? Well, it's our bias parameters. You see, each neuron doesn't just have weights; it also has a bias. So if we add 128 to that total, one bias per neuron, then we arrive at 393,344.

To arrive at the grand total, we have to do this calculation for the three remaining layers. In other words, layer number two is going to be 128 inputs times 64 neurons, plus another 64 bias terms, one for each neuron in that second hidden layer. The third hidden layer is 64 inputs times 16 neurons, plus 16 bias terms, one for each neuron. And finally, the output layer takes 16 inputs from the third hidden layer, times 10 neurons, plus 10 bias terms. Hitting Shift+Enter on the cell gives us our answer, a grand total of 402,810, as written out in the short code cell below. Brilliant.

This was a very dense lesson, pun intended, but such is the way, and we're going to be moving on to greater things. Namely, we're going to set up TensorBoard, which will allow us to watch what our model is doing while it's training, and then we're going to train our model. So there's some really cool stuff coming up in the next lessons, and I'm looking forward to seeing you there. Take care.
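For completeness, here is the parameter arithmetic from this lesson as a small Python cell you could paste into the notebook. The layer sizes are the ones from the summary above; only the variable names are mine:

```python
# Recomputing the parameter counts from model_1.summary() by hand.
inputs = 32 * 32 * 3            # 3,072 input values per image

layer_1 = inputs * 128 + 128    # weights + one bias per neuron = 393,344
layer_2 = 128 * 64 + 64         # = 8,256
layer_3 = 64 * 16 + 16          # = 1,040
output_layer = 16 * 10 + 10     # = 170

total = layer_1 + layer_2 + layer_3 + output_layer
print(total)                    # 402,810
```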