In this lesson we're going to talk about compiling our model. Remember how we talked about the TensorFlow three-step process when working with models? In the last lesson we defined our model. This meant specifying the number of layers, the number of neurons, and the type of activation functions inside those neurons. In other words, we laid out the structure of our neural network.
Now it's step 2, and in step 2 we compile the model. This means telling TensorFlow about the kind of calculations it will have to do down the line. Why do we have to do this? Because TensorFlow needs to create its graph behind the scenes. We briefly touched upon the graph in the previous module, where we were working with a pre-trained model. The graph is important because TensorFlow needs to know how to organize its calculations.
What kind of calculations am I talking about? Well, TensorFlow needs to calculate the loss, for example. It needs to calculate how far away the model's prediction was from the true value. TensorFlow will also need to update the weights as the model is being trained. And there might be all sorts of other calculations that you're doing along the way. For example, you might want to track the accuracy of the model during the training process, so you can see how the accuracy improves over time. All of these calculations are things that we have to tell TensorFlow about beforehand. And this is what it means to compile a model using Keras.
One of the most important kinds of calculation that TensorFlow needs to know about is the kind of loss that it's calculating. When we are compiling our model, we have to specify the loss, or cost function, that we want to use. In our modules on regression and on gradient descent, we've already talked in detail about one particular kind of loss function, namely the mean squared error.
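As a reminder, the mean squared error is commonly written as below. The symbols are my own notation and may differ from the slide: n is the number of examples, y_i is the true value, and \hat{y}_i is the model's prediction for example i.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```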
It had a formula that looked something like this, and we even plotted our loss on a chart along with the steps that we took to minimize the loss. However, the mean squared error is not the only loss function out there. Just like there were multiple types of activation functions, there are also different kinds of loss functions for different kinds of problems. In this case we're not doing a regression. We're doing a classification with multiple classes, and we're getting a probability for each class. Remember that softmax function in our output layer? That's where we're getting that probability from.
The loss function that best matches this kind of task is called categorical cross entropy. The cross entropy loss measures the performance of a classification model which provides a probability value between 0 and 1 as an output. This is probably best illustrated with a very quick example. Let me show you what the cross entropy loss looks like so that we can get a better feel for it.
Imagine we've got this neural network and we have a picture of a cat. Now we want to know if this picture contains a cat or if it does not contain a cat. The loss in this case can be graphed. On the y axis we've got the cost, or the loss, and on the x axis we've got the predicted probability. So that's going to be between 0 and 1. At one end the model gives a zero percent probability of it being a cat, and at the other a 100 percent probability of there being a cat in the picture. Now, since we gave the model a picture of a cat, we know that the true value is equal to 1. Our y has the value of 1.
Now the categorical cross entropy loss, if we were to plot it on this chart, would look something like this. What we see is that the cross entropy loss increases as the predicted probability diverges from the actual label. In other words, the left-hand side of this chart is where the model is predicting an almost zero percent probability of there being a cat in the picture. So this is a long way from the true value.
If our model predicts there's only a 1 percent chance of there being a cat in the picture and the actual label is equal to 1, then this would be a very bad result and would lead to a very high loss value. On the other hand, if the model gets it right, the ideal loss should be equal to zero. This is where the model predicts a hundred percent probability that there is a cat in the picture and there is indeed a cat. This is why, in this chart, the loss decreases toward zero as the predicted probability approaches 1. So what this function is telling us is that the categorical cross entropy loss really penalizes predictions that are both confident and wrong. This is how you can understand this loss function.
Now let me show you the formula for cross entropy. It's the negative of the sum, across all the categories, of the actual value for the label, the y, which will be either 1 or 0, multiplied by the log of the predicted probability, our y hat. We have a sum in this formula because we've got more than one category. If there are 10 categories, then we're summing 10 different terms. And if there are two categories, then we're only summing two terms.
So say the actual value is equal to one, like we've got in this example: we've got a picture of a cat, and the model predicted a probability of one of there being a cat in the picture. What would be the loss in this case? Looking at this formula, y is equal to 1, so it's 1 times log of 1, because y hat in this case is also equal to 1. The log of 1 is equal to zero. But then we've got the other category, right? No cat. So it's either cat or no cat. In this case y hat is zero, and the log of zero isn't a finite number, but y is also equal to zero, because the picture does not belong to the "no cat" category, so this term is taken to contribute nothing. The total loss is zero. This is how you would essentially substitute the values into this formula. So even though the name for this loss function looks very long and very scary, the actual formula isn't so complicated.
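To make the cat example concrete, here is a minimal sketch of the cross entropy calculation for a single picture, in plain Python. The function name and the two-category layout `[cat, no cat]` are my own choices for illustration. Terms where y is 0 are skipped, which is the usual convention for treating 0 times log(0) as 0; real libraries typically clip the probabilities instead.

```python
import math


def categorical_cross_entropy(y_true, y_pred):
    """Cross entropy for one example: the sum over categories of -y * log(y_hat).

    Categories with y == 0 are skipped, so a predicted probability of 0
    for a category the example does not belong to contributes nothing.
    """
    return sum(-y * math.log(p) for y, p in zip(y_true, y_pred) if y != 0)


# Two categories: [cat, no cat]. The picture is a cat, so y = [1, 0].
y_true = [1, 0]

for p_cat in (0.01, 0.5, 0.99, 1.0):
    y_pred = [p_cat, 1.0 - p_cat]
    loss = categorical_cross_entropy(y_true, y_pred)
    print(f"predicted P(cat) = {p_cat:.2f} -> loss = {loss:.4f}")
```

The loss is roughly 4.6 for a confident wrong answer (1 percent), about 0.69 for a coin flip, and exactly zero when the model is fully confident and right, which matches the shape of the curve described above.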
Now that we've talked about the loss function and we've decided on one, it's time to figure out how those weights are actually adjusted, and how we move down the loss function to minimize the loss as the algorithm is training on the data. In the module on gradient descent, we've seen how the gradient descent algorithm adjusts the weights and minimizes the loss. Now, gradient descent is not bad, but there has been a lot of research into how this algorithm could be improved or made more efficient, and how its shortcomings could be addressed. What all these researchers were looking for was essentially a better way to optimize the cost. The result of this research is that we now have a variety of optimizers to choose from. We are spoiled for choice once again.
So what do I mean by optimizer? Well, you can think of an optimizer as an algorithm that uses the loss to adjust the weights. One of the questions you might ask at this point is: what kind of shortcomings might gradient descent have that incentivized all these researchers to try to come up with a better solution? Well, speed, for example, would be one criterion that you could look at. If you've got lots and lots of training data and you've got a big, complex model, then the optimizer can make a huge difference. It might be the difference between waiting some hours to finish training and waiting a couple of days.
To look at our menu of optimizers, we can once again go to the Keras documentation, scroll down, and take a look at what we've got available to us. There we can see we've got stochastic gradient descent, RMSprop, Adagrad, Adadelta. These are all really mysterious-sounding names, and there are definitely quite a few different optimizers out there. So which one do you choose? Well, I can tell you this: an incredibly popular, state-of-the-art optimizer is called Adam, and Adam is also the optimizer that we will use as part of this course.
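To give a feel for what "adjusting the weights" means in practice, here is a rough sketch that puts a plain gradient descent step next to an Adam step on a toy one-weight problem. None of this comes from the lesson itself: the toy loss (w - 3)^2, the function names, and the learning rate of 0.1 are made up for illustration, and the Adam update follows the standard rule from the original paper rather than the Keras source code.

```python
import math


def gradient(w):
    # Gradient of the toy loss (w - 3)^2, which has its minimum at w = 3.
    return 2.0 * (w - 3.0)


def plain_gradient_descent(w, learning_rate=0.1, steps=500):
    # Each step moves the weight against the gradient by a fixed fraction.
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def adam(w, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0  # running average of the gradient
    v = 0.0  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1.0 - beta2 ** t)
        # The step size is scaled by the recent gradient magnitudes.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_gd = plain_gradient_descent(0.0)
w_adam = adam(0.0)
print(f"gradient descent: w = {w_gd:.4f}")
print(f"Adam:             w = {w_adam:.4f}")
```

Both versions end up close to w = 3 here. The toy problem is far too small to say anything about speed; the point is only that an optimizer is a rule for turning gradients of the loss into weight updates, and that Adam keeps two running averages to adapt its step size.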
You see, Adam was presented in a research paper back in 2015, and the reason that I really like the Adam optimizer is that it is very computationally efficient and it has very low memory requirements. This means that even a less powerful computer can train a model without waiting too long. Another really nice feature of Adam is that it requires very little configuration. Or, to quote the late Steve Jobs: it just works.
Now that we've talked about our loss function and our optimizer, let's head back into Jupyter notebook and compile our model. I'm going to do this in this cell right here, just below the code where we define our model and lay out the structure. The way we compile our model using Keras is by typing the name of the model, so model_1, then a dot, and then writing compile. When we open the parentheses we have to specify three things: the optimizer that we want to use, the loss function that we want to use, and which metrics to calculate.
For the optimizer, we said we would use Adam. So that's optimizer equals, in single quotes, adam. For the loss we're going to use the categorical cross entropy. Now, Keras actually gives us a slightly more computationally efficient variation of the categorical cross entropy, and this is sparse_categorical_crossentropy. This loss function works pretty much the same way as the categorical cross entropy, but it's slightly more computationally efficient. Finally, we're going to specify the metrics that we want Keras to calculate as we're training our model. So metrics is going to be equal to, in square brackets and single quotes, accuracy.
And that's it. We've compiled our model in a single line of code. I'm going to split this up a little bit so it's easier to read, and just make sure that there are no typos in this line. But of course the proof is in the pudding, so let me hit Shift+Enter and see if I've made an error. Looks like we're all good.
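The compile step dictated above would look roughly like the sketch below. The model definition is a stand-in so that the snippet runs on its own: the layer sizes (128, 64, 16, 10) and the flattened 32 x 32 x 3 input are taken from the parameter counts discussed in this lesson, but the `relu` activations on the hidden layers are an assumption, since the transcript only confirms softmax on the output layer.

```python
import tensorflow as tf

# Stand-in for the model defined in the previous lesson.
# Assumption: hidden layers use 'relu'; only the softmax output is confirmed.
model_1 = tf.keras.Sequential([
    tf.keras.Input(shape=(32 * 32 * 3,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# The three things compile needs: optimizer, loss, and metrics to track.
model_1.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
```

One practical difference worth knowing: the sparse variant of the loss expects the labels as plain integers (for example 3 for the fourth class), whereas the non-sparse categorical cross entropy expects one-hot encoded labels.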
So now that we've outlined the structure of our model and we've compiled it successfully, let's take a quick look at it in our Jupyter notebook. There's a very neat little function from Keras called summary. So model_1.summary() and Shift+Enter will show us what our model actually looks like. What we see here are three columns. We see the layers that we've got: one, two, three, four layers. And here we see the output shape of each layer. This should correspond to the number of units that you've outlined here when you were defining your model. So our first layer has 128 neurons, the second layer 64, and the last layer has 10 neurons, 10 different outputs. And over here we see the number of parameters in each layer and the total number of parameters. In this case we've got about 400,000 parameters.
Now, one thing that might seem a little strange is the names of these layers, right? dense_5, dense_6. This has to do with the fact that if I go to the cell here, hit Shift+Enter again, and now refresh my summary, the names of the layers change. They update because they're auto-generated. Given that this is a little bit confusing, one of the things that we can do is give these layers a name as well. We can come back up to where we defined our model and say name is equal to, in single quotes, m1_hidden1. And I'm going to give my other layers a name as well: m1_hidden2, m1_hidden3, and m1_output. If I hit Shift+Enter on the cell again, come down here to my summary, and refresh it, then you'll see that these names are now updated to the names that we specified here. So that's quite handy, right?
But there's one other big conceptual topic that I want to talk about now, and this has to do with the structure of our neural network, because so far we've looked at it in a bit of a simplistic way, and now is a good chance to appreciate it a bit more fully.
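The naming step described above could look something like this. As before, the model is a self-contained stand-in: the layer sizes come from this lesson, the `relu` activations are assumed, and the exact layer names are my best reading of what was typed on screen.

```python
import tensorflow as tf

# Assumption: hidden activations and the exact name strings are illustrative.
model_1 = tf.keras.Sequential([
    tf.keras.Input(shape=(32 * 32 * 3,)),
    tf.keras.layers.Dense(128, activation='relu', name='m1_hidden1'),
    tf.keras.layers.Dense(64, activation='relu', name='m1_hidden2'),
    tf.keras.layers.Dense(16, activation='relu', name='m1_hidden3'),
    tf.keras.layers.Dense(10, activation='softmax', name='m1_output'),
])

# Prints the layer names, output shapes, and parameter counts.
model_1.summary()

# With explicit names, re-running the cell no longer changes them.
layer_names = [layer.name for layer in model_1.layers]
print(layer_names)
```

Layer names have to be unique within a model, so a prefix per model (here `m1_`) is a simple way to keep them from clashing if you define a second model in the same notebook.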
Let's work out where this total number of parameters actually comes from. Why is it so high, and can we calculate it manually in order to better understand all the things that have to be estimated for our neural network? In our introductory lectures, we talked a little bit about how important the connection weights were for the neurons, and about how the number of connections really grew with the size and complexity of the model. This in turn had an impact on how much data was required to accurately estimate all of these parameters. This very simple model here on this slide had about 90 different connections between the neurons. Now, just because there are 90 different connections, 90 different weights to estimate, doesn't mean this is the total number of parameters. You see, the thing is, individual neurons don't just have weights. Individual neurons also have a bias.
Now at this point I'd like you to make another mental leap. The key thing when talking about neurons in a neural network is actually their activation functions. The activation functions determine how strongly a neuron will fire. As a matter of fact, a neuron kind of is the activation function. So when we talked about learning and changing the connection weights, what we're actually doing is changing the activation function. For example, an activation function might become steeper or flatter. This is what changing the weights in the learning step actually translates into. It affects the shape of this activation function, and as this activation function changes, it of course affects how strongly that neuron fires.
Now, changing the weights makes an activation function steep or flat. What about shifting the activation function from, say, left to right? Well, that's what the bias does. The bias is what can shift the entire curve. So with these two things, the weights and the bias, an individual activation function can be manipulated. It can be stretched.
It can be made steep or flat, or moved up or down, or moved left or right. All of these are just parameters that the neural network can update as it learns.
So let's come back into Jupyter notebook and make sense of the total number of parameters in this summary. Let's work it out for each individual layer. That very first layer has 393,344 different parameters. Why is that? Well, the size of our inputs, we said, was 32 times 32 times 3, right? And then we had 128 neurons in this layer, so all of that times 128. This gives us the number of connection weights, and that's equal to 393,216. So what's missing? Well, it's our bias parameters. You see, each neuron doesn't just have weights. It also has a bias. So if we add 128 to this total, then we arrive at 393,344.
To arrive at the grand total, we have to do this calculation for the three remaining layers. In other words, layer number two, the second hidden layer, is going to be equal to 128 inputs times 64 neurons, plus another 64 bias terms, one for each neuron. The third hidden layer is just going to be 64 inputs times 16 neurons, plus 16 bias terms, one for each neuron. And of course the final layer is going to be 16 inputs from the third hidden layer times 10 neurons, plus 10 bias terms. Hitting Shift+Enter on the cell gives us our answer, our grand total: 402,810. Brilliant.
This was a very dense lesson, pun intended, but such is the way, and we're going to be moving on to greater things. Namely, we're going to set up TensorBoard, which will allow us to watch what our model is doing while it's training, and then we're going to train our model.
So there's some really cool stuff coming up in the next lessons, and I'm looking forward to seeing you there. Take care.
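The parameter arithmetic from this lesson can be checked with a few lines of plain Python. This is a sketch rather than the notebook cell used in the video; the helper name `dense_params` is made up for illustration, and the layer sizes are the ones quoted in the walkthrough.

```python
def dense_params(n_inputs, n_neurons):
    # One weight per input-to-neuron connection, plus one bias per neuron.
    return n_inputs * n_neurons + n_neurons


# Flattened 32 x 32 x 3 input, three hidden layers, then 10 outputs.
layer_sizes = [32 * 32 * 3, 128, 64, 16, 10]

# Pair each layer's input size with its neuron count.
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [393344, 8256, 1040, 170]
print(total)      # 402810
```

Almost all of the parameters sit in the first layer, because that is where the 3,072 input values fan out to 128 neurons.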