In this lesson we're going to talk about compiling our model. Remember how we talked about the TensorFlow three-step process for working with models in the last lesson? We've defined our model. This meant specifying the number of layers, the number of neurons, and the type of activation functions inside those neurons. In other words, we laid out the structure of our neural network.

Now it's time for step 2, and in step 2 we compile the model. This means telling TensorFlow about the kind of calculations it will have to do down the line. Why do we have to do this? We have to do this because TensorFlow needs to create its graph behind the scenes. We briefly touched upon the graph in the previous module, where we were working with a pre-trained model. The graph is important because TensorFlow needs to know how to organize its calculations.

What kind of calculations am I talking about? Well, TensorFlow needs to calculate the loss, for example, right? It needs to calculate how far away the model's prediction was from the true value. TensorFlow will also need to update the weights as the model is being trained. And there might be all sorts of other calculations that you're doing along the way. For example, you might want to track the accuracy of the model during the training process; that way you can see how the accuracy improves over time. All of these calculations are things that we have to tell TensorFlow about beforehand, and this is what it means to compile a model using Keras.

One of the most important kinds of calculations that TensorFlow needs to know about is the kind of loss it's calculating. When we are compiling our model, we have to specify which loss, or cost, function we want to use. In our modules on regression and on gradient descent, we've already talked in detail about one particular kind of loss function, namely the mean squared error. It had a formula that looked something like this, and we even plotted our loss on a chart along with the steps that we took to minimize the loss. However, the mean squared error is not the only loss function out there. Just like there were multiple types of activation functions, there are also different kinds of loss functions for different kinds of problems. In this case we're not doing a regression. We're doing a classification with multiple classes, and we're getting a probability for each class.
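For reference, the mean squared error formula mentioned a moment ago can be written in standard notation as follows. The lecture's slide isn't reproduced in this transcript, so take this as the usual textbook definition, assuming n training examples with true values y_i and predictions ŷ_i:

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]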
Remember that softmax function in our output layer? That's where that per-class probability is coming from. The loss function that best matches this kind of task is called categorical cross-entropy. The cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.

This is probably best illustrated with a very quick example. Let me show you what the cross-entropy loss looks like so that we can get a better feel for it. Imagine we've got this neural network and we have a picture of a cat. We want to know if this picture contains a cat or if it does not contain a cat. The loss in this case can be graphed: on the y-axis we've got the cost, or the loss, and on the x-axis we've got the predicted probability. That's going to be between 0 and 1, anywhere from the model predicting a 0 percent probability of there being a cat in the picture to a 100 percent probability.

Now, since we gave the model a picture of a cat, we know that the true value is equal to 1. Our y has the value of 1. The categorical cross-entropy loss, if we were to plot it on this chart, would look something like this. What we see is that the cross-entropy loss increases as the predicted probability diverges from the actual label. In other words, the left-hand side of this chart is where the model is predicting an almost zero percent probability of there being a cat in the picture, so this is a long way from the true value. If our model predicts there's only a 1 percent chance of there being a cat in the picture and the actual label is equal to 1, then this would be a very bad result and would lead to a very high loss value. On the other hand, if the model gets it right, the ideal loss should be equal to zero. This is where the model predicts a 100 percent probability that there is a cat in the picture, and there is indeed a cat. This is why, in this chart, the loss slowly decreases as the predicted probability approaches 1.

So what this function is telling us is that the categorical cross-entropy loss really penalizes predictions that are both confident and wrong. This is how you can understand this loss function. Now let me show you the formula for cross-entropy. It's a sum, across all the categories, of the actual value for the label.
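Written out, the formula being described here is the standard categorical cross-entropy. The slide itself isn't reproduced in the transcript, so this is the usual definition; note the minus sign out front, which is part of the standard formula and is what makes the loss come out non-negative, even though it isn't called out explicitly in the narration:

\[
L = -\sum_{c=1}^{C} y_c \, \log\!\left( \hat{y}_c \right)
\]

Here C is the number of categories, y_c is 1 for the true category and 0 for every other category, and ŷ_c is the model's predicted probability for category c.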
This is the y, which will be either 1 or 0, and this y is multiplied by the log of the predicted probability, our y-hat. We have a sum in this formula because we've got more than one category. If there are 10 categories, then we're summing 10 different terms, and if there are two categories, then we're only summing two terms.

So say the actual value is equal to 1, like we've got in this example: we've got a picture of a cat, and the model predicted a probability of 1 of there being a cat in the picture. What would the loss be in this case? Looking at this formula, y is equal to 1, so it's 1 times the log of 1, and because y-hat in this case is also equal to 1, the log of 1 is equal to zero. But then we've got the other category, right? No cat. For that category the true label y is equal to zero, because there is indeed no cat, so that whole term contributes nothing to the sum. The total loss therefore comes out to zero, which is exactly what we want for a perfect prediction. This is how you would substitute the values into this formula. So even though the name of this loss function looks very long and very scary, the actual formula isn't so complicated.

Now that we've talked about the loss function and decided on one, it's time to figure out how the weights are actually adjusted, how we move down the loss function to minimize the loss as the algorithm is training on the data. In the module on gradient descent, we saw how the gradient descent algorithm adjusts the weights and minimizes the loss. Now, gradient descent is not bad, but there has been a lot of research into how this algorithm could be improved or made more efficient, and how its shortcomings could be addressed. What all these researchers were looking for was essentially a better way to optimize the cost, and the result of this research is that we now have a variety of optimizers to choose from. We are spoiled for choice once again.

So what do I mean by optimizer? Well, you can think of an optimizer as an algorithm that calculates the loss and adjusts the weights. One of the questions you might ask at this point is: what kind of shortcomings does gradient descent have that incentivized all these researchers to come up with a better solution? Well, speed, for example, would be one criterion to look at, especially if you've got lots and lots of training data and a big, complex model.
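To make the idea of "an algorithm that adjusts the weights" a bit more concrete, here is a minimal sketch of a single plain gradient-descent step written with TensorFlow's GradientTape. This is my own toy illustration on random data, not code from the course, and it is the kind of manual update that the optimizers discussed next automate and improve on:

```python
import tensorflow as tf

# Toy data: 4 samples with 3 features each, and a regression target.
X = tf.random.normal((4, 3))
y_true = tf.random.normal((4, 1))

# Parameters of a single linear layer: weights and a bias.
w = tf.Variable(tf.random.normal((3, 1)))
b = tf.Variable(tf.zeros((1,)))

learning_rate = 0.01

# One plain gradient-descent step.
with tf.GradientTape() as tape:
    y_pred = tf.matmul(X, w) + b
    loss = tf.reduce_mean(tf.square(y_true - y_pred))  # mean squared error

grad_w, grad_b = tape.gradient(loss, [w, b])
w.assign_sub(learning_rate * grad_w)  # w <- w - lr * dLoss/dw
b.assign_sub(learning_rate * grad_b)  # b <- b - lr * dLoss/db
```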
With lots of data and a big, complex model, the optimizer can make a huge difference, and it might be the difference between waiting a few hours for training to finish or waiting a couple of days. To look at our menu of optimizers, we can once again go to the Keras documentation, scroll down, and take a look at what we've got available to us. There we can see we've got stochastic gradient descent (SGD), RMSprop, Adagrad, Adadelta, and more. These are all rather mysterious-sounding names, and there are definitely quite a few different optimizers out there. So which one do you choose?

Well, I can tell you this: an incredibly popular, state-of-the-art optimizer is called Adam, and Adam is also the optimizer that we will use as part of this course. You see, Adam was presented in a research paper back in 2015, and the reason I really like the Adam optimizer is that it is very computationally efficient and has very low memory requirements. This means that even a less powerful computer can train a model without waiting too long. Another really nice feature of Adam is that it requires very, very little configuration, or, to quote the late Steve Jobs, it just works.

Now that we've talked about our loss function and our optimizer, let's head back into the Jupyter notebook and compile our model. I'm going to do this in the cell right here, just below the code where we defined our model and laid out its structure. The way we compile our model using Keras is by typing the name of the model, so model_1, then a dot, and then compile. When we open the parentheses, we have to specify three things: the optimizer we want to use, the loss function we want to use, and which metrics to calculate.

For the optimizer, we said we would use Adam, so that's optimizer equals, in single quotes, adam. For the loss, we're going to use the categorical cross-entropy. Keras actually gives us a variation of the categorical cross-entropy called sparse_categorical_crossentropy, which works directly with integer class labels and is slightly more computationally efficient; it otherwise behaves pretty much the same way as the regular categorical cross-entropy, so that's the one we'll use. Finally, we're going to specify the metrics that we want Keras to calculate as we're training our model, so metrics is going to be equal to square brackets, single quotes, accuracy. And that's it: we've compiled our model in a single line of code.
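Here is a rough, self-contained sketch of what that might look like in the notebook. The variable name model_1 and the layer sizes match how the lecture describes the model, but the hidden-layer activations and the flattened 32 x 32 x 3 input shape are my assumptions, inferred from the parameter counts discussed later in this lesson, not code copied from the course:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed architecture: flattened 32x32x3 inputs, hidden layers of
# 128, 64 and 16 units, and a 10-class softmax output layer.
model_1 = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(32 * 32 * 3,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(10, activation='softmax'),
])

# The compile step: optimizer, loss function, and the metrics to track.
model_1.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```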
I'm going to split this up a little bit so it's a bit easier to read, and just make sure there really are no typos in this line. But of course the proof is in the pudding, so let me hit Shift+Enter and see if I've made an error. Looks like we're all good.

So now that we've outlined the structure of our model and compiled it successfully, let's actually take a quick look at it in our Jupyter notebook. There's a very neat little function from Keras called summary, so model_1.summary() and Shift+Enter will show us what our model actually looks like. What we see here are three columns. We see the layers that we've got: one, two, three, four layers. We also see the output shape of each layer; this should correspond to the number of units you outlined when you were defining your model. So our first layer has 128 neurons, the second layer 64, the third 16, and the last layer has 10 neurons, 10 different outputs. And over here we see the number of parameters in each layer and the total number of parameters. In this case we've got about 400,000 parameters.

Now, one thing that might seem a little strange is the names of these layers, right? dense_5, dense_6. This has to do with the fact that if I go to the cell here, hit Shift+Enter again, and then refresh my summary, the names of the layers change. They update because they're auto-generated. Given that this is a little bit confusing, one of the things we can do is give these layers a name. So we can come back up to where we defined our model and say name equals, in single quotes, something like m1_hidden_1, and I'm going to give my other layers a name as well: m1_hidden_2, m1_hidden_3, and m1_output. If I hit Shift+Enter on that cell again and come down to my summary and refresh it, you'll see that these names are now updated to the names we specified, as in the sketch below. So that's quite handy, right?

But there's one other big conceptual topic that I want to talk about now, and it has to do with the structure of our neural network, because so far we've looked at it in a slightly simplistic way, and now is a good chance to appreciate it more fully. Let's work out where this total number of parameters actually comes from.
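Before we do that arithmetic, here for reference is a sketch of the model definition with named layers and the summary call. The exact names (m1_hidden_1 and so on) are my reading of what is typed in the video, so treat them as placeholders you can change:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Same assumed architecture as before, but each layer now has an explicit
# name, so model_1.summary() stops showing auto-generated names like dense_5.
model_1 = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(32 * 32 * 3,),
                 name='m1_hidden_1'),
    layers.Dense(64, activation='relu', name='m1_hidden_2'),
    layers.Dense(16, activation='relu', name='m1_hidden_3'),
    layers.Dense(10, activation='softmax', name='m1_output'),
])

model_1.summary()  # prints layer names, output shapes and parameter counts
```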
Why is that total so high, and can we calculate it manually, in order to better understand all the things that have to be estimated for our neural network? In our introductory lectures, we talked a little bit about how important the connection weights were for the neurons, and about how the number of connections really grew with the size and complexity of the model. This in turn had an impact on how much data was required to accurately estimate all of these parameters. The very simple model on that slide had about 90 different connections between the neurons. But just because there are 90 different connections, 90 different weights to estimate, doesn't mean that's the total number of parameters. You see, individual neurons don't just have weights; individual neurons also have a bias.

Now, at this point I'd like you to make another mental leap. The key thing when talking about neurons in a neural network is actually their activation functions. The activation function determines how strongly a neuron will fire; as a matter of fact, a neuron kind of is its activation function. So when we talk about learning and changing the connection weights, what we're actually doing is changing the activation function. For example, an activation function might become steeper or flatter. This is what changing the weights in the learning step actually translates into: it affects the shape of the activation function, and as the activation function changes, it of course affects how strongly that neuron fires. So changing the weights makes an activation function steeper or flatter. What about shifting the activation function from, say, left to right? Well, that's what the bias does. The bias is what can shift the entire curve.

So with these two things, the weights and the bias, an individual activation function can be manipulated. It can be stretched, made steeper or flatter, moved up or down, or moved left or right. All of these are just parameters that the neural network can update as it learns. So let's come back into the Jupyter notebook and make sense of the total number of parameters in this summary, working it out for each individual layer. That very first layer has 393,344 different parameters. Why is that? Well, the size of our inputs, we said, was 32 times 32 times 3, right?
And then we had 128 neurons in this layer, so all of that times 128. That gives us the number of connection weights, which is equal to 393,216. So what's missing? Well, it's our bias parameters. You see, each neuron doesn't just have weights; it also has a bias. So if we add 128 to that total, one bias per neuron, then we arrive at 393,344.

To arrive at the grand total, we have to do this calculation for the three remaining layers. In other words, layer number two is going to be 128 inputs times 64 neurons, plus another 64 bias terms, one for each neuron in that second hidden layer. The third hidden layer is 64 inputs times 16 neurons, plus 16 bias terms, one for each neuron. And finally, the output layer takes 16 inputs from the third hidden layer, times 10 neurons, plus 10 bias terms. Hitting Shift+Enter on the cell gives us our answer, a grand total of 402,810, as written out in the short code cell below. Brilliant.

This was a very dense lesson, pun intended, but such is the way, and we're going to be moving on to greater things. Namely, we're going to set up TensorBoard, which will allow us to watch what our model is doing while it's training, and then we're going to train our model. So there's some really cool stuff coming up in the next lessons, and I'm looking forward to seeing you there. Take care.
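For completeness, here is the parameter arithmetic from this lesson as a small Python cell you could paste into the notebook. The layer sizes are the ones from the summary above; only the variable names are mine:

```python
# Recomputing the parameter counts from model_1.summary() by hand.
inputs = 32 * 32 * 3            # 3,072 input values per image

layer_1 = inputs * 128 + 128    # weights + one bias per neuron = 393,344
layer_2 = 128 * 64 + 64         # = 8,256
layer_3 = 64 * 16 + 16          # = 1,040
output_layer = 16 * 10 + 10     # = 170

total = layer_1 + layer_2 + layer_3 + output_layer
print(total)                    # 402,810
```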