In this lesson we will talk about some fundamental concepts of machine learning, namely overfitting and regularization. The reason we're talking about them now is that we encountered this problem in our last lesson: there, we saw that after 150 epochs our validation error started to creep up as we continued to train our model. So why did this happen, and why is it a problem?

Well, all of this has to do with a phenomenon called overfitting. Overfitting happens when your model learns the training data too well, meaning the model starts to learn all the little quirks that are present in your training data, and it becomes better and better at fitting itself to that training data. It not only learns the relationships that are present in your training data, but it also learns all the noise that's there too. As a consequence, the model becomes unable to generalize well. In other words, the model becomes unable to reliably predict future observations outside of the training dataset. So a consequence of overfitting is poor performance on the validation dataset, on the test dataset, and on any future observations that you collect.

The easiest way to understand this problem of overfitting is visually, so I'm going to show you a series of charts to build up this intuition. Suppose we've got some sample data here: we've got some x's and some y's, and our goal is to plot some regression line. Now, if we tried to draw a line that connects as many of these points as possible, we would end up with a really squiggly line. Maybe the line would end up looking like this. And yes, this line minimizes the distance between itself and all the points, but it's looking quite complex. That's the first thing. The second thing is that it's not looking terribly realistic. And this lack of realism becomes clear once more data points are collected and added to the plot. If we do that, the line starts to fit the new data rather poorly, meaning this model is a clear example of overfitting. But conversely, what this also means is that collecting more data is actually a good way to combat overfitting in the first place.

The thing is, overfitting is a problem that is present across many machine learning techniques, not just neural networks. But neural networks are actually particularly prone to overfitting. Why?
Well, neural networks tend to have many, many parameters. Our model alone had around 400,000. Compare that with a very complex regression with, say, 100 variables, or 20. In effect, the more complex the model, the more prone it is to overfitting, and neural networks are able to learn the training data especially well.

So if this is an example of a model overfitting the data, what does the opposite look like? What would underfitting look like in the regression context? Well, it would look something like this. In this case, the model has not really learned the relationship present in the training data well enough. This is a good analogy to what we had when we only trained our neural network for one epoch. What you would ideally want is a balance: you want your model to learn the relationships that are present in the data, but you don't want the model to overfit.

Now, just to show you another example, suppose you had a classification problem instead of a regression problem. What would overfitting look like in this case? With a classification problem, it's actually the decision boundary that starts to look rather strange and esoteric when overfitting is present. The green line is a model that's overfitting the data. It's trying very hard to divide up those two groups, and as a consequence the model is overreacting to the noise that's present in the training dataset. A much better decision boundary would be something like the black line. The black line is likely to generalize much better and gives better results on the validation and test datasets.

So now that we know what overfitting and underfitting are, the question is: how do we diagnose these problems, and how do we prevent them?

Well, we've already talked about one way of diagnosing overfitting: looking at the performance on the validation dataset. Is the loss on the validation dataset still decreasing, is it no longer decreasing, or is it even increasing? These are the kinds of signs to watch out for. Also, what's the performance on the validation dataset? How is the accuracy changing as we're training the model? And is there a big difference between the accuracy on the training dataset versus the accuracy on the validation dataset? These are the kinds of signs that you need to keep an eye out for.

Now suppose that we've diagnosed this problem of overfitting. How do we fix it?
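(Before we get to the fix, a quick aside: those diagnostic checks are easy to automate. Here is a minimal sketch, assuming we kept the History object that Keras's model.fit returns; the function name and the window size are just illustrations, not anything from the lesson's notebook.)

```python
def diagnose_overfitting(history, window=10):
    """Print the warning signs described above: a rising validation
    loss and a large gap between training and validation accuracy."""
    hist = history.history  # dict of per-epoch metrics kept by Keras

    # Older Keras versions use 'acc'/'val_acc'; newer ones use
    # 'accuracy'/'val_accuracy'.
    acc_key = 'accuracy' if 'accuracy' in hist else 'acc'

    val_loss = hist['val_loss']
    train_acc = hist[acc_key]
    val_acc = hist['val_' + acc_key]

    # Is the validation loss higher now than it was `window` epochs ago?
    rising = len(val_loss) > window and val_loss[-1] > val_loss[-window]
    print('Validation loss rising:', rising)

    # How far apart are training and validation accuracy at the end?
    print('Train/validation accuracy gap: %.3f' % (train_acc[-1] - val_acc[-1]))
```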
Well, the techniques for preventing overfitting are called regularization, and regularization actually encompasses a variety of techniques. All of them tend to impose some sort of constraint or some sort of penalty on complexity. The theoretical justification for penalizing complexity is that a simpler solution is preferable to a complex one. And this is actually very much in line with the Zen of Python, remember? Simple is better than complex, and complex is better than complicated.

Looking at the validation loss on TensorBoard, one regularization technique already seems obvious. If the problem is that we're training our model over too many epochs, then all we need to do is stop overtraining it, right? Case closed. Why not stop if the validation accuracy is no longer going up and our validation error is no longer decreasing? And the answer is that yes, this is actually a completely valid technique. This technique even has a name: it's called early stopping, and if you want to implement it, Keras actually gives us the option to do so in the callbacks section. You can search for early stopping, and there you can find that you can stop the training when a monitored quantity, like, say, the validation loss, has stopped improving. So all we would need to do is add early stopping as another callback to our list of callbacks right here.

But in addition to early stopping, there is another very, very powerful regularization technique that I'd like to show you, and it's called dropout. The dropout technique was published by a team of researchers in 2014. They found that if you randomly ignore some of the neurons during training, then you can reduce overfitting. In other words, during each training step, some random neuron, either in the input layer or in the hidden layers, is not considered.

Say this is our neural network. If we apply the dropout technique to the input layer, then we can specify a chance for every single one of these neurons to not be considered during training. So if there's a 20 percent chance that each of these neurons can drop out, then during the first training step maybe the first neuron and all of its connections will be ignored. If this neuron and all of its connections drop out, then the network shrinks: it becomes a less complex network because there are fewer connections. Then, during the next training step,
a different neuron might drop out: the neuron that dropped out during the first training step will come back, and a different one might drop out, again with a 20 percent probability.

Now, this technique might seem rather strange, right? Why does this work? Well, it means that the connected neurons, all the neurons downstream in that first hidden layer, can no longer rely too heavily on any single input. If a random neuron drops out during every training step, all of the connected neurons will try to hedge themselves so as not to weight any particular input too heavily, and this helps prevent overfitting.

But let me ask you this: will a network that implements the dropout technique learn differently than a network that has no dropout? What about speed? Will it learn just as quickly, or will it take more time? To answer these questions, let's dive back into our code. Let's create a second model in Jupyter and find out, because we can use TensorBoard to compare our second model with our first model and see how they learn differently.

Let's go to the section in our Jupyter notebook where we're defining our model. I'm going to add another cell below, and here I'm going to create our model_2, which is also going to be a sequential model. Now, previously we added all the layers inside the square brackets of the model, but this isn't the only way we can do things. In fact, when we go to the getting started guide, we see that Keras also has an add method for adding the different layers to the model. So after you've created the model, you can call .add and supply the layer that you want to add between the parentheses. So this is an alternative way to write your code in Keras, and let's try it out, since we might often see this style in blog posts or on Stack Overflow.

The first thing I'm actually going to do is recreate all of the layers in model number two. So I'll say model_2.add and add my first dense layer. This is going to be 128 units, with the activation being relu. Now I'm going to give this layer a name: I'm going to call it m2_hidden_1. Now I'll add the other layers very quickly: just copy this line, paste, paste, paste, and change the units here to 64, 16, and 10.
The last activation should be softmax, and the layer names of course have to be updated as well: this one is my output layer, this is hidden 3, this is hidden 2, and there we go. I've added all my layers as before, so this code and the earlier code are actually equivalent.

The only thing that's left to do is to add my dropout. Now, here's how I'm going to do it. First, I'm going to check the Keras documentation. Under layers I can see Dense, which I've used before, and if I scroll down, I will find Dropout. My dropout layer will require some probability between zero and one as the dropout rate, and the convention is that rates between 20 percent and 50 percent work rather well. We're going to try out 20 percent. And if you'd like to have the same random neurons drop out as me, then we need to set the same seed.

So let's implement this back in our notebook. The first thing we're going to do is import Dropout, back up in our import statements: under keras.layers, we're going to import Dense, Activation, and Dropout. Shift+Enter on the cell. Now we can scroll down to where we created our second model and write model_2.add and then Dropout. Our dropout probability, which we said was going to be 0.2, goes in first, and our seed is going to be equal to, say, 42.

But there is one more thing we have to specify, because this dropout will act as the first layer in our model: the input shape. For our dense layer, the parameter name was input_dim, for dimensions, but for our dropout layer the name is actually input_shape, and it will be equal to a tuple: so parentheses, then total_inputs, and then a comma afterwards. That is how we specify the total number of inputs for our dropout layer. Having added this code, we're applying our dropout to our input layer.

The only thing left to do is to compile our model, right? We can do this by copying the earlier code. We're going to use the same optimizer, the same loss, and the same metrics, so I'll paste it in here and change it to say model_2.compile. Fantastic. Let's hit Shift+Enter here.
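(For reference, here is roughly what that model_2 cell might look like as a whole. This is a sketch, not a verbatim copy of the notebook: TOTAL_INPUTS stands in for however many inputs the flattened images have, and the compile settings are assumptions, since the lesson simply reuses model 1's optimizer, loss, and metrics without restating them.)

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

TOTAL_INPUTS = 32 * 32 * 3  # assumption: flattened 32x32 RGB images

model_2 = Sequential()

# Dropout as the first layer: every input has a 20% chance of being
# ignored on each training step. Because it comes first, it also
# carries the input shape; seed=42 makes the dropped neurons
# reproducible.
model_2.add(Dropout(0.2, seed=42, input_shape=(TOTAL_INPUTS,)))

model_2.add(Dense(128, activation='relu', name='m2_hidden_1'))
model_2.add(Dense(64, activation='relu', name='m2_hidden_2'))
model_2.add(Dense(16, activation='relu', name='m2_hidden_3'))
model_2.add(Dense(10, activation='softmax', name='m2_output'))

# Assumed compile settings, matching what model 1 plausibly used.
model_2.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```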
Now we can come down here, copy the cell, paste it, change model 1 to model 2, change the name for TensorBoard from 'model 1' to 'model 2', and then finally fit our model and see how it behaves compared to our model without the dropout.

So let's try it out; I'm very curious already. Let's hit Shift+Enter on the cell, switch over to TensorBoard, and play the waiting game. If I scroll down here in the bottom left corner, I can see model 2 at 16:08. So here we go. We can see that the loss is coming down; this is still at epoch 14. As the model continues to run, the validation loss continues to decrease, then it has a bit of a wobble here, and then it's slowly ticking up again by around epoch 130. So what we're seeing here is that the dropout does indeed reduce overfitting, but it will not eliminate it completely. We have to use some combination of early stopping, that is, stopping the training of our model at a certain point, and the dropout technique to get a better result.

Now let's take a look at our validation accuracy. One thing that's quite interesting is that we actually haven't lost much accuracy: model 2 with the dropout and model 1 without the dropout are both around 30 percent accurate by the end of the training. One thing that is very noticeable is that the dropout model tends to learn a bit more slowly. It learns the training dataset more slowly than the model without the dropout, and this is exactly what we intended. By the way, if there are too many graphs on the screen, you can toggle them down here on the left-hand side. Suppose I just want to see model 2: then I can hit this radio button here. And if I want to do a little horse race with model 1 and model 2, I can tick the individual boxes here to get them side by side on the chart, like so.

Remember how I said that we could apply dropout to the input layer as well as to a hidden layer? Well, I've now got a challenge for you. What I'd like you to do is create a third model. Call it model_3, and give it two dropout layers. The first dropout layer should be the same as in model 2, so it should be applied to the inputs, but the second dropout layer should be added after the first hidden layer
and have a dropout rate of 25 percent. I'll give you a few seconds to pause the video and give this a go.

Ready? Here's the solution. I'm just going to take this cell, copy it, paste it, move it down, and change all the references from model_2 to model_3. After I've done that, I'll add the second dropout layer here: model_3.add, then Dropout with 0.25. Now, you're free to use the same seed as before, but in this case you don't actually have to add anything afterwards, because Keras is smart enough to infer the number of inputs for this dropout layer, since it is not the first layer. And that's it; it's all done.

Now we're going to get to the very interesting part: we're actually going to train our models on our entire training dataset, and no longer on the extra small training dataset that we've had before. We're going to see what a difference it makes when we move from a training dataset of 1,000 samples to 40,000 samples. So check it out. I'm going to copy this cell and paste it below. I'm going to take this one here and comment it out; I'll keep it around for reference. But I'm going to take model 1, and we're going to reduce the number of epochs we train this model for: we're going to go down from 150 to 100. Then we'll change our training dataset to x_train and y_train, and we're going to call this run 'model 1 XL', so that we know we're dealing with the large dataset. I'll copy the cell and paste it again, and here we're going to change this to model 2, and 'model 2' right here. Then copy that, paste it again, change this to model 3, and change the model name to 'model 3' as well.
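(Again for reference, here is a hedged sketch of the model_3 cell and one of those full-dataset training cells. x_train, y_train, x_val, y_val and the log-directory naming are placeholder names for whatever the notebook actually uses, and the compile settings are the same assumptions as before.)

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import TensorBoard

TOTAL_INPUTS = 32 * 32 * 3  # same assumption as in the model_2 sketch

# model_3: dropout on the inputs AND after the first hidden layer.
model_3 = Sequential()
model_3.add(Dropout(0.2, seed=42, input_shape=(TOTAL_INPUTS,)))
model_3.add(Dense(128, activation='relu', name='m3_hidden_1'))
# 25% dropout after the first hidden layer. No input shape is needed:
# Keras infers it, because this is not the first layer.
model_3.add(Dropout(0.25, seed=42))
model_3.add(Dense(64, activation='relu', name='m3_hidden_2'))
model_3.add(Dense(16, activation='relu', name='m3_hidden_3'))
model_3.add(Dense(10, activation='softmax', name='m3_output'))
model_3.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# One of the three full-dataset training cells (the other two are the
# same with model_1 / model_2). An EarlyStopping callback could be
# appended to the callbacks list, as discussed earlier.
model_3.fit(x_train, y_train,
            epochs=100,
            validation_data=(x_val, y_val),
            callbacks=[TensorBoard(log_dir='tensorboard_logs/model_3_XL')])
```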
This last cell here I'm going to comment out, and I'll move this up slightly, so now I should be left with these three cells. Very quickly, I'm going to come up here and recompile models 1 and 2 before I run my code, so that I know they're all going to be starting from scratch. The last thing I want to do is go into TensorBoard; here I'm actually only going to keep these two lines around, and I basically want to delete all the other runs that we've had. The way I'm going to do this is by deleting the log files directly. The two runs that I want to keep around are model 1 at 11:56 and model 2 at 16:08; the other ones I'm no longer that interested in, since they're all legacy runs. Then I'm going to come back into Jupyter, and I'm going to run this cell, run this cell, and run this cell.

Now my computer is really working hard; it's crunching a lot of numbers. We can go into TensorBoard and refresh the entire thing so that it will redraw the charts, and we can watch, but I think it's actually much more advisable to get a coffee at this point, because this could really take a while. One of the things, though, that's going to be obvious almost immediately is the massive improvement in the validation accuracy that you're going to get from your model as soon as you add more data. When we only had 1,000 training examples, we were doing very, very poorly on the validation dataset, and the number of epochs that we used to train our model wasn't able to make that much of a difference. But look at what happens with version one of our model when we go from 1,000 examples to 40,000 examples. This is really why people say that data is the new oil: the more data, the better.

All right, now all three versions of my model have finished training, and you can see that the first one took four minutes, the next one took about six minutes, and the last one took about eight minutes. Model 2 had dropout on the input layer, and model 3 had dropout on both the input layer and the first hidden layer. Let's compare how these models did in TensorBoard. The first thing that we're looking at is the accuracy on the training dataset, and here what we see is actually quite interesting. The red and orange lines are the models on the small training dataset, and the light blue, pink, and dark red lines are on the larger training dataset. What we see is that model 3, with the most dropout, learned the training data the most slowly.
This is especially obvious if you look at what was happening from epoch 50 onwards: that pink line is far below the other two. I think model 2 got off to a very strange start. Initially it seemed not to learn very much at all, and then the accuracy improved dramatically within a few epochs. I suspect this might be due to the weight initialization at the start; there is some randomness to the starting point of these algorithms' weights. So if we were to fit this model again, we might get a slightly different line. But nonetheless, we see that models with more dropout learn more slowly.

The next chart in TensorBoard shows us the training loss. Here we see that for every single run we've done, all of the models reduce the training loss the more they train. This is expected, and also not that interesting. So let's go down to the validation accuracy. This is much more interesting, because here we see the dramatic difference between the models that were trained on the small dataset of a thousand samples and the models that were trained on the larger dataset of 40,000 samples. Mind you, this is the validation accuracy on the 10,000 samples in the validation dataset. What we see is that models 1, 2, and 3 all end up between 47 and 49 percent accuracy on the validation dataset. But what we also see is that the accuracy didn't really change much towards the end, and this can indicate overtraining and overfitting.

Let's scroll down to the validation loss and see if this is true. What we see here is that models 2 and 3, which have dropout, have a validation loss that continues to decrease, although very, very slowly, almost right up to the end. Model 1, on the other hand, has a validation loss that's pretty much static after epoch 50, and this suggests to me that we should probably stop training model 1 a little earlier.

So where does this leave us? Well, there's a couple of things that we found out. One is that the amount of data makes a massive difference: increasing the amount of data in the training dataset can help reduce overfitting dramatically, and it will also boost the accuracy on data that the model hasn't seen before.
The second thing that we've seen is that early stopping is a useful technique for preventing overfitting; training our models for more and more epochs is not necessarily a good thing. The third thing that we've seen is that dropout is a useful technique for reducing overfitting: models that implement the dropout technique learn a little more slowly, but their validation errors creep up much later than those without dropout. And the fourth thing that we saw is a little more about the magnitude of these impacts: judging by the validation accuracy, increasing the amount of data forty-fold had a much bigger impact than adding one or two dropout layers to the structure of our model.

At the moment, our models are getting approximately 50 percent of the classifications correct. Now, if this was a spam filter, that would be terrible, because in that case an email is either spam or not spam. But in this case we're actually differentiating between 10 different things, and getting 50 percent of those classifications correct doesn't seem such a bad thing, especially since this is our first go. In the next lessons, we're going to be evaluating these models a little more closely: we're going to take a hard look at the predictions these models make on individual images, and we're going to evaluate how our favorite model fares on the test dataset. For all of that and more, I'll see you in the next lesson. Take care.