In this lesson we will talk about some fundamental concepts of machine learning, namely overfitting and regularization. The reason we're talking about them now is that we encountered this problem in our last lesson: there, we saw that after 150 epochs our validation error started to creep up as we continued to train our model. So why did this happen, and why is it a problem?

Well, all of this has to do with a phenomenon called overfitting. Overfitting happens when your model learns the training data too well, meaning the model starts to learn all the little quirks that are present in your training data, and it becomes better and better at fitting itself to that training data. It not only learns the relationships that are present in your training data, but it also learns all the noise that's there too. As a consequence, the model becomes unable to generalize well. In other words, the model becomes unable to reliably predict future observations outside of the training dataset. So a consequence of overfitting is poor performance on the validation dataset, on the test dataset, and on any future observations that you collect.

The easiest way to understand this problem of overfitting is visually, so I'm going to show you a series of charts to build up this intuition. Suppose we've got some sample data here: we've got some x's and some y's, and our goal is to plot some regression line. Now, if we tried to draw a line that connects as many of these points as possible, we would end up with a really squiggly line. Maybe the line would end up looking like this. And yes, this line minimizes the distance between itself and all the points, but it's looking quite complex. That's the first thing. The second thing is that it's not looking terribly realistic. And this lack of realism becomes clear once more data points are collected and added to the plot. If we do that, the line starts to fit the new data rather poorly, meaning this model is a clear example of overfitting. But conversely, what this also means is that collecting more data is actually a good way to combat overfitting in the first place.

The thing is, overfitting is a problem that is present across many machine learning techniques, not just neural networks. But neural networks are actually particularly prone to overfitting. Why?
Well, neural networks tend to have many, many parameters. Our model alone had around 400,000. Compare that with a very complex regression with, say, 100 variables, or 20. In effect, the more complex the model, the more prone it is to overfitting, and neural networks are able to learn the training data especially well.

So if this is an example of a model overfitting the data, what does the opposite look like? What would underfitting look like in the regression context? Well, it would look something like this. In this case, the model has not really learned the relationship present in the training data well enough. This is a good analogy to what we had when we only trained our neural network for one epoch. What you would ideally want is a balance: you want your model to learn the relationships that are present in the data, but you don't want the model to overfit.

Now, just to show you another example, suppose you had a classification problem instead of a regression problem. What would overfitting look like in this case? With a classification problem, it's actually the decision boundary that starts to look rather strange and esoteric when overfitting is present. The green line is a model that's overfitting the data. It's trying very hard to divide up those two groups, and as a consequence the model is overreacting to the noise that's present in the training dataset. A much better decision boundary would be something like the black line. The black line is likely to generalize much better and gives better results on the validation and test datasets.

So now that we know what overfitting and underfitting are, the question is: how do we diagnose these problems, and how do we prevent them?

Well, we've already talked about one way of diagnosing overfitting: looking at the performance on the validation dataset. Is the loss on the validation dataset still decreasing, is it no longer decreasing, or is it even increasing? These are the kinds of signs to watch out for. Also, what's the performance on the validation dataset? How is the accuracy changing as we're training the model? And is there a big difference between the accuracy on the training dataset versus the accuracy on the validation dataset? These are the kinds of signs that you need to keep an eye out for.

Now suppose that we've diagnosed this problem of overfitting. How do we fix it?
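(Before we get to the fix, a quick aside: those diagnostic checks are easy to automate. Here is a minimal sketch, assuming we kept the History object that Keras's model.fit returns; the function name and the window size are just illustrations, not anything from the lesson's notebook.)

```python
def diagnose_overfitting(history, window=10):
    """Print the warning signs described above: a rising validation
    loss and a large gap between training and validation accuracy."""
    hist = history.history  # dict of per-epoch metrics kept by Keras

    # Older Keras versions use 'acc'/'val_acc'; newer ones use
    # 'accuracy'/'val_accuracy'.
    acc_key = 'accuracy' if 'accuracy' in hist else 'acc'

    val_loss = hist['val_loss']
    train_acc = hist[acc_key]
    val_acc = hist['val_' + acc_key]

    # Is the validation loss higher now than it was `window` epochs ago?
    rising = len(val_loss) > window and val_loss[-1] > val_loss[-window]
    print('Validation loss rising:', rising)

    # How far apart are training and validation accuracy at the end?
    print('Train/validation accuracy gap: %.3f' % (train_acc[-1] - val_acc[-1]))
```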
Well, the techniques for preventing overfitting are called regularization, and regularization actually encompasses a variety of techniques. All of them tend to impose some sort of constraint or some sort of penalty on complexity. The theoretical justification for penalizing complexity is that a simpler solution is preferable to a complex one. And this is actually very much in line with the Zen of Python, remember? Simple is better than complex, and complex is better than complicated.

Looking at the validation loss on TensorBoard, one regularization technique already seems obvious. If the problem is that we're training our model over too many epochs, then all we need to do is stop overtraining it, right? Case closed. Why not stop if the validation accuracy is no longer going up and our validation error is no longer decreasing? And the answer is that yes, this is actually a completely valid technique. This technique even has a name: it's called early stopping, and if you want to implement it, Keras actually gives us the option to do so in the callbacks section. You can search for early stopping, and there you can find that you can stop the training when a monitored quantity, like, say, the validation loss, has stopped improving. So all we would need to do is add early stopping as another callback to our list of callbacks right here.

But in addition to early stopping, there is another very, very powerful regularization technique that I'd like to show you, and it's called dropout. The dropout technique was published by a team of researchers in 2014. They found that if you randomly ignore some of the neurons during training, then you can reduce overfitting. In other words, during each training step, some random neuron, either in the input layer or in the hidden layers, is not considered.

Say this is our neural network. If we apply the dropout technique to the input layer, then we can specify a chance for every single one of these neurons to not be considered during training. So if there's a 20 percent chance that each of these neurons can drop out, then during the first training step maybe the first neuron and all of its connections will be ignored. If this neuron and all of its connections drop out, then the network shrinks: it becomes a less complex network because there are fewer connections. Then, during the next training step,
a different neuron might drop out: the neuron that dropped out during the first training step will come back, and a different one might drop out, again with a 20 percent probability.

Now, this technique might seem rather strange, right? Why does this work? Well, it means that the connected neurons, all the neurons downstream in that first hidden layer, can no longer rely too heavily on any single input. If a random neuron drops out during every training step, all of the connected neurons will try to hedge themselves so as not to weight any particular input too heavily, and this helps prevent overfitting.

But let me ask you this: will a network that implements the dropout technique learn differently than a network that has no dropout? What about speed? Will it learn just as quickly, or will it take more time? To answer these questions, let's dive back into our code. Let's create a second model in Jupyter and find out, because we can use TensorBoard to compare our second model with our first model and see how they learn differently.

Let's go to the section in our Jupyter notebook where we're defining our model. I'm going to add another cell below, and here I'm going to create our model_2, which is also going to be a sequential model. Now, previously we added all the layers inside the square brackets of the model, but this isn't the only way we can do things. In fact, when we go to the getting started guide, we see that Keras also has an add method for adding the different layers to the model. So after you've created the model, you can call .add and supply the layer that you want to add between the parentheses. So this is an alternative way to write your code in Keras, and let's try it out, since we might often see this style in blog posts or on Stack Overflow.

The first thing I'm actually going to do is recreate all of the layers in model number two. So I'll say model_2.add and add my first dense layer. This is going to be 128 units, with the activation being relu. Now I'm going to give this layer a name: I'm going to call it m2_hidden_1. Now I'll add the other layers very quickly: just copy this line, paste, paste, paste, and change the units here to 64, 16, and 10.
The last activation should be softmax, and the layer names of course have to be updated as well: this one is my output layer, this is hidden 3, this is hidden 2, and there we go. I've added all my layers as before, so this code and the earlier code are actually equivalent.

The only thing that's left to do is to add my dropout. Now, here's how I'm going to do it. First, I'm going to check the Keras documentation. Under layers I can see Dense, which I've used before, and if I scroll down, I will find Dropout. My dropout layer will require some probability between zero and one as the dropout rate, and the convention is that rates between 20 percent and 50 percent work rather well. We're going to try out 20 percent. And if you'd like to have the same random neurons drop out as me, then we need to set the same seed.

So let's implement this back in our notebook. The first thing we're going to do is import Dropout, back up in our import statements: under keras.layers, we're going to import Dense, Activation, and Dropout. Shift+Enter on the cell. Now we can scroll down to where we created our second model and write model_2.add and then Dropout. Our dropout probability, which we said was going to be 0.2, goes in first, and our seed is going to be equal to, say, 42.

But there is one more thing we have to specify, because this dropout will act as the first layer in our model: the input shape. For our dense layer, the parameter name was input_dim, for dimensions, but for our dropout layer the name is actually input_shape, and it will be equal to a tuple: so parentheses, then total_inputs, and then a comma afterwards. That is how we specify the total number of inputs for our dropout layer. Having added this code, we're applying our dropout to our input layer.

The only thing left to do is to compile our model, right? We can do this by copying the earlier code. We're going to use the same optimizer, the same loss, and the same metrics, so I'll paste it in here and change it to say model_2.compile. Fantastic. Let's hit Shift+Enter here.
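(For reference, here is roughly what that model_2 cell might look like as a whole. This is a sketch, not a verbatim copy of the notebook: TOTAL_INPUTS stands in for however many inputs the flattened images have, and the compile settings are assumptions, since the lesson simply reuses model 1's optimizer, loss, and metrics without restating them.)

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

TOTAL_INPUTS = 32 * 32 * 3  # assumption: flattened 32x32 RGB images

model_2 = Sequential()

# Dropout as the first layer: every input has a 20% chance of being
# ignored on each training step. Because it comes first, it also
# carries the input shape; seed=42 makes the dropped neurons
# reproducible.
model_2.add(Dropout(0.2, seed=42, input_shape=(TOTAL_INPUTS,)))

model_2.add(Dense(128, activation='relu', name='m2_hidden_1'))
model_2.add(Dense(64, activation='relu', name='m2_hidden_2'))
model_2.add(Dense(16, activation='relu', name='m2_hidden_3'))
model_2.add(Dense(10, activation='softmax', name='m2_output'))

# Assumed compile settings, matching what model 1 plausibly used.
model_2.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```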
Now we can come down here, copy the cell, paste it, change model 1 to model 2, change the name for TensorBoard from 'model 1' to 'model 2', and then finally fit our model and see how it behaves compared to our model without the dropout.

So let's try it out; I'm very curious already. Let's hit Shift+Enter on the cell, switch over to TensorBoard, and play the waiting game. If I scroll down here in the bottom left corner, I can see model 2 at 16:08. So here we go. We can see that the loss is coming down; this is still at epoch 14. As the model continues to run, the validation loss continues to decrease, then it has a bit of a wobble here, and then it's slowly ticking up again by around epoch 130. So what we're seeing here is that the dropout does indeed reduce overfitting, but it will not eliminate it completely. We have to use some combination of early stopping, that is, stopping the training of our model at a certain point, and the dropout technique to get a better result.

Now let's take a look at our validation accuracy. One thing that's quite interesting is that we actually haven't lost much accuracy: model 2 with the dropout and model 1 without the dropout are both around 30 percent accurate by the end of the training. One thing that is very noticeable is that the dropout model tends to learn a bit more slowly. It learns the training dataset more slowly than the model without the dropout, and this is exactly what we intended. By the way, if there are too many graphs on the screen, you can toggle them down here on the left-hand side. Suppose I just want to see model 2: then I can hit this radio button here. And if I want to do a little horse race with model 1 and model 2, I can tick the individual boxes here to get them side by side on the chart, like so.

Remember how I said that we could apply dropout to the input layer as well as to a hidden layer? Well, I've now got a challenge for you. What I'd like you to do is create a third model. Call it model_3, and give it two dropout layers. The first dropout layer should be the same as in model 2, so it should be applied to the inputs, but the second dropout layer should be added after the first hidden layer
and have a dropout rate of 25 percent. I'll give you a few seconds to pause the video and give this a go.

Ready? Here's the solution. I'm just going to take this cell, copy it, paste it, move it down, and change all the references from model_2 to model_3. After I've done that, I'll add the second dropout layer here: model_3.add, then Dropout with 0.25. Now, you're free to use the same seed as before, but in this case you don't actually have to add anything afterwards, because Keras is smart enough to infer the number of inputs for this dropout layer, since it is not the first layer. And that's it; it's all done.

Now we're going to get to the very interesting part: we're actually going to train our models on our entire training dataset, and no longer on the extra small training dataset that we've had before. We're going to see what a difference it makes when we move from a training dataset of 1,000 samples to 40,000 samples. So check it out. I'm going to copy this cell and paste it below. I'm going to take this one here and comment it out; I'll keep it around for reference. But I'm going to take model 1, and we're going to reduce the number of epochs we train this model for: we're going to go down from 150 to 100. Then we'll change our training dataset to x_train and y_train, and we're going to call this run 'model 1 XL', so that we know we're dealing with the large dataset. I'll copy the cell and paste it again, and here we're going to change this to model 2, and 'model 2' right here. Then copy that, paste it again, change this to model 3, and change the model name to 'model 3' as well.
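(Again for reference, here is a hedged sketch of the model_3 cell and one of those full-dataset training cells. x_train, y_train, x_val, y_val and the log-directory naming are placeholder names for whatever the notebook actually uses, and the compile settings are the same assumptions as before.)

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import TensorBoard

TOTAL_INPUTS = 32 * 32 * 3  # same assumption as in the model_2 sketch

# model_3: dropout on the inputs AND after the first hidden layer.
model_3 = Sequential()
model_3.add(Dropout(0.2, seed=42, input_shape=(TOTAL_INPUTS,)))
model_3.add(Dense(128, activation='relu', name='m3_hidden_1'))
# 25% dropout after the first hidden layer. No input shape is needed:
# Keras infers it, because this is not the first layer.
model_3.add(Dropout(0.25, seed=42))
model_3.add(Dense(64, activation='relu', name='m3_hidden_2'))
model_3.add(Dense(16, activation='relu', name='m3_hidden_3'))
model_3.add(Dense(10, activation='softmax', name='m3_output'))
model_3.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# One of the three full-dataset training cells (the other two are the
# same with model_1 / model_2). An EarlyStopping callback could be
# appended to the callbacks list, as discussed earlier.
model_3.fit(x_train, y_train,
            epochs=100,
            validation_data=(x_val, y_val),
            callbacks=[TensorBoard(log_dir='tensorboard_logs/model_3_XL')])
```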
This last cell here I'm going to comment out, and I'll move this up slightly, so now I should be left with these three cells. Very quickly, I'm going to come up here and recompile models 1 and 2 before I run my code, so that I know they're all going to be starting from scratch. The last thing I want to do is go into TensorBoard; here I'm actually only going to keep these two lines around, and I basically want to delete all the other runs that we've had. The way I'm going to do this is by deleting the log files directly. The two runs that I want to keep around are model 1 at 11:56 and model 2 at 16:08; the other ones I'm no longer that interested in, since they're all legacy runs. Then I'm going to come back into Jupyter, and I'm going to run this cell, run this cell, and run this cell.

Now my computer is really working hard; it's crunching a lot of numbers. We can go into TensorBoard and refresh the entire thing so that it will redraw the charts, and we can watch, but I think it's actually much more advisable to get a coffee at this point, because this could really take a while. One of the things, though, that's going to be obvious almost immediately is the massive improvement in the validation accuracy that you're going to get from your model as soon as you add more data. When we only had 1,000 training examples, we were doing very, very poorly on the validation dataset, and the number of epochs that we used to train our model wasn't able to make that much of a difference. But look at what happens with version one of our model when we go from 1,000 examples to 40,000 examples. This is really why people say that data is the new oil: the more data, the better.

All right, now all three versions of my model have finished training, and you can see that the first one took four minutes, the next one took about six minutes, and the last one took about eight minutes. Model 2 had dropout on the input layer, and model 3 had dropout on both the input layer and the first hidden layer. Let's compare how these models did in TensorBoard. The first thing that we're looking at is the accuracy on the training dataset, and here what we see is actually quite interesting. The red and orange lines are the models on the small training dataset, and the light blue, pink, and dark red lines are on the larger training dataset. What we see is that model 3, with the most dropout, learned the training data the most slowly.
This is especially obvious if you look at what was happening from epoch 50 onwards: that pink line is far below the other two. I think model 2 got off to a very strange start. Initially it seemed not to learn very much at all, and then the accuracy improved dramatically within a few epochs. I suspect this might be due to the weight initialization at the start; there is some randomness to the starting point of these algorithms' weights. So if we were to fit this model again, we might get a slightly different line. But nonetheless, we see that models with more dropout learn more slowly.

The next chart in TensorBoard shows us the training loss. Here we see that for every single run we've done, all of the models reduce the training loss the more they train. This is expected, and also not that interesting. So let's go down to the validation accuracy. This is much more interesting, because here we see the dramatic difference between the models that were trained on the small dataset of a thousand samples and the models that were trained on the larger dataset of 40,000 samples. Mind you, this is the validation accuracy on the 10,000 samples in the validation dataset. What we see is that models 1, 2, and 3 all end up between 47 and 49 percent accuracy on the validation dataset. But what we also see is that the accuracy didn't really change much towards the end, and this can indicate overtraining and overfitting.

Let's scroll down to the validation loss and see if this is true. What we see here is that models 2 and 3, which have dropout, have a validation loss that continues to decrease, although very, very slowly, almost right up to the end. Model 1, on the other hand, has a validation loss that's pretty much static after epoch 50, and this suggests to me that we should probably stop training model 1 a little earlier.

So where does this leave us? Well, there's a couple of things that we found out. One is that the amount of data makes a massive difference: increasing the amount of data in the training dataset can help reduce overfitting dramatically, and it will also boost the accuracy on data that the model hasn't seen before.
The second thing that we've seen is that early stopping is a useful technique for preventing overfitting; training our models for more and more epochs is not necessarily a good thing. The third thing that we've seen is that dropout is a useful technique for reducing overfitting: models that implement the dropout technique learn a little more slowly, but their validation errors creep up much later than those without dropout. And the fourth thing that we saw is a little more about the magnitude of these impacts: judging by the validation accuracy, increasing the amount of data forty-fold had a much bigger impact than adding one or two dropout layers to the structure of our model.

At the moment, our models are getting approximately 50 percent of the classifications correct. Now, if this was a spam filter, that would be terrible, because in that case an email is either spam or not spam. But in this case we're actually differentiating between 10 different things, and getting 50 percent of those classifications correct doesn't seem such a bad thing, especially since this is our first go. In the next lessons, we're going to be evaluating these models a little more closely: we're going to take a hard look at the predictions these models make on individual images, and we're going to evaluate how our favorite model fares on the test dataset. For all of that and more, I'll see you in the next lesson. Take care.